ABC-CNN Reading Notes

Preface

These notes cover the 2015 paper ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering.

Problem Background

Visual Question Answering (VQA) is a learning task that spans computer vision and natural language processing. The task is defined as follows: "A VQA system takes as input an image and a free-form, open-ended, natural-language question about the image and produces a natural-language answer as the output." Put simply, VQA means answering questions about a given image.

Main Text

These notes deliberately distinguish between the terms image feature and visual feature.

Abstract Note

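The abstract's main point is that a common approach to VQA is to strengthen attention on the image regions relevant to the question (question-guided attention). ABC-CNN proposes an attention based configurable convolutional neural network to extract the attended regions of the input image.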

ABC-CNN determines an attention map for an image-question pair by convolving the image feature map with configurable convolutional kernels derived from the question’s semantics.

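That is, the configurable convolutional kernel k of ABC-CNN is driven and configured by the semantics of the input question (question's semantics). An image feature map is extracted from the input image and convolved with the kernel k, and the result of this convolution gives the attention map for the input image-question pair.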

Introduction Note

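The main points raised in this section are:

ABC-CNN is used as a unified framework that integrates the visual and semantic information in the VQA problem.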

use ABC-CNN as a unified framework to integrate the visual and semantic information for VQA.
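The whole ABC-CNN approach is question-driven: given the input question, an attention model locates the relevant regions of the input image, extracts features from them, and uses those features for further learning.
ABC-CNN consists of four parts: vision part, question understanding part, answer generation part, attention extraction part/vision and question understanding part. Their characteristics are:
- vision part: uses a CNN to extract visual features from the image, producing a spatial feature map rather than a single global visual feature
- question understanding part: uses an LSTM model [1] to obtain the question embeddings [2]
- answer generation part: uses a simple multi-class classifier to generate the answer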
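The core of the ABC-CNN model is to express the extracted question-guided attention information as a question-guided attention map. This is done by projecting the question embedding of the input question from the semantic space into the visual space, using the projected embedding to configure the convolutional kernel k, and convolving k with the image feature map to obtain the question-guided attention map.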

It (question-guided attention map) is achieved via a configurable convolutional neural network, where the convolutional kernels are generated by projecting the question embeddings from the semantic space into the visual space.
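ABC-CNN assumes that the resulting kernel k is correlated or consistent with the visual information determined by the question's semantics.
The question-guided attention map reflects how important each region of the image is to the target answer. It can be used to spatially weight the regions of the image, filtering out regions and noise that are irrelevant to the question.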

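Some personal questions about this part

How are the question embeddings obtained? Since my understanding of NLP is limited, I am unsure how the question embedding yields the mapping of the question from the semantic space to the visual feature space.
I am also unsure whether the resulting kernel k can effectively represent the visual features of a specific object.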

framework image

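This part mainly introduces the background of the VQA and image captioning problems, the main idea of attention models, and configurable convolutional neural networks.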

Attention models

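This part describes how attention models extract features from an image: an RNN extracts a series of proposal regions from the input image, and the required attention weights are learned from the hidden states output by the decoding LSTM and the visual features extracted from the proposal regions.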

Attention Based Configurable CNN Note

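The core of the ABC-CNN model is the attention extraction part: the convolutional kernel k is applied to the image features to produce the question-guided attention map, where k represents the visual features that the question asks about.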

Attention Extraction

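Convolutional kernel k
$$k = \sigma(W_{sk}s + b_k), \quad \sigma(x) = \frac{1}{1 + e^{-x}}$$
where $\sigma$ is the sigmoid function and s is the semantic feature information of the corresponding object in the question
Question-guided attention map m
$$m_{ij} = P(ATT_{ij}|I,s) = \frac{e^{z_{ij}}}{\sum_i\sum_j e^{z_{ij}}}, \quad z = k * I$$
where I is the image features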
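
The two formulas above can be sketched in NumPy as follows. This is only a minimal illustration, not the paper's implementation: the kernel's spatial size `ksize`, the zero padding that keeps the output at N×N, the sliding-window (cross-correlation) convention, and the random weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_map(I, s, W_sk, b_k, ksize=3):
    """I: (N, N, D) image feature map; s: (d_s,) question embedding."""
    N, _, D = I.shape
    # k = sigmoid(W_sk s + b_k), reshaped into a ksize x ksize x D kernel (assumed layout)
    k = sigmoid(W_sk @ s + b_k).reshape(ksize, ksize, D)
    # Zero-pad the spatial dimensions so z keeps the N x N size (assumption)
    pad = ksize // 2
    I_pad = np.pad(I, ((pad, pad), (pad, pad), (0, 0)))
    z = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            # z = k * I, computed as a sliding-window product summed over the window
            z[i, j] = np.sum(I_pad[i:i + ksize, j:j + ksize, :] * k)
    # Softmax over all spatial positions gives m_ij
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
N, D, d_s, ksize = 4, 8, 16, 3
I = rng.standard_normal((N, N, D))
s = rng.standard_normal(d_s)
W_sk = rng.standard_normal((ksize * ksize * D, d_s)) * 0.1
b_k = np.zeros(ksize * ksize * D)
m = attention_map(I, s, W_sk, b_k, ksize)
```

Because of the softmax, `m` is a probability distribution over the N×N grid cells, which is what lets it act as a spatial weighting later on.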

Question Understanding

LSTM for query processing

$$i_t = \sigma(W_{vi}v_t + W_{hi}h_{t-1} + b_i)$$

$$f_t = \sigma(W_{vf}v_t + W_{hf}h_{t-1} + b_f)$$

$$o_t = \sigma(W_{vo}v_t + W_{ho}h_{t-1} + b_o)$$

$$g_t = \phi(W_{vg}v_t + W_{hg}h_{t-1} + b_g)$$

$$c_t = f_t\odot c_{t-1} + i_t\odot g_t$$

$$h_t = o_t \odot \phi(c_t)$$

where

$\phi$ is the hyperbolic tangent function
The input to the LSTM is the question q
The linguistic information s of question q is learned from the LSTM output h
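
The LSTM update above can be written out as a short NumPy sketch. It is a sketch under assumptions: the four weight matrices are stacked into single `W_v` and `W_h` arrays (in the order i, f, o, g) for brevity, the word vectors are random, and the final hidden state is simply used as a stand-in for s, whereas the notes only say that s is learned from h.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(v_t, h_prev, c_prev, W_v, W_h, b):
    """One LSTM step; W_v, W_h, b stack the i, f, o, g parameters (assumed layout)."""
    H = h_prev.shape[0]
    a = W_v @ v_t + W_h @ h_prev + b      # all four pre-activations at once
    i_t = sigmoid(a[0:H])                 # input gate
    f_t = sigmoid(a[H:2 * H])             # forget gate
    o_t = sigmoid(a[2 * H:3 * H])         # output gate
    g_t = np.tanh(a[3 * H:4 * H])         # candidate cell values, phi = tanh
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def encode_question(word_vectors, W_v, W_h, b, hidden):
    """Run the LSTM over the question and return the last hidden state."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for v_t in word_vectors:
        h, c = lstm_step(v_t, h, c, W_v, W_h, b)
    return h

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
d_v, hidden, n_words = 5, 6, 7
W_v = rng.standard_normal((4 * hidden, d_v)) * 0.1
W_h = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
question = rng.standard_normal((n_words, d_v))
h_last = encode_question(question, W_v, W_h, b, hidden)
```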

Attached here is another form of data-flow diagram reflecting my own understanding of this part
question understanding

Image Feature Extraction

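This part mainly explains how the input image is processed to obtain the image features

The input $W \times H \times D$ image is divided into an $N \times N$ grid, and features are extracted from each grid cell to give the final $N \times N \times D$ image feature map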

image feature extraction
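
To make the shapes concrete, here is a toy NumPy sketch of the grid split. It is not the paper's feature extractor: per-cell average pooling stands in for the CNN features, and the helper name `grid_features` and the requirement that the image size be divisible by N are assumptions of the example.

```python
import numpy as np

def grid_features(image, N):
    """Split an (H, W, D) array into an N x N grid and average each cell.

    Average pooling is only a stand-in for a real CNN feature extractor.
    """
    H, W, D = image.shape
    if H % N != 0 or W % N != 0:
        raise ValueError("H and W must be divisible by N in this sketch")
    ch, cw = H // N, W // N
    # Axes after reshape: (grid row, row in cell, grid col, col in cell, channel)
    cells = image.reshape(N, ch, N, cw, D)
    return cells.mean(axis=(1, 3))

# Hypothetical input: an 8 x 8 image with 3 channels, split into a 4 x 4 grid
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))
feature_map = grid_features(image, 4)   # shape (4, 4, 3)
```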

Answer Generation

answer generation

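The multi-class classifier in this part is based on three inputs:

the original image feature map I
the dense question embedding s
the attention weighted feature map I'

To avoid overfitting, this part also applies a 1×1 convolution to reduce the number of channels of I', producing $I_r$, which is used to compute the final answer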

Relevant formulas

$$I'_i = I_i \odot m$$

$$h = g(W_{ih}I + W_{rh}I_r + W_{sh}s + b_h)$$

where

$m$ is the question-guided attention map
$I'$ is the resulting attention weighted feature map
$g(.)$ is the element-wise scaled hyperbolic tangent function: $g(x) = 1.7159\cdot \tanh(\frac{2}{3}x)$
h is the final projected feature extracted for the question-image pair
The multi-class classifier uses softmax to compute its output
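
Putting these formulas together, the answer generation step might look like the NumPy sketch below. It rests on several assumptions: the feature maps are flattened before the linear maps, the 1×1 convolution is written as a per-location matrix product, and the names `W_1x1`, `W_out`, `b_out` and all sizes are hypothetical and not taken from the paper.

```python
import numpy as np

def scaled_tanh(x):
    # g(x) = 1.7159 * tanh(2x / 3), applied element-wise
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_distribution(I, m, s, p):
    """I: (N, N, D) feature map; m: (N, N) attention map; s: question embedding."""
    I_att = I * m[:, :, None]        # I'_i = I_i ⊙ m, applied to every channel i
    I_r = I_att @ p["W_1x1"]         # 1x1 convolution: (N, N, D) -> (N, N, D_r)
    h = scaled_tanh(p["W_ih"] @ I.ravel()
                    + p["W_rh"] @ I_r.ravel()
                    + p["W_sh"] @ s
                    + p["b_h"])
    # Softmax classifier over a fixed answer vocabulary (hypothetical output layer)
    return softmax(p["W_out"] @ h + p["b_out"])

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
N, D, D_r, d_s, d_h, n_answers = 4, 8, 2, 16, 10, 5
p = {
    "W_1x1": rng.standard_normal((D, D_r)) * 0.1,
    "W_ih": rng.standard_normal((d_h, N * N * D)) * 0.1,
    "W_rh": rng.standard_normal((d_h, N * N * D_r)) * 0.1,
    "W_sh": rng.standard_normal((d_h, d_s)) * 0.1,
    "b_h": np.zeros(d_h),
    "W_out": rng.standard_normal((n_answers, d_h)) * 0.1,
    "b_out": np.zeros(n_answers),
}
I = rng.standard_normal((N, N, D))
s = rng.standard_normal(d_s)
m = np.full((N, N), 1.0 / (N * N))   # uniform attention as a placeholder
probs = answer_distribution(I, m, s, p)
predicted_answer = int(np.argmax(probs))
```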

References

Closing Remarks

I am new to deep learning; if you find any mistakes in these notes, corrections are welcome.
