INTRODUCTION

Applying DCNNs to semantic segmentation raises three challenges:
(1) reduced feature resolution, (2) existence of objects at multiple scales, and (3) reduced localization accuracy due to DCNN invariance. Next, we discuss these challenges and our approach to overcome them in our proposed DeepLab system.

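The first challenge is that repeated max-pooling and downsampling (striding) reduce feature resolution.
We remove the downsampling operator from the last few max-pooling layers of the DCNN
and use atrous convolution, rather than upsampling (deconvolutional) layers, to compute denser feature maps.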

The second challenge is that objects appear at multiple scales.
Drawing on spatial pyramid pooling, we propose "atrous spatial pyramid pooling" (ASPP).

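The third challenge is that the translation invariance of DCNNs, which helps classification, works against semantic segmentation.
We address this with a fully connected CRF.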

The overall DeepLab pipeline is shown in the figure.

Advantages of DeepLab
(1) Speed: by virtue of atrous convolution, our dense DCNN operates at 8 FPS on an NVidia Titan X GPU, while Mean Field Inference for the fully-connected CRF requires 0.5 secs on a CPU.
(2) Accuracy: we obtain state-of-art results on several challenging datasets, including the PASCAL VOC 2012 semantic segmentation benchmark [34], PASCAL-Context [35], PASCAL-Person-Part [36], and Cityscapes [37].
(3) Simplicity: our system is composed of a cascade of two very well-established modules, DCNNs and CRFs.

RELATED WORK

End-to-end training
[65] unroll the CRF mean-field inference steps to convert the whole system into an end-to-end trainable feed-forward network.

In other words, the CRF can be folded into the DNN so that the whole system is trained end-to-end.

Weaker supervision

Atrous convolution

Figure: illustration of atrous convolution.

METHOD

Atrous Convolution for Dense Feature Extraction and Field-of-View Enlargement

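For the one-dimensional case, atrous convolution of an input signal x[i] with a filter w[k] of length K is defined as

$$y[i] = \sum_{k=1}^{K} x[i + r \cdot k]\, w[k]$$

where r is the rate, i.e. the stride with which the input signal is sampled. Standard convolution is the special case r = 1.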
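A minimal sketch of the one-dimensional case, y[i] = sum over k of x[i + r*k] * w[k], may help make the role of the rate concrete. The function name `atrous_conv1d`, the 0-based filter index, and the choice to keep only positions where every tap falls inside the signal are assumptions of this sketch, not details taken from the paper.

```python
def atrous_conv1d(x, w, rate=1):
    """Return y[i] = sum_k x[i + rate * k] * w[k] for every fully valid position i.

    k runs over 0 .. len(w) - 1 (0-based indexing is a choice of this sketch).
    """
    span = rate * (len(w) - 1)  # distance between the first and last sampled input
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = atrous_conv1d(signal, kernel, rate=1)    # taps read x[i], x[i+1], x[i+2]
dilated = atrous_conv1d(signal, kernel, rate=2)  # taps read x[i], x[i+2], x[i+4]
```

With the same three filter weights, rate=2 covers five input samples instead of three, which is how atrous convolution enlarges the field of view without adding parameters.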

For 2-D images, the effect is illustrated below.

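We adopt a hybrid upsampling approach: atrous convolution gives a 4x denser feature map, and bilinear interpolation supplies the remaining 8x.
Bilinear interpolation is adequate here because the score maps are smooth, as shown in Fig. 5.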

We have adopted instead a hybrid approach that strikes a good efficiency/accuracy trade-off, using atrous convolution to increase by a factor of 4 the density of computed feature maps, followed by fast bilinear interpolation by an additional factor of 8 to recover feature maps at the original image resolution.
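The final 8x bilinear step can be sketched in plain Python as follows. This is an illustration only: the function name `bilinear_upsample`, the pixel-center coordinate mapping, and the clamping at the borders are assumptions of the sketch, since the notes do not say which interpolation convention DeepLab uses.

```python
def bilinear_upsample(score, factor):
    """Upsample a 2-D list of floats by an integer factor with bilinear interpolation."""
    h, w = len(score), len(score[0])
    out = []
    for oy in range(h * factor):
        # Map the output pixel center back to input coordinates, clamped to the grid.
        sy = min(max((oy + 0.5) / factor - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for ox in range(w * factor):
            sx = min(max((ox + 0.5) / factor - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = score[y0][x0] * (1 - fx) + score[y0][x1] * fx
            bottom = score[y1][x0] * (1 - fx) + score[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


coarse = [[0.0, 1.0], [1.0, 2.0]]   # toy 2x2 score map for one class
full = bilinear_upsample(coarse, 8)  # 16x16 map at the "original" resolution
```

Because interpolation only blends neighboring scores, it cannot recover detail that the coarse map lacks, which is why it is paired with atrous convolution rather than used for the full upsampling factor.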

Multiscale Image Representations using Atrous Spatial Pyramid Pooling

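To handle scale variation, we propose the structure below,
which resamples the same feature map with atrous convolutions at multiple rates.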
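The idea of probing one feature map at several rates in parallel can be sketched in one dimension. Everything concrete here is an assumption made for illustration: the helper names, the zero padding, the toy rates 1, 2 and 3, the all-ones filters, and fusing the branches by summation.

```python
def atrous_conv1d_same(x, w, rate):
    """Centered atrous convolution with zero padding; output length equals input length.

    Assumes an odd-length filter w.
    """
    half = len(w) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k in range(len(w)):
            j = i + rate * (k - half)  # input index sampled by tap k
            if 0 <= j < len(x):        # taps outside the signal read zeros
                acc += x[j] * w[k]
        out.append(acc)
    return out


def aspp_1d(x, filters_by_rate):
    """Run one atrous branch per rate on the same input and sum the branch outputs."""
    fused = [0.0] * len(x)
    for rate, w in filters_by_rate.items():
        branch = atrous_conv1d_same(x, w, rate)
        fused = [f + b for f, b in zip(fused, branch)]
    return fused


features = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # a single impulse at index 3
branches = {1: [1.0, 1.0, 1.0], 2: [1.0, 1.0, 1.0], 3: [1.0, 1.0, 1.0]}
fused = aspp_1d(features, branches)
```

On the impulse input, each branch responds at a different distance from the center, so the fused output sees the same feature at several effective fields of view.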

Structured Prediction with Fully-Connected Conditional Random Fields for Accurate Boundary Recovery

A trade-off between localization accuracy and classification performance seems to be inherent in DCNNs.
Deeper models with more max-pooling layers are better for classification,
but the resulting invariance and the large receptive fields of top-level nodes can only yield smooth responses.

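Traditionally, conditional random fields (CRFs) have been employed to smooth noisy segmentation maps [23], [31].
Our goal, by contrast, is to recover detailed local structure rather than to smooth it further,
so we use a fully connected CRF as a post-processing step.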

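The energy function is

$$E(x) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j)$$

where x is the label assignment for pixels. We use as unary potential $\theta_i(x_i) = -\log P(x_i)$, where $P(x_i)$ is the label assignment probability at pixel i as computed by a DCNN.

The pairwise potential is

$$\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[ w_1 \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\right) + w_2 \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2}\right) \right]$$

where $\mu(x_i, x_j) = 1$ if $x_i \neq x_j$, and zero otherwise, which, as in the Potts model, means that only nodes with distinct labels are penalized. The remaining expression uses two Gaussian kernels in different feature spaces, weighted by $w_1$ and $w_2$; the first, 'bilateral' kernel depends on both pixel positions (denoted as p) and RGB color (denoted as I), and the second kernel only depends on pixel positions. The hyperparameters $\sigma_\alpha$, $\sigma_\beta$ and $\sigma_\gamma$ control the scale of the Gaussian kernels.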

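For a pair of pixels assigned different labels, a penalty is applied that depends on their distance in position and in RGB color.
Because the kernels are Gaussian, the penalty is largest for pixels that are close together and similar in color, and it decays as the distance grows.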
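The two potentials described above can be written out directly. The sketch below evaluates the unary term -log P(x_i) and the pairwise term built from the bilateral and position-only Gaussian kernels for a single pair of pixels. The function names and the default values for w1, w2 and the three sigmas are placeholders chosen for illustration, not the settings used in the paper.

```python
import math


def unary_potential(prob):
    """Unary term: negative log of the DCNN probability for the assigned label."""
    return -math.log(prob)


def pairwise_potential(label_i, label_j, pos_i, pos_j, rgb_i, rgb_j,
                       w1=1.0, w2=1.0,
                       sigma_alpha=10.0, sigma_beta=10.0, sigma_gamma=3.0):
    """Pairwise term for one pixel pair; all default hyperparameters are placeholders."""
    if label_i == label_j:
        return 0.0  # Potts compatibility: identical labels are never penalized
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))  # squared position distance
    d_rgb = sum((a - b) ** 2 for a, b in zip(rgb_i, rgb_j))  # squared color distance
    bilateral = w1 * math.exp(-d_pos / (2 * sigma_alpha ** 2)
                              - d_rgb / (2 * sigma_beta ** 2))
    smoothness = w2 * math.exp(-d_pos / (2 * sigma_gamma ** 2))
    return bilateral + smoothness


# Two differently labeled pixels: adjacent and similar in color vs. distant and dissimilar.
near = pairwise_potential(0, 1, (0, 0), (0, 1), (10, 10, 10), (12, 10, 10))
far = pairwise_potential(0, 1, (0, 0), (0, 50), (10, 10, 10), (200, 10, 10))
same = pairwise_potential(1, 1, (0, 0), (0, 1), (10, 10, 10), (12, 10, 10))
```

Here `near` is much larger than `far`, so disagreeing labels are discouraged mainly between neighboring pixels of similar color, which is what pushes label boundaries toward image edges.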

Figure: qualitative results.