Introduction

Title: Structure Correcting Adversarial Network for Organ Segmentation in Chest X-rays
Authors: Wei Dai, Joseph Doyle, Xiaodan Liang, Hao Zhang, Nanqing Dong, Yuan Li, Eric P. Xing (Petuum Inc.)
arXiv: https://arxiv.org/abs/1703.08770

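This paper addresses semantic segmentation of the lungs and the heart. The authors argue that organ segmentation is an important step toward building computer-aided diagnosis systems for chest X-rays (CXR), because organ regions carry rich structural information that can be used to diagnose many conditions. CXR is very common thanks to its low radiation dose and low cost, which creates a heavy workload for radiologists, so the research has practical value. The task is also challenging: CXRs are 2-D gray-scale images, and the public datasets are small (mostly only a few hundred images), so network models trained on large-scale datasets cannot be applied directly. The authors therefore propose the SCAN framework. Following the idea of GANs (generative adversarial networks), it contains a segmentation network and a critic network that play a zero-sum game and are trained alternately, one at a time, on the public JSRT and Montgomery datasets. Both are complex neural networks built from FCN, VGG-based (modified VGG) and residual-block components. The model depends little on data (no large-scale dataset is needed) and has few parameters, yet it reaches high accuracy (on par with human experts), high efficiency (<1s) and strong transferability (generalization), surpassing the registration-based approach that was the state of the art in this area.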

Keyword

Adversarial Network
- critic network
- segmentation network
Organ Segmentation
Chest X-rays (CXR)
Structure Correcting (the critic network captures global structure information)
FCN + GAN

Main Work

Propose the Structure Correcting Adversarial Network (SCAN) to segment lung fields and the heart in CXR images
- a critic network: learns to discriminate the ground truth organ annotations from the masks synthesized by the segmentation network during training; it learns higher order regularities and effectively transfers this global information back to the segmentation model to achieve realistic segmentation outcomes (the critic helps capture high-level structural information, whereas the segmentation model alone suffers from the shortage of training samples)
- segmentation model: convolutional network
- end-to-end
The model produces highly accurate and natural segmentation.
- 94.7% IoU for lung fields (human experts: 94.6%)
- 86.6% IoU for heart fields (human experts: 87.8%)
- Surpasses the current state-of-the-art
Using only the very limited training data available, the model reaches human-level performance without relying on any existing trained model or dataset.
- SCAN model is more robust when applied to a new, unseen dataset, outperforming the vanilla segmentation model by 4.3%

Background

Motivation

Chest X-ray (CXR) often has over 2-10x more scans than other imaging modalities such as MRI, CT and PET scans, due to its low cost and low dose of radiation. This places a significant workload on radiologists and medical practitioners.
- In 2015/16, in the UK's public medical sector: 22.5M X-ray images (8M CXR), 4.5M CT, 3.1M MRI
- Shortage of radiologists worldwide
Organ segmentation is a crucial step toward effective computer-aided detection on CXR.
- The segmentation of the lung fields and the heart provides rich structure information about shape irregularities and size measurements that can be used to directly assess certain serious clinical conditions, such as cardiomegaly, pneumothorax, pleural effusion and emphysema.
- Explicit lung region masks can improve interpretability of computer-aided detection, which is important for clinical use.

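CXR is cheap and involves little radiation, but this also means a large number of scans and a heavy workload for radiologists. Organ segmentation on CXR is an important step in building a computer-aided diagnosis system, since organ structure information can reveal many conditions. The research therefore has practical value.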

Challenge

X-rays have low resolution and are 2-D projections, compared with more modern medical imaging technologies such as CT and PET scans.
Very limited CXR training data with pixel-level annotations, due to the expense of annotation
CXRs exhibit substantial variations across different patient populations, pathological conditions, imaging technology and operation
CXR images are gray-scale and drastically different from natural images (so existing models transfer poorly)
How to incorporate the implicit medical knowledge involved in contour determination
- medical experts look for certain consistent structures surrounding the lung fields while annotating them, such as the aortic arch and the cardiodiaphragmatic angles.
- Therefore, a successful segmentation model must effectively leverage global structural information to resolve the local details.
High contrast between the rib cage and the lung fields.

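Semantic segmentation of organs in CXR images is difficult. CXR images lack color information, so ImageNet pre-trained models cannot be used directly; compared with 3-D modalities such as MRI and CT, a CXR is only a 2-D image and carries less information. The public datasets are small and vary considerably between samples, so training existing network models on them easily leads to problems such as overfitting. A framework is therefore needed that combines global and local structural information and overcomes these difficulties.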

Lung Field Segmentation

Current state-of-the-art

Registration-based approach: to build a lung model for a test patient, it finds the patients in an existing database that are most similar to the test patient and performs linear deformation of their lung profiles based on key point matching (a comparison method built on key point matching).

Semantic Segmentation with Convolutional Networks

Aims to assign a pre-defined class to each pixel

Current state-of-the-art

Fully convolutional network (FCN)
Improvement: Semantic segmentation using adversarial networks

We note that there is a growing body of recent works that apply neural networks end-to-end on CXR images [25, 34]. These models directly output clinical targets such as disease labels without well-defined intermediate outputs to aid interpretability. Furthermore, they generally require a large number of CXR images for training, which is not readily available for many clinical tasks involving CXR images.

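Shortcomings of these works: they output labels directly, without intermediate outputs to aid interpretation, and they need a large amount of training data.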

Problem Definition

CXRs in the posteroanterior (PA, back to front) view
Lung fields definition: lung fields consist of all the pixels for which radiation passes through the lung but not through the following structures: the heart, the mediastinum (the opaque region between the two lungs), below the diaphragm, the aorta, and, if visible, the superior vena cava.
The heart boundary is generally visible on two sides, while the top and bottom borders of the heart have to be inferred due to occlusion by the mediastinum.

This section fixes the PA view as the imaging direction and defines the lung and heart regions.

Structure Correcting Adversarial Network (SCAN)

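The authors adapt FCNs to gray-scale CXR images under the stringent constraint of a very limited training dataset of 247 images. The resulting network departs from the usual VGG architecture and can be trained without transfer learning from existing models or datasets.

The proposed method, SCAN, combines an FCN with an adversarial network; it needs only a small amount of training data and does not depend on existing models or databases.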

Adversarial Training for Semantic Segmentation

GAN

Adversarial training was first proposed in the Generative Adversarial Network (GAN)

a generator network: learns the data distribution
a critic network: estimates the probability that a sample comes from the training data instead of being synthesized by the generator
Adversarial process: The generator's objective is to maximize the probability that the critic makes a mistake, while the critic is optimized to minimize the chance of a mistake.
The critic, which itself can be a complex neural network, can learn to exploit higher order inconsistencies in the samples synthesized by the generator.

Use the critic to learn these higher order structures and guide the segmentation network to generate masks more consistent with the learned global structures.

Key: use the critic to learn higher order structural information and guide the segmentation network toward the global structure.

Training Objectives

Data

$S$: segmentation network
$D$: critic network
$x_i$: input image, shape $[H,W,1]$ for a single-channel gray-scale image with height $H$ and width $W$
$y_i$: the associated mask labels, shape $[H,W,C]$ where $C$ is the number of classes including the background.
- for each pixel location $(j,k)$, $y_i^{jkc}=1$ for the labeled class $c$ while the rest of the channels are zero ($y_i^{jkc'}=0$ for $c' \neq c$).
$S(x) \in [0, 1]^{H \times W \times C}$: the class probabilities predicted by $S$ at each pixel location, normalized so that they sum to 1 at each pixel.
$D(x_i, y)$: scalar probability estimate of $y$ coming from the training data (ground truth) $y_i$ instead of the predicted mask $S(x_i)$

Optimization problem

Eq.(1):
$$\begin{equation}
\min_S \max_D \lbrace J(S,D):=\sum_{i=1}^N J_s(S(x_i), y_i) - \lambda [J_d(D(x_i, y_i), 1) + J_d(D(x_i, S(x_i)),0)] \rbrace
\end{equation}$$

Fix $S$ and maximize $J(S,D)$ with respect to $D$ (the max subscript)
Fix $D$ and minimize $J(S,D)$ with respect to $S$ (the min subscript)
$J_s(\hat y, y) := \frac{1}{HW} \sum_{j,k} \sum_{c=1}^C -y^{jkc} \ln \hat y^{jkc}$: multi-class cross-entropy loss for predicted mask $\hat y$ averaged over all pixels.
$J_d(\hat t, t) := -\lbrace t\ln \hat t + (1-t) \ln(1-\hat t) \rbrace$: binary logistic loss for the critic's prediction
$\lambda$: tuning parameter balancing the pixel-wise loss and the adversarial loss
We can solve Eq.(1) by alternating between optimizing $S$ and optimizing $D$ using their respective loss functions (training method: the two networks are optimized separately, in alternating iterations).
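
As a rough illustration of the two loss terms in Eq.(1), the sketch below evaluates them on plain nested lists. The function names, the list-based tensor layout and the clipping constant `EPS` are assumptions made for this example, not part of the paper; the cross-entropy takes the log of the predicted probability, which is the standard form.

```python
import math

EPS = 1e-12  # assumed clipping constant to avoid log(0)


def pixelwise_cross_entropy(pred, target):
    """J_s: multi-class cross-entropy averaged over all H*W pixels.

    pred and target are nested lists of shape [H][W][C]; pred holds class
    probabilities per pixel and target is one-hot.
    """
    height, width = len(pred), len(pred[0])
    total = 0.0
    for j in range(height):
        for k in range(width):
            for p, y in zip(pred[j][k], target[j][k]):
                if y:  # only the labeled class contributes
                    total -= y * math.log(max(p, EPS))
    return total / (height * width)


def binary_logistic_loss(t_hat, t):
    """J_d: binary logistic loss for critic probability t_hat and label t."""
    t_hat = min(max(t_hat, EPS), 1.0 - EPS)
    return -(t * math.log(t_hat) + (1 - t) * math.log(1.0 - t_hat))


# Toy 1x2 "image" with two classes.
pred = [[[0.5, 0.5], [0.25, 0.75]]]
target = [[[1, 0], [0, 1]]]
print(pixelwise_cross_entropy(pred, target))
print(binary_logistic_loss(0.5, 1))
```

In a real implementation these would be tensor operations in a deep learning framework; the loops here only make the indices $j$, $k$, $c$ explicit.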

The objective above splits into the following two phases:

Training the Critic

Train the critic network by minimizing the following objective with respect to $D$ for a fixed $S$:
$$ \sum_{i=1}^N \left[ J_d(D(x_i, y_i), 1) + J_d(D(x_i, S(x_i)),0) \right] $$
Compared with Eq.(1), the minus sign is dropped, so the maximization over $D$ becomes a minimization.

Training the Segmentation Network

Given a fixed D, we train the segmentation network by minimizing the following objective with respect to $S$:
$$ \sum_{i=1}^N \left[ J_s(S(x_i),y_i) + \lambda J_d(D(x_i,S(x_i)),1) \right] $$

Eq.(1) would give the term $-\lambda J_d(D(x_i, S(x_i)),0)$ here. $J_d(D(x_i, S(x_i)),1)$ is used in place of $-J_d(D(x_i, S(x_i)),0)$, because the latter leads to weak gradient signals when $D$ makes accurate predictions.
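
The weak-gradient argument can be checked numerically. The sketch below assumes, for illustration only, that the critic output is a sigmoid of a scalar logit $z$, and compares the derivative of each candidate term with respect to $z$. When the critic is confident that a predicted mask is fake ($z$ very negative, $D \approx 0$), the term $-J_d(D,0)$ has a gradient close to 0, while $J_d(D,1)$ has a gradient of magnitude close to 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_saturating(z):
    """d/dz of -J_d(sigmoid(z), 0) = ln(1 - sigmoid(z)); equals -sigmoid(z)."""
    return -sigmoid(z)


def grad_non_saturating(z):
    """d/dz of J_d(sigmoid(z), 1) = -ln(sigmoid(z)); equals sigmoid(z) - 1."""
    return -(1.0 - sigmoid(z))


# z is the (assumed) critic logit for a predicted mask S(x).
for z in (-6.0, 0.0, 6.0):
    print(z, abs(grad_saturating(z)), abs(grad_non_saturating(z)))
```

Both terms push $D(x_i, S(x_i))$ upward; they differ only in how strong the signal is when the critic is winning.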

References

Understanding GANs: http://blog.csdn.net/on2way/article/details/72773771?locationNum=7&fps=1
minimax: https://en.wikipedia.org/wiki/Minimax
zero-sum game
minimizing the possible loss for a worst case (maximum loss) scenario

Segmentation Network

FCN

The down-sampling path, similar to an image classification network architecture
- convolutional layers
- max/average pooling layer
- VGG-based
- residual block architecture
The up-sampling path
- convolutional layers
- deconvolutional layers (transposed convolution)
Most FCNs are applied to color images with RGB channels, which this single-channel gray-scale model cannot use directly.
3 classes
- the left lung
- the right lung
- the heart
247 CXR images

Critic Network

input: 4 or 5(including input image) channels
segmentation network
global average pool
fully connected layer

Experiments

Dataset and Processing

Dataset

Use two publicly available datasets with at least lung field annotations.

JSRT

Released by Japanese Society of Radiological Technology (JSRT)
247 CXRs (154 have lung nodules and 93 have no lung nodule)
Resolution: $2048 \times 2048$
gray-scale with color depth of 12 bits.
represents mostly normal lung and heart masks (lung nodules in most cases do not alter the contour of the lungs and heart)

Montgomery

Department of Health and Human Services, Montgomery County, Maryland, USA
138 CXRs (80 normal patients and 58 patients with manifested tuberculosis (TB))
Resolution: $4020 \times 4892$ or $4892 \times 4020$
gray-scale with color depth of 12 bits.
Only the two lung masks annotations are available

Processing

scale all images to $400 \times 400$ pixels (with sufficient visual details for vascular structures)
- $800 \times 800$ does not improve the segmentation performance
image normalization: for a given image $x$
$$x^{jk} := \frac{x^{jk} - \hat x}{\sqrt{var(x)}}$$
- $\hat x$: mean of pixels in $x$
- $var(x)$: variance of pixels in $x$
- do not use statistics from the whole dataset (the mean and variance come from the single image)
post-processing: fill in any hole in the predicted mask, and remove the small patches disjoint from the largest mask
- PS: In practice, this is important for the prediction output of the segmentation network alone (FCN), but does not affect the evaluation results for FCN with adversarial training
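
The normalization and post-processing steps above can be sketched in plain Python as follows. This is only an illustration under assumptions: images and masks are nested lists, the mask is binary and handled one class at a time, components use 4-connectivity, the image is not constant (non-zero variance), and the helper names are made up for this example.

```python
from collections import deque


def normalize(image):
    """Per-image normalization: subtract the image mean, divide by its std."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = var ** 0.5
    return [[(p - mean) / std for p in row] for row in image]


def _region(mask, seeds, value):
    """Cells holding `value` that are reachable from seeds (4-connectivity)."""
    h, w = len(mask), len(mask[0])
    seen, queue = set(), deque()
    for cell in seeds:
        if mask[cell[0]][cell[1]] == value and cell not in seen:
            seen.add(cell)
            queue.append(cell)
    while queue:
        j, k = queue.popleft()
        for nj, nk in ((j + 1, k), (j - 1, k), (j, k + 1), (j, k - 1)):
            if (0 <= nj < h and 0 <= nk < w and (nj, nk) not in seen
                    and mask[nj][nk] == value):
                seen.add((nj, nk))
                queue.append((nj, nk))
    return seen


def postprocess(mask):
    """Keep the largest foreground component, then fill its holes."""
    h, w = len(mask), len(mask[0])
    remaining = {(j, k) for j in range(h) for k in range(w) if mask[j][k] == 1}
    largest = set()
    while remaining:
        comp = _region(mask, [next(iter(remaining))], 1)
        remaining -= comp
        if len(comp) > len(largest):
            largest = comp
    kept = [[1 if (j, k) in largest else 0 for k in range(w)] for j in range(h)]
    # Background reachable from the image border is true background;
    # any other background pixel is a hole and gets filled.
    border = [(j, k) for j in range(h) for k in range(w)
              if j in (0, h - 1) or k in (0, w - 1)]
    outside = _region(kept, border, 0)
    return [[0 if (j, k) in outside else 1 for k in range(w)] for j in range(h)]
```

A production version would more likely rely on an image-processing library; the explicit flood fill is used here only to keep the example dependency-free.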

Training Protocols

GANs are unstable during the training process
pre-train the segmentation network using only the pixel-wise loss $J_s$
- faster
- do not train critic network
Adam optimizer with learning rate 0.0002 to train all models for 350 epochs
mini-batch size : 10
with critic network: perform 5 optimization steps on the segmentation network for each optimization step on the critic network
evaluation: IoU (Intersection-over-Union)
- $P$: the set of pixels in the predicted segmentation mask for a class
- $G$: the set of pixels in the ground truth mask for a class
- $IoU=\frac{|P \cap G|}{|P \cup G|} = \frac{|TP|}{|TP|+|FP|+|FN|}$
- Dice Coefficient: $\frac{2|P \cap G|}{|P| + |G|} = \frac{2|TP|}{2|TP|+|FP|+|FN|}$
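
The two metrics can be computed directly from the pixel sets $P$ and $G$. The sketch below represents each mask as a Python set of `(row, column)` coordinates; returning 1.0 when both sets are empty is a convention assumed for this example, not something the paper specifies.

```python
def iou(pred, truth):
    """Intersection-over-Union between two sets of pixel coordinates."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


def dice(pred, truth):
    """Dice coefficient: twice the overlap divided by the two set sizes."""
    total = len(pred) + len(truth)
    return 2 * len(pred & truth) / total if total else 1.0


P = {(0, 0), (0, 1), (1, 0), (1, 1)}  # predicted pixels
G = {(0, 1), (1, 1), (1, 2)}          # ground truth pixels
print(iou(P, G))   # 2 shared pixels out of 5 in the union
print(dice(P, G))  # 2 * 2 shared pixels out of 4 + 3
```

The two scores are monotonically related by Dice = 2 IoU / (1 + IoU), so they rank predictions on a single mask identically.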

Experiment Design and Result

Design

JSRT
- development set: 209 images (randomly)
- evaluation set: 38 images
- tune hyperparameters (such as $\lambda$ in Eq.(1)) using a validation set within development set
Montgomery
- development set: 117 images(randomly)
- evaluation set: 21 images
- use the same hyperparameters tuned in JSRT
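
A random development/evaluation split with these sizes could be produced along the following lines. The function name, the use of image indices in place of file names and the fixed seed are assumptions for illustration; the notes do not say how the authors drew their split.

```python
import random


def split_dataset(n_images, n_eval, seed=0):
    """Randomly split image indices into (development, evaluation) lists."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    return sorted(indices[n_eval:]), sorted(indices[:n_eval])


jsrt_dev, jsrt_eval = split_dataset(247, 38)   # JSRT: 209 / 38
mont_dev, mont_eval = split_dataset(138, 21)   # Montgomery: 117 / 21
print(len(jsrt_dev), len(jsrt_eval), len(mont_dev), len(mont_eval))
```

The validation set used for tuning $\lambda$ would be carved out of the development indices in the same way.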

Performance

1. Compare FCN with SCAN on JSRT
2. Compare to existing methods on JSRT
- current state-of-the-art: registration-based
3. On a different dataset (transferability)
- different population
- train on the full JSRT and test on the full Montgomery
- the FCN alone transfers poorly across datasets
4. Time efficiency

References

A simple explanation of and experiments with generative adversarial networks (GAN)
Generative models & generative adversarial networks: a survey of materials (interviews + paper categories)
Image semantic segmentation: FCN and CRF
Wikipedia: minimax
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
