1. Paper Information
Scott Reed1, Zeynep Akata2, Honglak Lee1 and Bernt Schiele2
1University of Michigan 2Max-Planck Institute for Informatics
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, 2016.
2. Abstract
State-of-the-art zero-shot visual recognition treats learning as a joint problem over images and side information. The most effective side information for visual features so far has been attributes: hand-encoded vectors describing characteristics shared across classes. Although attribute-based methods perform well, attributes have two limitations:
(1) Finer-grained recognition requires a commensurately larger number of attributes.
(2) Attributes do not provide a natural language interface.
The authors overcome these limitations by training neural language models from scratch, with no pre-training, operating directly on words and characters. The proposed models train end-to-end to align with the fine-grained and category-specific content of images. Natural language provides a flexible and compact way of encoding only the salient visual aspects that distinguish categories. The model shows strong performance on zero-shot text-based image retrieval, and significantly outperforms the attribute-based state of the art in zero-shot classification on the Caltech-UCSD Birds 200-2011 dataset.
3. Main Contributions
(1) Collected two datasets of fine-grained visual descriptions: one for the Caltech-UCSD Birds dataset and another for the Oxford-102 Flowers dataset [32]; both the data and the code will be made available.
(2) Proposed a novel extension of structured joint embedding [2] and showed that it can be used for end-to-end training of deep neural language models; it also dramatically improves zero-shot retrieval performance for all models.
(3) Evaluated several variants of word- and character-based neural language models, including novel hybrids of convolutional and recurrent networks for text modeling, demonstrating significant improvements over the state of the art on the CUB and Flowers datasets in both zero-shot recognition and retrieval.
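The structured joint embedding idea above scores an image against a text description with an inner-product compatibility function, and classifies an unseen image by picking the class whose description embedding scores highest. A minimal numpy sketch of that scoring step, assuming the image and text encoders have already produced fixed-length vectors (the function names here are hypothetical, not from the paper's code):

```python
import numpy as np

def compatibility(image_emb, text_emb):
    # F(v, t): inner product between the encoded image and encoded text
    return float(np.dot(image_emb, text_emb))

def zero_shot_classify(image_emb, class_text_embs):
    # Predict the (possibly unseen) class whose text embedding
    # has the highest compatibility with the image embedding
    scores = [compatibility(image_emb, t) for t in class_text_embs]
    return int(np.argmax(scores))
```

During training, the paper's structured loss pushes the compatibility of matching image/text pairs above that of mismatched ones; at test time only the argmax above is needed.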
4. Key Ideas
(1)The CUB dataset also has per-image attributes, but we found that using these does not improve performance compared to using a single averaged attribute vector per class.
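The point above — that a single averaged vector per class works as well as per-image attributes — amounts to collapsing all per-image side-information vectors of each class into one class-level vector. A minimal numpy sketch of that averaging step (a hypothetical helper, not the paper's code):

```python
import numpy as np

def class_averaged_embeddings(embs, labels, n_classes):
    # embs: (n_images, dim) per-image attribute/text vectors
    # labels: (n_images,) integer class label per image
    # Returns one averaged vector per class, shape (n_classes, dim)
    out = np.zeros((n_classes, embs.shape[1]))
    for c in range(n_classes):
        out[c] = embs[labels == c].mean(axis=0)
    return out
```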
References
[1] Learning Deep Representations of Fine-Grained Visual Descriptions (CVPR 2016)
[2] [Paper Reading] Learning Deep Representations of Fine-Grained Visual Descriptions (blog post)