Natural Language Explanations for Fine-grained Image Classification


We propose the first method to produce deep visual explanations using natural language justifications. Our vision-and-language explanation model combines classification and sentence generation and incorporates a loss function that operates over sampled sentences. Results on a fine-grained bird species classification dataset show that the model generates explanations that are not only consistent with the image but also more discriminative than descriptions produced by existing captioning methods. Importantly, ground-truth textual explanations are not required at training time.
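
To make the idea of a loss over sampled sentences concrete, here is a minimal sketch in plain Python. It rests on assumptions that this description does not confirm: a REINFORCE-style surrogate is used to train through sampling, the per-step word distributions are fixed rather than conditioned on earlier words, and reward_fn is a hypothetical stand-in for a score of how class-discriminative a sampled sentence is. All function and parameter names are illustrative.

```python
import math
import random


def sample_sentence(step_probs, rng):
    """Sample one token per step; return the tokens and their summed log-probability."""
    tokens, log_prob = [], 0.0
    for probs in step_probs:
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(token)
        log_prob += math.log(probs[token])
    return tokens, log_prob


def relevance_loss(step_probs, reference):
    """Average negative log-likelihood of a reference description."""
    nll = -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, reference))
    return nll / len(reference)


def sampled_sentence_loss(step_probs, reward_fn, rng, num_samples=4):
    """REINFORCE-style surrogate (assumed): -mean(reward * log p(sentence))."""
    total = 0.0
    for _ in range(num_samples):
        tokens, log_prob = sample_sentence(step_probs, rng)
        total += reward_fn(tokens) * log_prob
    return -total / num_samples


def combined_loss(step_probs, reference, reward_fn, rng, weight=1.0):
    """Description likelihood term plus a weighted term over sampled sentences."""
    return relevance_loss(step_probs, reference) + weight * sampled_sentence_loss(
        step_probs, reward_fn, rng
    )
```

Minimizing the surrogate term raises the probability of sampled sentences that receive a high reward, while the likelihood term keeps generated text close to the available image descriptions. In a real system both terms would be computed inside an automatic differentiation framework rather than on fixed probability tables.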

Intended Use

This software will be of interest to researchers and developers working on fine-grained recognition, in particular in scenarios where expert annotations are unavailable or difficult to obtain.

The proposed model could be applied to any fine-grained classification dataset that has some natural language descriptions associated with the images (they do not need to be expert explanations).


The input is an image and a predicted class label. The output is a natural language visual explanation (both visually accurate and class-discriminative).
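
The input and output contract described above can be sketched as a small typed interface. Everything here is hypothetical and is not the released software's API: the class names, the field names, and the choice to represent the image as a sequence of floats are assumptions made only to show the call shape.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class ExplanationInput:
    """Hypothetical container for the two inputs: an image and a predicted label."""
    image: Sequence[float]   # assumed: flattened pixels or precomputed features
    predicted_label: str     # class label produced by an upstream classifier


class Explainer(Protocol):
    """Hypothetical interface: maps (image, label) to a natural language explanation."""

    def explain(self, item: ExplanationInput) -> str: ...


class PlaceholderExplainer:
    """Stand-in that only demonstrates the call shape; it does not model anything."""

    def explain(self, item: ExplanationInput) -> str:
        return f"This is a {item.predicted_label} because ..."
```

The point of the sketch is that the explanation is conditioned on both the image and the predicted class, which is what allows the output to be class-discriminative as well as visually accurate.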

The model is trained on a parallel corpus of images with associated textual descriptions (they do not need to be explanations).


The model requires a parallel corpus of images with associated textual descriptions.
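
As an illustration of what such a parallel corpus might look like in code, the sketch below defines one record per image and a simple check that each record is usable. The field names and the requirement that every record carries a class label are assumptions for illustration, not a documented data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorpusExample:
    """Hypothetical record: one image paired with its textual descriptions."""
    image_path: str
    class_label: str
    descriptions: List[str]


def find_unusable_examples(examples: List[CorpusExample]) -> List[str]:
    """Return image paths that lack a class label or any non-empty description."""
    unusable = []
    for example in examples:
        has_text = any(text.strip() for text in example.descriptions)
        if not example.class_label or not has_text:
            unusable.append(example.image_path)
    return unusable
```

A check of this kind is useful before training because the descriptions are ordinary captions rather than expert explanations, so missing or empty entries are the main thing to rule out.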


Hendricks, Lisa Anne; Akata, Zeynep; Rohrbach, Marcus; Donahue, Jeff; Schiele, Bernt; Darrell, Trevor. Generating Visual Explanations. In Proceedings of the European Conference on Computer Vision (ECCV).