Datasets with multimodal explanations


This is the first dataset to provide multimodal human ground-truth explanations, both textual and visual, for two downstream tasks: visual question answering (VQA-X) and activity recognition (ACT-X).

Intended Use

This dataset is intended for AI researchers working on explainability in multimodal contexts. It serves as a benchmark for evaluating the correctness of generated explanations.


The datasets provide visual and textual justifications of classification decisions for activity recognition (ACT-X) and visual question answering (VQA-X).


  title={Multimodal explanations: Justifying decisions and pointing to the evidence},
  author={Park, Dong Huk and Hendricks, Lisa Anne and Akata, Zeynep and Rohrbach, Anna and Schiele, Bernt and Darrell, Trevor and Rohrbach, Marcus},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},