Datasets with multimodal explanations
Overview
These are the first datasets to provide multimodal human ground-truth explanations (both textual and visual) for two downstream tasks: visual question answering (VQA-X) and activity recognition (ACT-X).
Intended Use
These datasets will be of interest to AI researchers working on explainability in multimodal contexts, and they serve as a benchmark for evaluating the correctness of generated explanations.
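As a rough illustration of benchmark-style use, the sketch below scores a generated textual explanation against human ground-truth explanations with sentence-level BLEU via NLTK. The metric choice and the toy strings are assumptions for illustration, not part of the dataset release or the original paper's evaluation protocol.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def explanation_bleu(generated: str, references: list[str]) -> float:
    """Sentence-level BLEU of one generated explanation against the
    human ground-truth explanations (whitespace-tokenized)."""
    smooth = SmoothingFunction().method1
    refs = [r.lower().split() for r in references]
    hyp = generated.lower().split()
    return sentence_bleu(refs, hyp, smoothing_function=smooth)


# Toy example strings (not taken from the dataset):
score = explanation_bleu(
    "because the man is holding a tennis racket",
    ["the man is swinging a tennis racket on a court"],
)
print(f"BLEU: {score:.3f}")
```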
Model/Data
Our datasets provide visual and textual justifications of a classification decision for the activity recognition task (ACT-X) and the visual question answering task (VQA-X).
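A minimal loading sketch is given below. The file name, JSON layout, and field names are assumptions for illustration only and should be adapted to the actual annotation files in the release.

```python
import json


def load_explanations(path: str) -> list[dict]:
    """Hypothetical loader: each record is assumed to pair an image
    (and, for VQA-X, a question) with the predicted/ground-truth label,
    one or more textual explanations, and a visual-evidence annotation."""
    with open(path) as f:
        records = json.load(f)
    return [
        {
            "image_id": r["image_id"],
            "question": r.get("question"),              # present for VQA-X only
            "answer_or_label": r["answer"],
            "explanations": r["explanations"],          # list of textual justifications
            "visual_annotation": r.get("segmentation"), # visual evidence, if provided
        }
        for r in records
    ]


if __name__ == "__main__":
    data = load_explanations("vqa_x_train.json")  # hypothetical file name
    print(len(data), "annotated instances")
```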
References
@inproceedings{park2018multimodal,
title={Multimodal explanations: Justifying decisions and pointing to the evidence},
author={Park, Dong Huk and Hendricks, Lisa Anne and Akata, Zeynep and Rohrbach, Anna and Schiele, Bernt and Darrell, Trevor and Rohrbach, Marcus},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={8779--8788},
year={2018}
}