Multi-step salience with neural modular networks


Existing models designed to produce interpretable traces of their decision-making process typically require these traces to be supervised at training time. We present a novel neural modular approach that performs compositional reasoning by automatically inducing a desired sub-task decomposition without relying on strong supervision. That is, our approach decomposes a task into multiple steps and provides a visual explanation for each sub-task. Our model also links different reasoning tasks through shared modules that handle routines common to those tasks.

Intended Use

This software will be of interest to AI researchers working on explainability for complex reasoning and to those working on modular (inherently explainable) architectures.

The proposed method does not require strong expert supervision of the reasoning steps. It has been shown to apply to two different problems, visual question answering (VQA) and referring expression comprehension, and could be extended to other related tasks that require multi-step reasoning.


Depending on the task, the input to the model is an image and a question or query, and the output is an interpretable multi-step salience together with an answer (for VQA) or a bounding box (for referring expression comprehension).
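
To make the idea of a multi-step salience more concrete, the following is a minimal toy sketch and not the released implementation. It assumes that each reasoning step softly mixes the outputs of a small set of shared modules and records the resulting attention over image regions. The module names (find, keep), the function run_steps, and the per-region scores are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy modules. Each maps (previous attention, region scores)
# to a new attention distribution over image regions.
def find(prev_att, region_scores):
    # Attend to regions in proportion to their (assumed) relevance scores.
    return softmax(region_scores)


def keep(prev_att, region_scores):
    # Pass the previous attention through unchanged.
    return list(prev_att)


MODULES = [find, keep]


def run_steps(region_scores, module_logits_per_step):
    """Run one soft module mixture per step and return the attention trace.

    module_logits_per_step holds one list of logits per reasoning step,
    with one logit per entry in MODULES (an assumption of this sketch).
    """
    n = len(region_scores)
    att = [1.0 / n] * n  # start from uniform attention
    trace = []
    for logits in module_logits_per_step:
        weights = softmax(logits)
        outputs = [module(att, region_scores) for module in MODULES]
        # Weighted average of module outputs, region by region.
        att = [sum(w * out[i] for w, out in zip(weights, outputs))
               for i in range(n)]
        trace.append(att)
    return trace


if __name__ == "__main__":
    # Three regions; step 1 favours find, step 2 favours keep.
    for step, att in enumerate(run_steps([0.0, 2.0, 0.0],
                                         [[5.0, 0.0], [0.0, 5.0]]), 1):
        print(step, [round(a, 3) for a in att])
```

Each entry of the returned trace is one attention map per step, which is the kind of intermediate output that can be visualised as a step-by-step explanation. In a real model the module weights and region scores would be predicted from the question and image rather than supplied by hand.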

The model is trained on two datasets, one for visual question answering and one for referring expression comprehension.


The accuracy of the model is slightly below that of a non-modular (not explainable) state-of-the-art model.


  title={Explainable Neural Computation via Stack Neural Module Networks},
  author={Hu, Ronghang and Andreas, Jacob and Darrell, Trevor and Saenko, Kate},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},