AVE-2: Audio-Visual Events Dataset (Hugging Face)
Curated dataset for audio-visual event understanding and cross-modal alignment. Enhanced version of the original AVE dataset with improved annotations and multimodal grounding.
VERIFY: Visual Explanation and Reasoning Benchmark (Hugging Face)
Comprehensive benchmark for assessing multimodal reasoning fidelity. Tests whether multimodal large language models produce explanations and reasoning that stay consistent across modalities.
PW-VQA: Possible Worlds Visual Question Answering (GitHub)
Causal evaluation suite for cross-modal bias in visual question answering. Probes spurious correlations and tests the reasoning robustness of vision-language models.
MMPerspective Dataset (GitHub)
Benchmark for perspective perception, reasoning, and robustness evaluation in multimodal large language models. Tests spatial understanding across visual modalities.
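Benchmarks like the VQA-style suites above are typically scored by comparing model predictions against gold answers. The following is a minimal sketch of such a scoring loop; the field names ("question", "answer", "prediction") are hypothetical placeholders, not the actual schema of any dataset listed here.

```python
# Hypothetical exact-match scoring for a VQA-style benchmark.
# Field names below are illustrative, not taken from the datasets above.

def exact_match_accuracy(examples):
    """Fraction of examples whose prediction equals the gold answer
    after lowercasing and stripping surrounding whitespace."""
    if not examples:
        return 0.0
    hits = sum(
        1
        for ex in examples
        if ex["prediction"].strip().lower() == ex["answer"].strip().lower()
    )
    return hits / len(examples)

examples = [
    {"question": "What makes the sound?", "answer": "dog", "prediction": "Dog"},
    {"question": "Is the object occluded?", "answer": "yes", "prediction": "no"},
]
print(exact_match_accuracy(examples))  # → 0.5
```

Real benchmark harnesses usually add answer normalization (articles, punctuation, synonyms) on top of this, but the exact rules vary per dataset.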