Datasets & Benchmarks

  • AVE-2: Audio-Visual Events Dataset
    Hugging Face
    Curated dataset for audio-visual event understanding and cross-modal alignment. An enhanced version of the original AVE dataset with improved annotations and multimodal grounding (a loading sketch for the Hugging Face-hosted entries follows this list).
  • VERIFY: Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
    Hugging Face
    Benchmark for investigating multimodal reasoning fidelity. Tests whether multimodal large language models can provide consistent explanations and reasoning across modalities.
  • PW-VQA: Possible Worlds Visual Question Answering
    GitHub
    Causal evaluation suite for cross-modal bias in visual question answering. Investigates spurious correlations and reasoning robustness in vision-language models.
  • MMPerspective Dataset
    GitHub
    Benchmark for perspective perception, reasoning, and robustness evaluation in multimodal large language models. Tests spatial and geometric understanding of visual scenes.
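
For the Hugging Face-hosted entries, a minimal loading sketch is below, assuming each is published as a standard Hugging Face dataset repository. The repository ID and split name are placeholders, not confirmed identifiers; check each project's dataset card for the real values.

    # A minimal sketch, assuming AVE-2 is published as a standard
    # Hugging Face dataset repository. The repo ID below is a placeholder:
    # substitute the actual identifier from the project's dataset card.
    from datasets import load_dataset

    # Hypothetical repository ID and split name; both are assumptions.
    ave2 = load_dataset("some-org/AVE-2", split="train")

    # Inspect the first example to discover its fields (e.g. video path,
    # audio clip, event labels) before writing task-specific code.
    print(ave2[0])

For the GitHub-hosted suites (PW-VQA and MMPerspective), the usual pattern is to clone the repository and follow its README for data download and evaluation scripts.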