SoundCLIP: Unified Audio-Visual Understanding Live Demo • GitHub
Adapts the LLaVA latent space to ingest audio alongside video. Interactive demonstration of unified multimodal token alignment for audio-visual understanding tasks.
AVVA-Curation: Audio-Video Vector Alignment Project Site • GitHub
LLM-gated curation system for large-scale audiovisual data. A quality-over-quantity approach to training data-efficient audio-video foundation models.
PW-VQA: Possible Worlds Visual Question Answering GitHub
Causal VQA benchmark for probing cross-modal bias. Interactive evaluation framework for testing multimodal reasoning fidelity.
MMPerspective: Multimodal Perspective Understanding GitHub
Comprehensive benchmark for perspective perception, reasoning, and robustness in multimodal large language models.