PromptReverb: Text-to-Spatial-Audio Generation Project Site
A multimodal generative model that conditions on natural language to synthesize immersive room impulse responses, enabling multimedia experiences to be placed inside a described environment from a single text prompt. The architecture pairs a VAE that learns a compact latent representation with a language-conditioned rectified flow-matching model that generates within that latent space.
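A minimal sketch of the two-stage recipe described above, in PyTorch: a rectified flow-matching model that regresses straight-line velocities in a VAE latent space, conditioned on a text embedding. All module names, dimensions, and the MLP velocity field are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real model's dimensions are not specified here.
LATENT_DIM, TEXT_DIM, HIDDEN = 64, 512, 256

class FlowMatcher(nn.Module):
    """Velocity field v(z_t, t, c) conditioned on a text embedding c."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 1 + TEXT_DIM, HIDDEN), nn.SiLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.SiLU(),
            nn.Linear(HIDDEN, LATENT_DIM),
        )

    def forward(self, z_t, t, text_emb):
        return self.net(torch.cat([z_t, t, text_emb], dim=-1))

def rectified_flow_loss(model, z1, text_emb):
    """Rectified flow: regress the straight-line velocity (z1 - z0)
    along the interpolant z_t = (1 - t) * z0 + t * z1."""
    z0 = torch.randn_like(z1)          # noise endpoint
    t = torch.rand(z1.size(0), 1)      # uniform time in [0, 1)
    z_t = (1 - t) * z0 + t * z1
    v_pred = model(z_t, t, text_emb)
    return ((v_pred - (z1 - z0)) ** 2).mean()

@torch.no_grad()
def sample(model, text_emb, steps=50):
    """Euler integration of dz/dt = v(z, t, c) from noise to a latent,
    which the VAE decoder would then map back to an impulse response."""
    z = torch.randn(text_emb.size(0), LATENT_DIM)
    for i in range(steps):
        t = torch.full((z.size(0), 1), i / steps)
        z = z + model(z, t, text_emb) / steps
    return z

# Toy usage with random tensors standing in for real encoder outputs.
model = FlowMatcher()
z1 = torch.randn(4, LATENT_DIM)    # latents from the VAE encoder
text = torch.randn(4, TEXT_DIM)    # frozen text-encoder embeddings
loss = rectified_flow_loss(model, z1, text)
latents = sample(model, text)
```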
SoundCLIP: Unified Audio-Visual Understanding Live Demo • GitHub • Dataset
Adapts LLaVA's latent space to ingest audio alongside video. An interactive demonstration of unified multimodal token alignment for audio-visual understanding tasks.
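One common way to realize this kind of adaptation, sketched below under assumed dimensions: project audio-encoder features into the LLM's token-embedding space with a small MLP, mirroring how LLaVA projects visual features, then concatenate them with video and text tokens into one sequence. The AudioProjector name and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; LLaVA's actual projector and hidden size may differ.
AUDIO_DIM, VIDEO_DIM, LLM_DIM = 768, 1024, 4096

class AudioProjector(nn.Module):
    """Maps audio-encoder features into the LLM token-embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(AUDIO_DIM, LLM_DIM), nn.GELU(),
            nn.Linear(LLM_DIM, LLM_DIM),
        )

    def forward(self, audio_feats):    # (B, T_a, AUDIO_DIM)
        return self.proj(audio_feats)  # (B, T_a, LLM_DIM)

def build_multimodal_sequence(text_emb, video_tokens, audio_tokens):
    """Interleave projected audio and video tokens with text embeddings
    into one sequence the LLM consumes as ordinary tokens."""
    return torch.cat([video_tokens, audio_tokens, text_emb], dim=1)

# Toy usage with random features standing in for real encoder outputs.
proj = AudioProjector()
audio = proj(torch.randn(1, 20, AUDIO_DIM))
video = torch.randn(1, 32, LLM_DIM)   # already-projected video tokens
text = torch.randn(1, 16, LLM_DIM)
seq = build_multimodal_sequence(text, video, audio)
print(seq.shape)  # torch.Size([1, 68, 4096])
```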
AVVA: Multi-LLM orchestration for Audio-Video Vector Alignment Project Site • GitHub
Multi-LLM-gated curation system for large-scale audiovisual data. Quality-over-quantity approach for training data-efficient audio-video foundation models.
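A hedged sketch of what multi-LLM gating can look like in practice: several independent LLM judges score how well a clip's audio and video descriptions agree, and a sample is kept only on sufficient consensus. The Judge interface, thresholds, and stub scorers below are illustrative assumptions, not the system's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Clip:
    video_caption: str
    audio_caption: str

# Hypothetical judge signature: each LLM returns an alignment score in
# [0, 1] for how well the audio and video descriptions agree.
Judge = Callable[[Clip], float]

def curate(clips: List[Clip], judges: List[Judge],
           threshold: float = 0.7, min_votes: int = 2) -> List[Clip]:
    """Keep a clip only if at least `min_votes` judges score its
    audio-video agreement above `threshold` (quality over quantity)."""
    kept = []
    for clip in clips:
        votes = sum(judge(clip) >= threshold for judge in judges)
        if votes >= min_votes:
            kept.append(clip)
    return kept

# Stub judges standing in for real LLM calls.
judges = [
    lambda c: 0.9 if "dog" in c.video_caption and "bark" in c.audio_caption else 0.3,
    lambda c: 0.8,  # a permissive second judge
]
clips = [Clip("a dog runs in a park", "barking and birdsong"),
         Clip("a city street at night", "barking and birdsong")]
print(len(curate(clips, judges)))  # 1: the mismatched clip is dropped
```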
PW-VQA: Possible Worlds Visual Question Answering GitHub
Causal VQA benchmark for investigating cross-modal bias. Interactive evaluation framework for testing multimodal reasoning fidelity.
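One way such a bias probe can be framed, sketched below with a hypothetical model interface: ask the same question against a factual image and a counterfactual "possible world" edit of it, and flag the model if its answer is insensitive to the changed visual evidence. This illustrates the idea only; it is not the benchmark's actual harness.

```python
from typing import Callable

# Hypothetical model interface: maps (image_path, question) -> answer.
VQAModel = Callable[[str, str], str]

def bias_probe(model: VQAModel, question: str,
               factual_image: str, counterfactual_image: str) -> bool:
    """Possible-worlds check: if the answer does not change even though
    the visual evidence did, the model is likely answering from
    language priors rather than the image."""
    a_fact = model(factual_image, question)
    a_cf = model(counterfactual_image, question)
    return a_fact == a_cf  # True flags suspected cross-modal bias

# Stub model that ignores the image entirely (maximally biased).
blind_model: VQAModel = lambda image, q: "yes"
print(bias_probe(blind_model, "Is there a cat?", "real.png", "edited.png"))  # True
```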
MMPerspective: Multimodal Perspective Understanding GitHub
Comprehensive benchmark for perspective perception, reasoning, and robustness in multimodal large language models.