PromptReverb: Text-to-Spatial-Audio Generation Project Site
A multimodal generative model that conditions on natural language to synthesize immersive room impulse responses, enabling multimedia experiences to be placed inside a described environment from a single text prompt. The architecture pairs a VAE that learns a compact latent representation with a language-conditioned rectified flow-matching model that generates within that latent space.
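A minimal sketch of the two-stage recipe described above, in PyTorch: a rectified flow-matching model that regresses straight-line velocities in a VAE latent space, conditioned on a text embedding. All module names, dimensions, and the MLP velocity field are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the real model's dimensions are not specified here.
LATENT_DIM, TEXT_DIM, HIDDEN = 64, 512, 256

class FlowMatcher(nn.Module):
    """Velocity field v(z_t, t, c) conditioned on a text embedding c."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 1 + TEXT_DIM, HIDDEN), nn.SiLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.SiLU(),
            nn.Linear(HIDDEN, LATENT_DIM),
        )

    def forward(self, z_t, t, text_emb):
        return self.net(torch.cat([z_t, t, text_emb], dim=-1))

def rectified_flow_loss(model, z1, text_emb):
    """Rectified flow: regress the straight-line velocity (z1 - z0)
    along the interpolant z_t = (1 - t) * z0 + t * z1."""
    z0 = torch.randn_like(z1)          # noise endpoint
    t = torch.rand(z1.size(0), 1)      # uniform time in [0, 1)
    z_t = (1 - t) * z0 + t * z1
    v_pred = model(z_t, t, text_emb)
    return ((v_pred - (z1 - z0)) ** 2).mean()

@torch.no_grad()
def sample(model, text_emb, steps=50):
    """Euler integration of dz/dt = v(z, t, c) from noise to a latent,
    which the VAE decoder would then map back to an impulse response."""
    z = torch.randn(text_emb.size(0), LATENT_DIM)
    for i in range(steps):
        t = torch.full((z.size(0), 1), i / steps)
        z = z + model(z, t, text_emb) / steps
    return z

# Toy usage with random tensors standing in for real encoder outputs.
model = FlowMatcher()
z1 = torch.randn(4, LATENT_DIM)    # latents from the VAE encoder
text = torch.randn(4, TEXT_DIM)    # frozen text-encoder embeddings
loss = rectified_flow_loss(model, z1, text)
latents = sample(model, text)
```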
SoundCLIP: Unified Audio-Visual Understanding Live Demo • GitHub • Dataset
Adapts LLaVA's latent space to ingest audio alongside video. An interactive demonstration of unified multimodal token alignment for audio-visual understanding tasks.
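One common way to realize this kind of adaptation, sketched below under assumed dimensions: project audio-encoder features into the LLM's token-embedding space with a small MLP, mirroring how LLaVA projects visual features, then concatenate them with video and text tokens into one sequence. The AudioProjector name and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; LLaVA's actual projector and hidden size may differ.
AUDIO_DIM, VIDEO_DIM, LLM_DIM = 768, 1024, 4096

class AudioProjector(nn.Module):
    """Maps audio-encoder features into the LLM token-embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(AUDIO_DIM, LLM_DIM), nn.GELU(),
            nn.Linear(LLM_DIM, LLM_DIM),
        )

    def forward(self, audio_feats):    # (B, T_a, AUDIO_DIM)
        return self.proj(audio_feats)  # (B, T_a, LLM_DIM)

def build_multimodal_sequence(text_emb, video_tokens, audio_tokens):
    """Interleave projected audio and video tokens with text embeddings
    into one sequence the LLM consumes as ordinary tokens."""
    return torch.cat([video_tokens, audio_tokens, text_emb], dim=1)

# Toy usage with random features standing in for real encoder outputs.
proj = AudioProjector()
audio = proj(torch.randn(1, 20, AUDIO_DIM))
video = torch.randn(1, 32, LLM_DIM)   # already-projected video tokens
text = torch.randn(1, 16, LLM_DIM)
seq = build_multimodal_sequence(text, video, audio)
print(seq.shape)  # torch.Size([1, 68, 4096])
```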
AVVA: Multi-LLM orchestration for Audio-Video Vector Alignment Project Site • GitHub
Multi-LLM-gated curation system for large-scale audiovisual data. Quality-over-quantity approach for training data-efficient audio-video foundation models.
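A hedged sketch of what multi-LLM gating can look like in practice: several independent LLM judges score how well a clip's audio and video descriptions agree, and a sample is kept only on sufficient consensus. The Judge interface, thresholds, and stub scorers below are illustrative assumptions, not the system's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Clip:
    video_caption: str
    audio_caption: str

# Hypothetical judge signature: each LLM returns an alignment score in
# [0, 1] for how well the audio and video descriptions agree.
Judge = Callable[[Clip], float]

def curate(clips: List[Clip], judges: List[Judge],
           threshold: float = 0.7, min_votes: int = 2) -> List[Clip]:
    """Keep a clip only if at least `min_votes` judges score its
    audio-video agreement above `threshold` (quality over quantity)."""
    kept = []
    for clip in clips:
        votes = sum(judge(clip) >= threshold for judge in judges)
        if votes >= min_votes:
            kept.append(clip)
    return kept

# Stub judges standing in for real LLM calls.
judges = [
    lambda c: 0.9 if "dog" in c.video_caption and "bark" in c.audio_caption else 0.3,
    lambda c: 0.8,  # a permissive second judge
]
clips = [Clip("a dog runs in a park", "barking and birdsong"),
         Clip("a city street at night", "barking and birdsong")]
print(len(curate(clips, judges)))  # 1: the mismatched clip is dropped
```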
PW-VQA: Possible Worlds Visual Question Answering GitHub
Causal VQA benchmark for investigating cross-modal bias. Interactive evaluation framework for testing multimodal reasoning fidelity.
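One way such a bias probe can be framed, sketched below with a hypothetical model interface: ask the same question against a factual image and a counterfactual "possible world" edit of it, and flag the model if its answer is insensitive to the changed visual evidence. This illustrates the idea only; it is not the benchmark's actual harness.

```python
from typing import Callable

# Hypothetical model interface: maps (image_path, question) -> answer.
VQAModel = Callable[[str, str], str]

def bias_probe(model: VQAModel, question: str,
               factual_image: str, counterfactual_image: str) -> bool:
    """Possible-worlds check: if the answer does not change even though
    the visual evidence did, the model is likely answering from
    language priors rather than the image."""
    a_fact = model(factual_image, question)
    a_cf = model(counterfactual_image, question)
    return a_fact == a_cf  # True flags suspected cross-modal bias

# Stub model that ignores the image entirely (maximally biased).
blind_model: VQAModel = lambda image, q: "yes"
print(bias_probe(blind_model, "Is there a cat?", "real.png", "edited.png"))  # True
```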
MMPerspective: Multimodal Perspective Understanding GitHub
Comprehensive benchmark for perspective perception, reasoning, and robustness in multimodal large language models.