Interactive Demos & Tools

  • PromptReverb: Text-to-Spatial-Audio Generation
    Project Site
    A multimodal generative model that conditions on natural language to synthesize immersive impulse responses, allowing multimedia content to be placed inside a described acoustic environment from a single text prompt. The architecture pairs a VAE, which learns a compact latent representation, with an NLP-conditioned rectified flow-matching model that generates within that latent space under language guidance.
  • SoundCLIP: Unified Audio-Visual Understanding
    Live Demo · GitHub · Dataset
    Adapts the LLaVA latent space to ingest audio alongside video. An interactive demonstration of unified multimodal token alignment for audio-visual understanding tasks.
  • AVVA: Multi-LLM orchestration for Audio-Video Vector Alignment
    Project Site · GitHub
    A multi-LLM-gated curation system for large-scale audiovisual data, taking a quality-over-quantity approach to training data-efficient audio-video foundation models.
  • PW-VQA: Possible Worlds Visual Question Answering
    GitHub
    Causal VQA benchmark for investigating cross-modal bias. Interactive evaluation framework for testing multimodal reasoning fidelity.
  • MMPerspective: Multimodal Perspective Understanding
    GitHub
    Comprehensive benchmark for perspective perception, reasoning, and robustness in multimodal large language models.
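The PromptReverb pipeline above (a VAE latent space plus a text-conditioned rectified flow-matching generator) can be sketched in a few lines. This is a toy illustration only, under assumed names: `toy_velocity`, `sample_latent`, and `LATENT_DIM` are invented for the sketch and are not the project's actual API, and the toy velocity field stands in for a trained network.

```python
import numpy as np

LATENT_DIM = 8  # illustrative size of the VAE latent space

def toy_velocity(z_t, t, text_embedding):
    # Stand-in for the learned velocity field v(z_t, t, c).
    # A trained rectified-flow model predicts the (roughly straight-line)
    # velocity from noise toward the data latent, conditioned on the prompt.
    return text_embedding - z_t  # toy field pulling the latent toward the condition

def sample_latent(text_embedding, steps=50, rng=None):
    # Rectified flow sampling: integrate dz/dt = v(z_t, t, c)
    # from t = 0 (pure noise) to t = 1 (data latent) with fixed-step Euler.
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(LATENT_DIM)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        z = z + dt * toy_velocity(z, t, text_embedding)
    return z

# Pretend text embedding for a prompt like "a large stone hall".
cond = np.ones(LATENT_DIM)
latent = sample_latent(cond)
# The VAE decoder (not shown) would then map `latent` back to audio.
```

The two-stage split is the usual design choice: generation happens in the compact VAE latent space, which is far cheaper than flowing directly in the raw audio domain.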