Ali Vosoughi

阿力

PhD Candidate, advised by Prof. Axel Wismueller and Prof. Chenliang Xu
University of Rochester

🤖 Agentic AI Systems 🎵 Computer Audition 🧠 Multimodal Reasoning 🎬 Multimodal Generation 🥽 Immersive Computing 🔍 Reasoning Verification 🎯 Reinforcement Learning 🚀 Large Action Models 🔊 Audio Generation 📹 Video Generation

📧 ali.vosoughi@rochester.edu

📍 CS Department, Wegmans Hall 3211

🍎 Apple

Machine Learning Intern
Agentic Multimodal AI

present

🎵 Smule AI

Research Scientist Intern
Spatial Audio Generation

Jun–Sep 2025

🏢 Microsoft Research

Research Intern
Audiovisual LLM

May–Aug 2024

🚗 Bosch AI Research

Research Intern
Audio LLM

Apr–Jul 2023

🛡️ DARPA PTG

Graduate Researcher
Autonomous AR Copilot

2022–present

🏆

First counterfactual audio methods
ICASSP’24 + US Patent US20250124292A1 (published Jan 2025)

🤝

Autonomous multimodal copilot
Real-time AR demonstrations (DARPA)

📊

VERIFY benchmark
Reasoning verification framework

Recent News & Updates

03/2025

🚀 Published VERIFY benchmark

10/2024

🎤 Presented at SANE 2024, DeepMind Boston

10/2024

📄 ACM Multimedia 2024 paper accepted

08/2024

💼 Research presentation at Microsoft, Seattle

03/2024

📄 NAACL 2024 paper accepted

02/2024

📄 IEEE Transactions on Multimedia paper

08/2023

🎯 Two ICCV 2023 papers accepted

04/2023

🏢 Started internship at Bosch Center for AI

04/2022

🏆 Nominated for Donald M. and Janet C. Barnard Fellowship

Publications

PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026
[Paper][Website]

VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Under review, 2026
[Paper][Website][🤗 Hugging Face]

Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
European Signal Processing Conference (EUSIPCO) 2025
[Paper][Website]

EAGLE: Egocentric AGgregated Language-video Engine
ACM International Conference on Multimedia (ACM MM) 2024
[Paper]

PW-VQA: Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA
IEEE Transactions on Multimedia (TMM) 2024
[Paper][Code][Website]

OSCaR: Object State Captioning and State Change Representation
North American Chapter of the Association for Computational Linguistics (NAACL) 2024
[Paper][Code]

Video Understanding with Large Language Models: A Survey
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 2025
[Paper][Code]

Learning Audio Concepts from Counterfactual Natural Language
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024
[Paper][Code][Patent]

AVSA-Sep: Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
IEEE/CVF International Conference on Computer Vision (ICCV) 2023: ICCV AV4D Workshop
[Paper]

MISAR: A Multimodal Instructional System with Augmented Reality
IEEE/CVF International Conference on Computer Vision (ICCV) 2023: ICCV AV4D Workshop
[Paper][Code][Video]

Relation Discovery in Nonlinearly Related Large-scale Settings
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
[Paper][Code]

Leveraging Pre-Images to Discover Nonlinear Relationships in Multivariate Environments
European Signal Processing Conference (EUSIPCO) 2021
[Paper]

Large-scale Nonlinear Granger Causality for Inferring Directed Dependence from Short Multivariate Time-series Data
Scientific Reports, Nature Publishing Group (Nature) 2021
[Paper][Code]
