Ali Vosoughi
阿力
PhD Candidate, advised by Prof. Axel Wismueller and Prof. Chenliang Xu
University of Rochester
🤖 Agentic AI Systems 🎵 Computer Audition 🧠 Multimodal Reasoning 🎬 Multimodal Generation 🥽 Immersive Computing 🔍 Reasoning Verification 🎯 Reinforcement Learning 🚀 Large Action Models 🔊 Audio Generation 📹 Video Generation
📧 ali.vosoughi@rochester.edu
📍 CS Department, Wegmans Hall 3211
🍎 Apple
Machine Learning Intern
Agentic Multimodal AI
present
🎵 Smule AI
Research Scientist Intern
Spatial Audio Generation
Jun–Sep 2025
🏢 Microsoft Research
Research Intern
Audiovisual LLM
May–Aug 2024
🚗 Bosch AI Research
Research Intern
Audio LLM
Apr–Jul 2023
🛡️ DARPA PTG
Graduate Researcher
Autonomous AR Copilot
2022–present
🏆 First counterfactual audio methods
ICASSP’24 + US Patent US20250124292A1 (published Jan 2025)
🤝 Autonomous multimodal copilot
Real-time AR demonstrations (DARPA)
📊 VERIFY benchmark
Reasoning verification framework

Recent News & Updates

10/2024
🎤 Presented at SANE 2024, DeepMind Boston
10/2024
📄 ACM Multimedia 2024 paper accepted
08/2024
💼 Research presentation at Microsoft, Seattle
03/2024
📄 NAACL 2024 paper accepted
02/2024
📄 IEEE Transactions on Multimedia paper
08/2023
🎯 Two ICCV 2023 papers accepted

Publications

PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026
[Paper][Website]

VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Under review, 2026
[Paper][Website][🤗 Hugging Face]

Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
European Signal Processing Conference (EUSIPCO) 2025
[Paper][Website]

EAGLE: Egocentric AGgregated Language-video Engine
ACM International Conference on Multimedia (ACM MM) 2024
[Paper]

PW-VQA: Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA
IEEE Transactions on Multimedia (TMM) 2024
[Paper][Code][Website]

OSCaR: Object State Captioning and State Change Representation
North American Chapter of the Association for Computational Linguistics (NAACL) 2024
[Paper][Code]

Video Understanding with Large Language Models: A Survey
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 2025
[Paper][Code]

Learning Audio Concepts from Counterfactual Natural Language
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024
[Paper][Code][Patent]

AVSA-Sep: Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
IEEE/CVF International Conference on Computer Vision (ICCV) 2023: ICCV AV4D Workshop
[Paper]

MISAR: A Multimodal Instructional System with Augmented Reality
IEEE/CVF International Conference on Computer Vision (ICCV) 2023: ICCV AV4D Workshop
[Paper][Code][Video]

Relation Discovery in Nonlinearly Related Large-scale Settings
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
[Paper][Code]

Leveraging Pre-Images to Discover Nonlinear Relationships in Multivariate Environments
European Signal Processing Conference (EUSIPCO) 2021
[Paper]

Large-scale Nonlinear Granger Causality for Inferring Directed Dependence from Short Multivariate Time-series Data
Scientific Reports, Nature Publishing Group (Nature) 2021
[Paper][Code]

