](https://deep-paper.org/en/paper/file-2984/images/cover.png)
Unraveling Intent: How Causal Inference and Disentanglement Improve Multimodal AI
In human communication, what we say is often less important than how we say it. A phrase like “Great job” can be a genuine compliment or a sarcastic critique depending on the speaker’s tone of voice and facial expression. For Artificial Intelligence, telling these readings apart is the central goal of Multimodal Intent Detection. To build systems that truly understand us, whether a customer service bot or a smart home assistant, we need models that can process text, audio, and video simultaneously. While recent advances have improved how these modalities are fused, two significant problems remain: ...