Less is More: How Sparsity Solves the Complexity of Music Audio-Visual QA
Introduction

Imagine standing in the middle of a crowded jazz club. The drummer is keeping a complex beat, the bassist is walking through a progression, the pianist is improvising, and the crowd is murmuring. If someone asked you, “How many instruments are playing?” or “Is the saxophone playing right now?”, your brain wouldn’t process every single photon of light or every microsecond of sound pressure. Instead, you would filter out the noise. You would focus on key visual cues (the glint of the saxophone, the movement of the drummer’s sticks) and isolate specific audio frequencies. You intuitively discard the redundancy to answer the question. ...