[Making Large Language Models Better Reasoners with Orchestrated Streaming Experiences 🔗](https://arxiv.org/abs/2504.00473)

RoSE: How LLMs Can Self-Improve Through Orchestrated Streaming Experiences

Introduction Imagine a student preparing for a difficult mathematics exam. They don’t just memorize formulas; they work through practice problems. When they solve a problem correctly, they remember the logic they used. Later, when they face a similar but new problem, they recall that successful logic to guide them. This process—accumulating experiences, filtering out the mistakes, and recalling the most relevant and complex solutions—is fundamental to human learning. However, off-the-shelf Large Language Models (LLMs) often lack this dynamic “experiential” capability in their standard deployment. They are typically static. You prompt them, they answer, and the interaction ends. If they solve a problem brilliantly, that “thought process” usually evaporates once the session closes. ...
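Since the excerpt describes the mechanism only by analogy, here is a hedged Python sketch of the general idea: keep the solutions that worked, drop the ones that failed, and recall the most similar past experience when a new problem arrives. The helper names and the string-similarity measure are illustrative assumptions, not RoSE's actual retrieval design.

```python
from difflib import SequenceMatcher

# Experience pool that grows as problems are solved over time (illustrative only).
experience_pool = []

def add_experience(problem, solution, was_correct):
    """Keep only the experiences that led to a correct answer."""
    if was_correct:
        experience_pool.append((problem, solution))

def recall_most_similar(new_problem):
    """Return the stored (problem, solution) pair most similar to the new problem."""
    if not experience_pool:
        return None
    return max(
        experience_pool,
        key=lambda exp: SequenceMatcher(None, exp[0], new_problem).ratio(),
    )

add_experience("Solve 2x + 3 = 7", "Subtract 3, then divide by 2: x = 2", was_correct=True)
add_experience("Solve x^2 = -1 over the reals", "x = 1", was_correct=False)  # filtered out
print(recall_most_similar("Solve 3x + 5 = 11"))
```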

2025-04 · 10 min · 1976 words
[Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training 🔗](https://arxiv.org/abs/2406.17404)

Speeding Up LLMs for Free: The "Make Some Noise" Training Framework

The capabilities of Large Language Models (LLMs) like GPT-4 and LLaMA have revolutionized artificial intelligence. However, if you have ever watched an LLM generate a response, you have likely noticed a fundamental bottleneck: the text appears one word at a time, like a slow typist. This sluggishness is due to the Auto-Regressive (AR) decoding paradigm. To generate the 100th token, the model strictly needs the previous 99 tokens. This sequential dependency prevents parallel processing during generation, leaving powerful GPUs idling while they wait for the next token to be decided. ...
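To make the sequential dependency concrete, here is a toy greedy-decoding loop, a minimal sketch rather than the paper's method; the `toy_model` callable is a stand-in for any causal language model.

```python
# Minimal sketch of why auto-regressive decoding is sequential: every new token
# requires a forward pass over all tokens generated so far, so steps cannot overlap.

def greedy_decode(model, prompt_ids, max_new_tokens=5):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(tokens)  # must wait for every previous token to exist
        next_token = max(range(len(logits)), key=lambda t: logits[t])
        tokens.append(next_token)
    return tokens

# Deterministic toy "model" returning pseudo-logits so the sketch runs end to end.
def toy_model(tokens, vocab_size=10):
    return [((sum(tokens) + 1) * (t + 3)) % 7 for t in range(vocab_size)]

print(greedy_decode(toy_model, prompt_ids=[1, 2, 3]))
```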

2024-06 · 10 min · 2027 words
[Major Entity Identification: A Generalizable Alternative to Coreference Resolution 🔗](https://aclanthology.org/2024.emnlp-main.652.pdf)

Why We Should Stop Clustering and Start Identifying: A New Approach to Coreference

Imagine you are analyzing the novel Aladdin. You want to track every time the text refers to the protagonist, whether by his name (“Aladdin”), a nominal phrase (“the boy”), or a pronoun (“he”). In Natural Language Processing (NLP), this is classically handled by Coreference Resolution (CR). The goal of CR is to find all mentions in a text and cluster them together based on which entity they refer to. It sounds straightforward, but in practice, CR is notoriously difficult. Models trained on news articles often fail spectacularly when applied to literature or medical texts. Why? Because they get bogged down trying to resolve everything, including insignificant background characters or abstract concepts, often disagreeing on what even counts as a “mention.” ...
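For a concrete sense of the contrast, here is an illustrative toy data structure (not the paper's code) comparing classic coreference clusters with the Major Entity Identification framing, where the entity of interest is given up front.

```python
# Toy illustration: classic coreference resolution clusters every mention of
# every entity, while Major Entity Identification (MEI) starts from a given
# major entity and only identifies the mentions that refer to it.

text = "Aladdin found an old lamp. The boy rubbed it, and he was amazed."

# Coreference resolution output: one cluster per entity, major or minor.
coreference_clusters = [
    ["Aladdin", "The boy", "he"],  # the protagonist
    ["an old lamp", "it"],         # a background object CR must still resolve
]

# MEI framing: the major entity is an input; the output is its mention set.
major_entity = "Aladdin"
mentions_of_major_entity = ["Aladdin", "The boy", "he"]
print(major_entity, mentions_of_major_entity)
```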

7 min · 1439 words
[MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension 🔗](https://arxiv.org/abs/2409.13609)

How MaPPER Makes Visual Grounding Efficient: A Deep Dive into Prior-Guided Tuning

Introduction Imagine you are looking at a crowded photograph of a street scene. A friend stands beside you and says, “Look at the guy in the yellow shirt standing near the bike.” Instantly, your brain processes the language, scans the image, filters out the “guys in blue shirts” and “guys near cars,” and locks onto the specific target. In computer vision, this task is known as Referring Expression Comprehension (REC). The goal is to ground a specific region in an image based on a natural language description. While this sounds intuitive to humans, it is a complex challenge for AI. It requires a model to possess strong visual perception, deep linguistic understanding, and—most importantly—the ability to align these two modalities perfectly. ...

2024-09 · 9 min · 1771 words
[MTLS: Making Texts into Linguistic Symbols 🔗](https://aclanthology.org/2024.emnlp-main.206.pdf)

Beyond Vocabulary: How Teaching AI to "See" Words Unlocks Multilingual Powers

Language is a strange thing. If you speak English, the word “Love” is a familiar sequence of four letters. If you speak Greek, “αγάπη” carries the same emotional weight but looks completely different. If you speak Chinese, “爱” is a distinct logogram. ...

4 min · 1957 words
[MTA4DPR: Multi-Teaching-Assistants Based Iterative Knowledge Distillation for Dense Passage Retrieval 🔗](https://aclanthology.org/2024.emnlp-main.336.pdf)

Why One Teacher Isn't Enough: Boosting Dense Retrieval with Multi-Assistant Distillation

Introduction In the world of Information Retrieval (IR), we are constantly balancing a difficult trade-off: accuracy versus speed. We want our search engines to understand the nuance of human language like a massive Large Language Model (LLM), but we need them to return results in milliseconds like a simple keyword search. Dense Passage Retrieval (DPR) has emerged as a powerful solution, using deep learning to represent queries and documents as vectors. However, the most accurate DPR models are often massive, computationally expensive, and slow to deploy. To solve this, researchers turn to Knowledge Distillation (KD)—a technique where a small, fast “student” model learns to mimic a large, slow “teacher” model. ...
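As a rough illustration of score-level distillation for dense retrieval, here is a short PyTorch sketch: a generic KD objective in which the student matches the teacher's distribution over candidate passages. It is not the specific MTA4DPR training recipe, and all shapes are made up for the example.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_scores, teacher_scores, temperature=2.0):
    """Generic distillation loss over per-query passage scores, shape [batch, num_passages]."""
    s = F.log_softmax(student_scores / temperature, dim=-1)
    t = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Dense retrieval scores relevance as a dot product between embeddings.
queries = torch.randn(4, 128)        # 4 query vectors from the small student encoder
passages = torch.randn(4, 16, 128)   # 16 candidate passage vectors per query
student_scores = torch.einsum("bd,bnd->bn", queries, passages)
teacher_scores = torch.randn(4, 16)  # scores the large teacher assigns to the same candidates
print(kd_loss(student_scores, teacher_scores))
```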

8 min · 1657 words
[MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models 🔗](https://arxiv.org/abs/2401.16745)

Beyond the First Prompt: Evaluating How LLMs Handle Long Conversations with MT-Eval

Introduction We are currently living in the “golden age” of Large Language Models (LLMs). From drafting emails to generating code snippets, models like GPT-4 and Llama-2 have integrated themselves into our daily workflows. When we benchmark these models, however, we often treat them like search engines: we ask a single question, get a single answer, and grade the result. But is this how we actually use AI? In the real world, interaction is rarely a one-shot event. We chat. We ask for revisions. We change the topic slightly, then circle back to an earlier point. We ask the model to “remember what I said three messages ago.” This is the domain of multi-turn interaction, and it is a significantly harder challenge for an AI than answering a standalone query. ...

2024-01 · 9 min · 1752 words
[MSI-Agent: Incorporating Multi-Scale Insight into Embodied Agents for Superior Planning and Decision-Making 🔗](https://arxiv.org/abs/2409.16686)

Building Smarter Robots: How Multi-Scale Insights Solve the Memory Problem in Embodied AI

Introduction Imagine you are teaching a robot how to navigate a kitchen. On day one, you teach it how to make a salad. It learns a valuable lesson: “Use a bowl to hold the ingredients.” On day two, you ask the robot to water a plant. The robot, eager to apply its past knowledge, remembers the concept of a “bowl” and the action “fill with water.” However, because its memory is cluttered, it might erroneously try to “slice” the water or mix the plant with dressing because it associates bowls with cooking. ...

2024-09 · 9 min · 1759 words
[MQuinE: a cure for Z-paradox in knowledge graph embedding models 🔗](https://arxiv.org/abs/2402.03583)

The Z-Paradox: Why Your Knowledge Graph Model Might Be Hallucinating Connections (and How MQuinE Fixes It)

Introduction In the world of Artificial Intelligence, Knowledge Graphs (KGs) are the unsung heroes. They power the sidebar on your Google Search, drive product recommendations on Amazon, and help complex systems understand that “Paris” is the capital of “France.” To make these graphs useful for machine learning, we use Knowledge Graph Embedding (KGE) models. These models translate entities (like “Paris”) and relations (like “is capital of”) into mathematical vectors and matrices. ...
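To ground what "entities and relations become vectors" means, here is a tiny NumPy sketch of the classic TransE scoring idea (head + relation ≈ tail). It illustrates KGE scoring in general, not MQuinE's own scoring function, and the embeddings are random placeholders rather than trained vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
entity_vecs = {"Paris": rng.normal(size=dim), "France": rng.normal(size=dim)}
relation_vecs = {"is_capital_of": rng.normal(size=dim)}

def transe_score(head, relation, tail):
    # Higher (less negative) score = the triple looks more plausible,
    # because head + relation should land near tail in embedding space.
    return -np.linalg.norm(entity_vecs[head] + relation_vecs[relation] - entity_vecs[tail])

print(transe_score("Paris", "is_capital_of", "France"))
```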

2024-02 · 10 min · 1948 words
[MP2D: An Automated Topic Shift Dialogue Generation Framework Leveraging Knowledge Graphs 🔗](https://arxiv.org/abs/2403.05814)

Mastering the Art of Conversation: How MP2D Uses Knowledge Graphs to Teach AI Topic Shifts

Have you ever noticed how rigid most chatbots feel? You ask about the weather, and they tell you the forecast. You ask about a restaurant, and they give you a menu. But if you try to naturally segue from that restaurant to the history of the cuisine, and then perhaps to a famous chef who cooks that cuisine, the bot often stumbles. It loses context or treats the new topic as a completely isolated query. ...

2024-03 · 8 min · 1552 words
[MOSEL: Inference Serving Using Dynamic Modality Selection 🔗](https://arxiv.org/abs/2310.18481)

Less is More: Accelerating Multimodal AI with Dynamic Modality Selection (MOSEL)

Introduction We are living in the era of massive artificial intelligence. In recent years, deep learning models—particularly Transformers—have shattered records in computer vision and natural language processing. We have moved beyond simple image classifiers to complex multimodal systems capable of understanding video, audio, and text simultaneously. However, this capability comes at a steep price. Between 2012 and 2020, the computational requirements for state-of-the-art machine learning applications increased by a staggering factor of 1,000,000. While hardware has improved, it hasn’t kept pace with this exponential growth. This creates a massive bottleneck for inference serving—the process of actually running these models in production to generate predictions for users. ...

2023-10 · 11 min · 2195 words
[MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages 🔗](https://arxiv.org/abs/2410.01036)

Bridging the Gap: How MOSEL Brings True Open Source to Speech AI

Introduction In the rapidly evolving world of Artificial Intelligence, “Open Source” has become a buzzword. From Large Language Models (LLMs) to Speech Foundation Models (SFMs), developers and researchers are flooded with new releases claiming to be open. But if you scratch beneath the surface, a complex problem emerges: Open Washing. Many models release their weights (the trained parameters) but keep their training data and code proprietary. Even when data is released, it often comes with restrictive licenses—such as “Non-Commercial” or “No-Derivatives”—which strictly violates the definition of Open Source AI. For undergraduate and master’s students entering the field, this distinction is critical. You cannot build a truly open, community-driven, or commercial application if the foundation you are building on is legally shaky. ...

2024-10 · 8 min · 1629 words
[MORPHEUS: Modeling Role from Personalized Dialogue History by Exploring and Utilizing Latent Space 🔗](https://arxiv.org/abs/2407.02345)

Beyond Hardcoded Personas: How MORPHEUS Generates Personalized Dialogue Using Latent Space

Imagine chatting with a sophisticated AI. You ask, “What did you do this weekend?” and it replies, “I went hiking with my dog.” Ten minutes later, you mention you love nature, and it replies, “I hate the outdoors, I prefer video games.” The illusion breaks. The personality—or persona—is inconsistent. Building chatbots that maintain a consistent, engaging personality is one of the “holy grails” of Natural Language Processing (NLP). Traditionally, this is done by explicitly feeding the model a “role profile” (e.g., Name: Alex, Hobby: Hiking, Pet: Dog) every time it generates a response. But in the real world, we often don’t have access to these explicit profiles due to data scarcity or privacy concerns. We only have the dialogue history—the breadcrumbs of personality left behind in previous conversations. ...

2024-07 · 10 min · 2056 words
[MMOE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts 🔗](https://arxiv.org/abs/2311.09580)

Why Your Multimodal AI Misses the Joke: Introducing Mixtures of Multimodal Interaction Experts (MMOE)

Introduction Imagine scrolling through social media and seeing a photo of a chaotic, messy room with the caption, “Living my best life.” As a human, you immediately recognize the sarcasm. The image (mess) and the text (“best life”) contradict each other, and that contradiction creates the meaning. Now, consider how a standard Multimodal AI sees this. It processes the image, processes the text, and tries to align them. It might see “mess” and “best life” and simply get confused because the semantic content doesn’t overlap. This highlights a critical limitation in current Vision-Language Models (VLMs): they are excellent at identifying when images and text say the same thing (redundancy), but they struggle when the meaning emerges from the interaction between the two (synergy), such as in sarcasm or humor. ...

2023-11 · 8 min · 1593 words
[MMTE: Corpus and Metrics for Evaluating Machine Translation Quality of Metaphorical Language 🔗](https://arxiv.org/abs/2406.13698)

Lost in Translation: Why AI Struggles with Metaphors and How We Fix It

Introduction: The “Pink Elephant” in the Room Imagine you are trying to tell a friend that you were incredibly drunk last night. If you are speaking English, you might say you were “seeing pink elephants.” Now, imagine feeding that sentence into a translation engine to communicate with a friend in China. If the AI translates it literally, your Chinese friend might be confused about why you were at a zoo. In Chinese culture, a common metaphorical equivalent for being heavily drunk is “drunk as mud” (烂醉如泥). ...

2024-06 · 4 min · 1636 words
[MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model 🔗](https://arxiv.org/abs/2406.11193)

Inside the Mind of Multimodal Models: Tracking Domain-Specific Neurons with MMNeuron

Introduction How does a large language model (LLM) “see” an image? When we feed a photograph of a chest X-ray or a satellite view of a city into a Multimodal Large Language Model (MLLM) like LLaVA or InstructBLIP, we know the architecture: an image encoder breaks the visual into features, a projector maps them to the language space, and the LLM generates a response. But what happens in the hidden layers between that initial projection and the final answer? ...
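Before looking inside the hidden layers, it helps to pin down the pipeline the excerpt names. The sketch below uses small PyTorch stand-ins for the encoder, projector, and LLM; the shapes and module choices are assumptions for illustration, not LLaVA's or InstructBLIP's actual implementation.

```python
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    """Stand-ins for the three stages named above: encoder -> projector -> LLM."""
    def __init__(self, vision_dim=256, hidden_dim=512, vocab_size=1000):
        super().__init__()
        self.image_encoder = nn.Linear(3 * 32 * 32, vision_dim)  # stand-in for a ViT encoder
        self.projector = nn.Linear(vision_dim, hidden_dim)       # maps visual features to language space
        self.llm_head = nn.Linear(hidden_dim, vocab_size)        # stand-in for the decoder LLM

    def forward(self, image, text_embeds):
        visual_features = self.image_encoder(image.flatten(1))   # [batch, vision_dim]
        visual_tokens = self.projector(visual_features)          # [batch, hidden_dim]
        # MMNeuron's question lives here: what do individual hidden-layer
        # neurons encode once projected visual tokens mix with text?
        fused = visual_tokens + text_embeds
        return self.llm_head(fused)                              # next-token logits

model = ToyMLLM()
logits = model(torch.randn(2, 3, 32, 32), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 1000])
```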

2024-06 · 9 min · 1896 words
[MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance 🔗](https://arxiv.org/abs/2401.02906)

Blinded by the Light: Securing Multimodal AI Against Visual Jailbreaks with MLLM-Protector

Introduction: The New Vulnerability in Multimodal AI The rapid evolution of Artificial Intelligence has taken us from text-based Large Language Models (LLMs) like GPT-3 to Multimodal Large Language Models (MLLMs) like LLaVA and GPT-4V. These newer models possess the remarkable ability to “see”—they can process images alongside text to answer complex queries. This leap forward opens up endless applications, from medical imaging analysis to assisting the visually impaired. However, this added modality introduces a significant, often overlooked security flaw. While the AI community has spent years refining safety alignment for text—ensuring models refuse to generate hate speech or bomb-making instructions—the visual component acts as a backdoor. ...

2024-01 · 8 min · 1561 words
[MIPD: Exploring Manipulation and Intention In a Novel Corpus of Polish Disinformation 🔗](https://aclanthology.org/2024.emnlp-main.1103.pdf)

Beyond Fake News: Decoding the Intent and Manipulation Behind Disinformation

The term “fake news” has become a staple of modern vocabulary, but it is a clumsy instrument for a surgical problem. Disinformation isn’t just about truth versus fiction; it is about the intent to harm and the methods used to deceive. Whether it involves denying climate change or undermining public health during a pandemic, disinformation is a calculated effort to shift public perception. To combat this effectively, we need to understand not just what is being said, but why and how. This is the core problem addressed in a recent paper titled “MIPD: Exploring Manipulation and Intention In a Novel Corpus of Polish Disinformation.” ...

7 min · 1390 words
[MIND: Multimodal Shopping Intention Distillation from Large Vision-Language Models for E-commerce Purchase Understanding 🔗](https://arxiv.org/abs/2406.10701)

Why Did You Buy That? Decoding Shopping Intentions with Multimodal AI

Introduction Imagine walking into a store and buying a wireless mouse. A few minutes later, you pick up a solar-powered keyboard. To a human observer, the connection is obvious: you are likely setting up an eco-friendly, clutter-free home office. However, for traditional Artificial Intelligence systems in e-commerce, this connection is surprisingly difficult to make. Most existing systems rely solely on text—product titles and descriptions. When a text-based model sees “Orbit Trackball Mouse” and “Wireless Solar Keyboard,” it might correctly categorize them as “electronics,” but it often misses the nuanced intention behind the purchase. It fails to “see” that both items are white, ergonomic, and designed for a specific type of user. ...

2024-06 · 8 min · 1643 words
[MIBench: Evaluating Multimodal Large Language Models over Multiple Images 🔗](https://arxiv.org/abs/2407.15272)

Beyond the Single Frame: Why Multimodal LLMs Struggle with Multi-Image Scenarios

Introduction The rise of Multimodal Large Language Models (MLLMs) like GPT-4V, LLaVA, and mPLUG-Owl has revolutionized how Artificial Intelligence perceives the world. These models can describe photos, answer questions about diagrams, and even write code based on whiteboard sketches. However, there is a significant gap between these benchmark achievements and real-world utility. Most current benchmarks focus on single-image scenarios. The model is given one picture and asked a question. Yet, human visual consumption is rarely isolated to a single frame. When we browse a website, we integrate information from multiple product photos and textual descriptions. When we watch a tutorial, we follow a temporal sequence of steps. When we scroll through social media, we process interleaved text and images simultaneously. ...

2024-07 · 8 min · 1580 words