Introduction
Imagine asking an AI to generate an animation of a person “walking forward.” By today’s standards, this is a solved problem. Modern diffusion models can generate a realistic walk cycle in seconds. But what happens if you increase the complexity? What if you ask for a person “walking forward AND waving both hands, but NOT turning around”?
This is where standard generative models often stumble. Humans are masters of composition. We can effortlessly blend simple concepts—walking, waving, looking left—into a single, coherent behavior. We can also understand negative constraints (what not to do) just as easily as positive ones.
For Artificial Intelligence, specifically Latent Diffusion Models (LDMs), this type of composition is notoriously difficult. While LDMs are incredibly efficient at generating high-fidelity motion, they often struggle to combine multiple semantic concepts or adhere to strict negative constraints without distorting the motion or ignoring parts of the prompt.
In this post, we are diving deep into EnergyMoGen, a novel framework presented by researchers from the University of Technology Sydney and Zhejiang University. This paper proposes a fascinating shift in perspective: viewing diffusion models through the lens of Energy-Based Models (EBMs). By treating motion generation as an energy minimization problem, EnergyMoGen achieves state-of-the-art results in composing complex human motions, handling both “conjunction” (doing A and B) and “negation” (doing A but not B).

As illustrated above, the goal is to perform arithmetic on behavior—adding actions together or subtracting specific traits—to generate rich, complex animations.
The Core Problem: Why Composition is Hard
To understand why EnergyMoGen is necessary, we first need to understand the limitations of current motion generation tech.
Most state-of-the-art methods rely on Latent Diffusion Models (LDMs). Instead of generating the raw coordinates of a skeleton frame-by-frame (which is computationally expensive), LDMs compress motion into a lower-dimensional “latent space.” The diffusion process happens in this compressed space, making it fast and efficient.
However, this compression comes at a cost. In a raw skeleton model, “waving” corresponds to specific joints moving in 3D space. In latent space, “waving” is just a dense vector of numbers. There is no explicit spatial correspondence. This makes it incredibly difficult to “paste” a waving motion onto a walking motion using traditional blending techniques.
Furthermore, standard diffusion models typically use a single latent vector (or a fixed sequence) to represent a whole motion. When you feed a complex prompt like “walking and drinking,” the model tries to map that entire complex sentence to a single distribution. Often, one concept dominates, or the resulting motion is a weird, blurry hybrid that fails to capture the distinct details of both actions.
Background: Energy-Based Models (EBMs)
The researchers tackled this by revisiting a concept with a long history in physics and machine learning: Energy.
In the context of generative modeling, an “energy function” defines a potential field. Think of a landscape with hills and valleys.
- Low Energy (Valleys): Represents data states that are desirable, realistic, and match our text description.
- High Energy (Hills): Represents unrealistic, distorted, or mismatched data.
The goal of generation is to roll a ball down the hill—to iteratively update the data until it settles into a low-energy valley. Mathematically, the probability density of a data sample \(X\) is defined by the Boltzmann distribution:
\[
p_{\theta}(X) = \frac{\exp\left(-E_{\theta}(X)\right)}{Z(\theta)}, \qquad Z(\theta) = \int \exp\left(-E_{\theta}(X)\right)\, dX
\]
Here, \(E_{\theta}(X)\) is the energy function. The lower the energy, the higher the probability \(p_{\theta}(X)\) that the sample is valid.
The key insight of this paper is that Diffusion Models can be interpreted as Energy-Based Models. When a diffusion model removes noise from an image or motion sequence, it is essentially calculating the gradient (the slope) of the data distribution to move towards higher probability (lower energy).
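To make the energy view concrete, here is a tiny self-contained sketch (NumPy, with a made-up double-well energy landscape rather than anything from the paper): Boltzmann weighting assigns the highest probability to the valleys, and following the negative energy gradient — the "score" — rolls a sample down into one.

```python
import numpy as np

# Toy 1-D example (illustrative only, not from the paper):
# a double-well energy landscape with valleys at x = -2 and x = +2.
def energy(x):
    return 0.05 * (x**2 - 4.0)**2

def energy_grad(x):
    # d/dx of the energy above
    return 0.2 * x * (x**2 - 4.0)

# Boltzmann weighting: lower energy -> higher (unnormalized) probability.
xs = np.linspace(-4, 4, 9)
probs = np.exp(-energy(xs))
probs /= probs.sum()

# Noiseless "Langevin" descent: follow the negative energy gradient
# (the score) until the sample settles into a low-energy valley.
x = 3.5
for _ in range(500):
    x -= 0.1 * energy_grad(x)

print(round(x, 2))  # settles in the valley at x = 2
```

Denoising in a diffusion model plays the same role as the descent loop above: each step nudges the sample toward a higher-probability (lower-energy) region.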
The EnergyMoGen Method
EnergyMoGen isn’t just a single model; it’s a framework that combines different ways of viewing “energy” to achieve the best possible motion. The authors break this down into two “spectrums” of Energy-Based Models and then fuse them together.

Let’s break down the architecture shown in Figure 2.
- Motion VAE (Part a): Compresses 3D motion into latent vectors \(z\).
- Latent Diffusion Model (Part b): A transformer-based network that learns to denoise these latent vectors, conditioned on text.
- Compositional Generation (Part c): This is where the magic happens. Instead of just running the diffusion model normally, the authors manipulate the process using two distinct energy approaches.
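As a rough shape-level sketch of this three-part pipeline — all dimensions are stand-ins except the 263-D per-frame feature vector HumanML3D uses, and the random matrices and toy "denoiser" below substitute for the paper's trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 60 frames, 263-D HumanML3D features, 256-D latent.
FRAMES, FEATS, LATENT_DIM, STEPS = 60, 263, 256, 50

# (a) Motion VAE: compress a motion sequence into a single latent vector z.
W_enc = rng.normal(size=(FRAMES * FEATS, LATENT_DIM)) / np.sqrt(FRAMES * FEATS)
motion = rng.normal(size=(FRAMES, FEATS))
z = motion.reshape(-1) @ W_enc             # encode: (60*263,) -> (256,)

# (b) Latent diffusion: start from noise and iteratively denoise z_t,
# conditioned on a text embedding c (toy stand-in for the transformer).
def denoiser(z_t, t, c):
    return 0.9 * z_t + 0.1 * c             # not the real network

c = rng.normal(size=LATENT_DIM)
z_t = rng.normal(size=LATENT_DIM)
for t in reversed(range(STEPS)):
    z_t = z_t - 0.1 * denoiser(z_t, t, c)  # remove a fraction of predicted noise

# (c) Decode the denoised latent back into a motion sequence.
W_dec = rng.normal(size=(LATENT_DIM, FRAMES * FEATS)) / np.sqrt(LATENT_DIM)
recon = (z_t @ W_dec).reshape(FRAMES, FEATS)
print(recon.shape)  # (60, 263)
```

The compositional machinery in part (c) of the figure intervenes inside the denoising loop, which is what the next two sections unpack.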
Spectrum 1: The Latent-Aware EBM
The first approach treats the denoising network itself as the energy function.
In a standard diffusion model, the network \(\epsilon_{\theta}\) predicts the noise to be removed. If we view the diffusion process as Langevin Dynamics (a physics-based sampling method), the “score” predicted by the network is proportional to the gradient of the energy function.
\[
\nabla_{z_t} \log p_{\theta}(z_t \mid c) \;\propto\; -\,\epsilon_{\theta}(z_t, t, c)
\]
This implies we can compose motions by simply combining the predicted noise (energy gradients) from different concepts.
Conjunction (AND): If we want a motion that satisfies concept \(c_1\) (e.g., “walk”) AND concept \(c_2\) (e.g., “wave”), we can sum their energy gradients. The paper formalizes this using a modified Classifier-Free Guidance equation:
\[
\hat{\epsilon}_{\theta}(z_t, t, c_1, \dots, c_n) = \epsilon_{\theta}(z_t, t) + \sum_{i=1}^{n} w_i \left( \epsilon_{\theta}(z_t, t, c_i) - \epsilon_{\theta}(z_t, t) \right)
\]
Here, the model predicts the noise for the combined concepts by taking the unconditional noise and adding the weighted guidance from each individual concept \(c_i\). This pushes the latent vector towards a region that satisfies all concepts simultaneously.
Negation (NOT): What if we want “jumping” but NOT “forward”? We can subtract the energy gradient of the unwanted concept.
\[
\hat{\epsilon}_{\theta}(z_t, t, c_i, c_j) = \epsilon_{\theta}(z_t, t, c_i) + w \left( \epsilon_{\theta}(z_t, t, c_i) - \epsilon_{\theta}(z_t, t, c_j) \right)
\]
By subtracting the gradient of concept \(c_j\) (the negative constraint) relative to concept \(c_i\) (the positive base), the model is steered away from the “forward” motion while keeping the “jump.”
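Both operations reduce to arithmetic on the predicted noise vectors. A minimal sketch, with a toy stand-in for the denoiser and random vectors for the concept embeddings (none of this is the paper's trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # latent dimension (hypothetical)

# Toy stand-in for the denoiser eps_theta(z_t, c): a deterministic
# function of the latent and an optional concept embedding.
def eps_theta(z_t, c=None):
    base = np.tanh(z_t)
    return base if c is None else base + 0.5 * np.tanh(c)

z_t = rng.normal(size=D)
c_walk = rng.normal(size=D)   # concept "walk" (hypothetical embedding)
c_wave = rng.normal(size=D)   # concept "wave"

# Conjunction (AND): unconditional prediction plus weighted guidance
# from each concept, mirroring the modified CFG equation.
w = 2.0
uncond = eps_theta(z_t)
eps_and = uncond + sum(
    w * (eps_theta(z_t, c) - uncond) for c in (c_walk, c_wave)
)

# Negation (NOT): steer away from an unwanted concept by subtracting
# its guidance relative to the positive base concept.
c_jump, c_forward = rng.normal(size=D), rng.normal(size=D)
eps_not = eps_theta(z_t, c_jump) + w * (
    eps_theta(z_t, c_jump) - eps_theta(z_t, c_forward)
)

print(eps_and.shape, eps_not.shape)
```

The composed noise vector then drops into the ordinary denoising step unchanged, which is why this approach needs no retraining.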
Pros & Cons of Latent-Aware:
- Pros: It produces very smooth, coherent motions because it operates directly on the motion latents.
- Cons: It sometimes suffers from “text misalignment.” The motion might look good physically but might miss specific semantic details requested in the text.
Spectrum 2: The Semantic-Aware EBM
To fix the misalignment issues, the authors introduce a second spectrum: interpreting Cross-Attention as an energy operation.
In transformer models, cross-attention is the mechanism where the motion features “look at” the text features to decide what to generate. The authors argue that high attention scores indicate a high compatibility (low energy) between the motion and the text.
Energy-Based Cross-Attention: Instead of just using the text embeddings \(c\) as static inputs, EnergyMoGen updates them during the generation process. It treats the cross-attention map as an energy function defining the alignment between text and motion.
By calculating the gradient of this energy with respect to the text embeddings, the model can iteratively refine the text inputs using Adaptive Gradient Descent (AGD).
\[
c \;\leftarrow\; c - \gamma\, \nabla_{c} E(c;\, z_t)
\]
This effectively “nudges” the text embedding to focus on the most relevant semantic parts of the prompt that align with the current motion state. To make this stable, the authors use a specific gradient formulation that balances attention maximization with regularization:
\[
\nabla_{c} E(c;\, z_t) \;=\; -\,\mathrm{softmax}\!\left(\frac{z_t c^{\top}}{\sqrt{d}}\right)^{\top} z_t \;+\; \lambda_{\mathrm{reg}}\, c
\]
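A dependency-free sketch of the idea: treat an attention-derived energy as a function of the text embeddings and refine them by gradient descent. The energy below (negative log of each motion token's best attention score, plus an L2 regularizer) is an illustrative choice, not the paper's exact formulation, and the gradient is computed numerically to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 64  # motion tokens, embedding dim (hypothetical sizes)

z = rng.normal(size=(T, D))   # motion features (queries)
c = rng.normal(size=(4, D))   # text token embeddings (keys)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative energy: high attention between motion and text = low energy;
# the L2 term keeps the embeddings from drifting too far.
def energy(c):
    attn = softmax(z @ c.T / np.sqrt(D))        # (T, 4) cross-attention map
    return -np.log(attn.max(axis=1)).sum() + 0.1 * (c**2).sum()

def num_grad(f, c, eps=1e-5):
    # finite-difference gradient, to avoid any autodiff dependency
    g = np.zeros_like(c)
    for idx in np.ndindex(c.shape):
        cp, cm = c.copy(), c.copy()
        cp[idx] += eps
        cm[idx] -= eps
        g[idx] = (f(cp) - f(cm)) / (2 * eps)
    return g

# Gradient-descent refinement of the text embeddings: each step nudges c
# toward lower energy, i.e. better text/motion alignment.
e0 = energy(c)
for _ in range(20):
    c = c - 0.02 * num_grad(energy, c)
print(energy(c) < e0)
```

In the real model the gradient comes analytically from the cross-attention layers and the step size is adapted per iteration; the loop above only shows the direction of the update.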
Pros & Cons of Semantic-Aware:
- Pros: Excellent text alignment. It captures fine-grained details in the prompt.
- Cons: It can cause “motion distortion.” Since it aggressively optimizes for semantic alignment, it might ignore physical constraints, leading to issues like foot sliding (where feet skate across the ground) or jittery movement.
The Solution: Synergistic Energy Fusion (SEF)
We have two methods:
- Latent-Aware: Good motion quality, weaker text adherence.
- Semantic-Aware: Great text adherence, weaker motion quality.
The brilliance of EnergyMoGen lies in Synergistic Energy Fusion (SEF). The authors propose combining these two energy terms, along with a standard multi-concept generation term, into a single unified update step.
\[
\epsilon_{\mathrm{SEF}} \;=\; \lambda_1\, \epsilon_{\theta}^{l} \;+\; \lambda_2\, \epsilon_{\theta}^{s} \;+\; \lambda_3\, \epsilon_{\theta}
\]
In this equation:
- \(\epsilon_{\theta}^l\): The Latent-Aware term (smoothness).
- \(\epsilon_{\theta}^s\): The Semantic-Aware term (text details).
- \(\epsilon_{\theta}\): A standard term for the combined text prompt.
- \(\lambda\): Weighting hyperparameters that balance the three.
By tuning these weights, EnergyMoGen achieves the “best of both worlds”—physically plausible motion that faithfully adheres to complex, multi-part text prompts.
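The fusion step itself is just a weighted blend of the three noise predictions at each denoising step. A minimal sketch, with random stand-ins for the predictions and hypothetical lambda values:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # latent dimension (hypothetical)

# Stand-ins for the three noise predictions at one denoising step
# (in the real model these come from the transformer denoiser).
eps_latent   = rng.normal(size=D)  # latent-aware composition  (smoothness)
eps_semantic = rng.normal(size=D)  # semantic-aware composition (text detail)
eps_joint    = rng.normal(size=D)  # standard prediction for the full prompt

def synergistic_fusion(eps_l, eps_s, eps_j, lam_l=0.4, lam_s=0.3, lam_j=0.3):
    """Weighted blend of the three energy terms. The lambda values here
    are hypothetical; in practice they are tuned per task."""
    return lam_l * eps_l + lam_s * eps_s + lam_j * eps_j

eps_fused = synergistic_fusion(eps_latent, eps_semantic, eps_joint)
print(eps_fused.shape)  # (256,)
```

Because the blend happens at every denoising step, shifting the weights trades off smoothly between motion quality and text adherence rather than switching between two separate models.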
Experimental Results
The researchers tested EnergyMoGen on three major benchmarks: HumanML3D, KIT-ML, and MTT (Multi-Track Timeline). They evaluated the model on tasks ranging from standard text-to-motion generation to complex compositional tasks.
Quantitative Analysis
First, let’s look at the standard text-to-motion generation results on HumanML3D.

The table above shows that EnergyMoGen (specifically the version marked with *) outperforms competitors like MLD, MotionDiffuse, and ReMoDiffuse.
- R-Precision (Top-1, Top-2, Top-3): Measures how accurately the generated motion matches the text. EnergyMoGen scores highest here, indicating superior semantic understanding.
- FID: Measures the distance between the distribution of generated motions and real motions. Lower is better. EnergyMoGen achieves a remarkably low FID of 0.188, beating most skeleton-based and latent-based models.
Similar dominance is seen on the KIT-ML dataset:

Again, EnergyMoGen achieves the best R-Precision and Fidelity scores, proving its robustness across different datasets.
Evaluating Composition (The Real Test)
The true test of this paper is the MTT dataset, which is designed to test multi-concept generation (e.g., “A person walks and waves”).

Table 3 reveals the impact of the different components:
- Ours (Latent only): Good transition distance (smoothness) but lower R-Precision.
- Ours (Semantic only): High R-Precision but worse transition distance (jittery).
- Ours + SEF (Synergistic Energy Fusion): This combination achieves the highest scores across the board. It maintains the high accuracy of the semantic model while keeping the motion smooth like the latent model.
Qualitative Visualization
Numbers are great, but in computer animation, seeing is believing.

In Figure 3, the prompt is “sits down in a chair and then gets back up.”
- MLD & FineMoGen: Fail to generate the “sitting” action clearly.
- ReMoDiffuse: Fails to generate the “getting back up” part.
- EnergyMoGen: Successfully executes the full sequence—sit down, then stand up—matching the Ground Truth closely.
The model’s ability to handle logical operations is visually demonstrated in Figure 4 below.

- Panel (b) Negation: The composed prompt subtracts "sitting down" from the full description. The model effectively removes the sitting action while keeping the rest of the context.
- Panel (c) Mixed: “Squats AND turns right NOT stands back up.” The model generates a squat and turn, but the figure stays down, adhering to the negative constraint.
Understanding the Energy Landscape
To prove that their “Energy” theory isn’t just a metaphor, the authors visualized the latent distributions.

Figure 5 is fascinating. It compares the energy contour maps of motions generated via composition (adding/subtracting concepts) vs. motions generated from a single sentence containing all concepts.
- Row (a) Conjunction: The energy map of “Walk + Wave” (composed) looks almost identical to “Walk and Wave” (single text).
- Row (b) Negation: “Jump right” minus “Jump” results in an energy map that looks like “Walk right.”
The red highlighted regions show that the model is genuinely finding the same “low energy valleys” regardless of whether it reaches them through a single prompt or by mathematically combining different prompts.
Addressing Foot Sliding
One specific artifact mentioned earlier was “foot sliding”—a common issue where an AI character looks like they are moonwalking on ice.

The ablation study in Table 11 confirms the theory behind Synergistic Energy Fusion. The “Semantic-only” model has a high PFC score (1.05), indicating significant sliding. The SEF model drops this to 0.51, which is even lower than the latent-only model, showing that the fusion strategy effectively grounds the motion physics.
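For intuition, foot sliding can be detected by checking whether a foot that is near the ground is still moving horizontally. The sketch below is an illustrative skating check of that kind, not the PFC metric used in the paper; the thresholds, frame rate, and y-up axis convention are all assumptions.

```python
import numpy as np

# Illustrative foot-sliding check (not the paper's PFC metric):
# flag frames where a foot is near the ground yet moving horizontally,
# i.e. the character appears to "skate".
def foot_slide_score(foot_pos, height_thresh=0.05, fps=20):
    """foot_pos: (frames, 3) world positions of one foot joint, y-up."""
    vel = np.diff(foot_pos, axis=0) * fps                 # per-frame velocity
    horiz_speed = np.linalg.norm(vel[:, [0, 2]], axis=1)  # drop vertical axis
    grounded = foot_pos[1:, 1] < height_thresh            # foot near ground?
    if not grounded.any():
        return 0.0
    return float(horiz_speed[grounded].mean())

# Synthetic example: a foot planted on the ground but drifting forward.
frames = 20
sliding = np.zeros((frames, 3))
sliding[:, 2] = np.linspace(0.0, 0.5, frames)  # drifts 0.5 m while grounded

planted = np.zeros((frames, 3))                # truly planted foot

print(foot_slide_score(sliding) > foot_slide_score(planted))  # True
```

A lower score means the grounded foot moves less, which is the same direction of improvement the PFC numbers in the ablation show.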
CompML: Extending the Data
Finally, the authors demonstrate that EnergyMoGen is so effective it can be used for Data Augmentation. They created a new dataset called CompML by using their model to synthesize 5,000 new complex motions from combined text prompts.
When they fine-tuned their model on this synthetic data, performance on the HumanML3D test set improved even further (refer back to Table 1, the bottom row “EnergyMoGen (CompML)”). This suggests a virtuous cycle: better compositional models can generate training data to build even better models.
Conclusion
EnergyMoGen represents a significant step forward in human motion generation. By formalizing the diffusion process as an energy minimization problem, the authors have unlocked a flexible way to perform “arithmetic” on human behavior.
The key takeaways are:
- Duality of Energy: Utilizing both Latent-Aware (smoothness) and Semantic-Aware (accuracy) energy terms is superior to using either alone.
- Synergistic Fusion: Blending these energies allows for precise control over complex composite prompts involving AND and NOT logic.
- Generality: This framework applies to latent diffusion models, making it efficient and scalable compared to skeleton-based approaches.
For students and researchers in computer animation and generative AI, EnergyMoGen highlights the power of revisiting fundamental concepts—like energy potentials—to solve modern deep learning challenges. As we move towards more interactive and controllable avatars in gaming and the metaverse, techniques like this that allow for precise, logical composition of movement will be essential.