
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Jiaming Li1,2* Chenyu Zhu1* Nanxi Yi1 Youjun Bao2 Li Sun2 Quanying Lv2 Xiang Fang3 Daizong Liu4 Jianjun Li1 Kun He1 Bowen Zhou5 Zhiyuan Ma1+
1Huazhong University of Science and Technology 2Kuaishou Technology 3Nanyang Technological University 4Wuhan University 5Tsinghua University
* Equal contribution + Corresponding author
Qualitative Comparison

Prompt fidelity without collapsing the sample set

The comparisons below are taken from the paper's figures. Each row pairs TMPO samples with the corresponding baseline outputs so that differences in diversity, spatial layout, and text rendering are immediately visible.

01 / Compositional Diversity

Multiple valid layouts stay alive.

Figure: TMPO samples 1-3 shown beside baseline samples 4-6.
02 / Text Rendering

Readable signs vary without drifting off prompt.

Figure: TMPO samples 7-9 shown beside baseline samples 10-12.
03 / Preference Alignment

Reward improves while image families remain broad.

Figure: TMPO samples 13-15 shown beside baseline samples 16-18.
Abstract

Reward Distribution Matching for Diffusion Alignment

TMPO replaces scalar reward maximization with trajectory-level reward distribution matching. Instead of concentrating probability on a few high-reward denoising paths, it matches policy probabilities over a group of trajectories to a reward-induced Boltzmann distribution.
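
In symbols, one way to read this paragraph is the following sketch. The notation is assumed for illustration and is not taken from the paper: K trajectories \tau_i sampled for one prompt, terminal rewards r_i, a temperature \beta, and a policy \pi_\theta. The exact objective used by TMPO may differ in its details.

```latex
% Reward-induced Boltzmann target over a group of K trajectories
p^{*}_i = \frac{\exp(r_i / \beta)}{\sum_{j=1}^{K} \exp(r_j / \beta)}

% Policy probabilities normalized within the same group
q_{\theta,i} = \frac{\pi_\theta(\tau_i)}{\sum_{j=1}^{K} \pi_\theta(\tau_j)}

% Distribution matching written in forward-KL form
\mathcal{L}(\theta) = \mathrm{KL}\left(p^{*} \,\|\, q_\theta\right)
  = \sum_{i=1}^{K} p^{*}_i \left( \log p^{*}_i - \log q_{\theta,i} \right)
```

Because both distributions are normalized inside the group, any global normalizing constant cancels. This is one plausible reading of why the objective can be called partition-free.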

The resulting Softmax Trajectory Balance objective inherits the mode-covering behavior of forward KL, preserving coverage over acceptable trajectories while still improving reward. Dynamic Stochastic Tree Sampling shares denoising prefixes and branches at scheduled steps, reducing redundant computation for large flow-matching models.
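
The mode-covering claim can be checked on a toy example. The sketch below is not the TMPO objective; it compares forward and reverse KL on hand-picked three-outcome distributions. All numbers are illustrative assumptions.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy target with two equally good modes and one poor mode (illustrative values).
target = [0.48, 0.48, 0.04]

# A collapsed policy puts nearly all mass on one good mode;
# a covering policy spreads mass over both good modes.
collapsed = [0.98, 0.01, 0.01]
covering = [0.45, 0.45, 0.10]

forward_collapsed = kl(target, collapsed)  # KL(target || policy)
forward_covering = kl(target, covering)
reverse_collapsed = kl(collapsed, target)  # KL(policy || target)
reverse_covering = kl(covering, target)

print(f"forward KL: collapsed={forward_collapsed:.3f}, covering={forward_covering:.3f}")
print(f"reverse KL: collapsed={reverse_collapsed:.3f}, covering={reverse_covering:.3f}")
```

In this toy setting the forward KL assigns the collapsed policy a larger penalty than the reverse KL does. That is the behavior the paragraph above attributes to forward-KL-style matching.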

Qualitative diversity comparison between TMPO and Flow-GRPO
TMPO preserves compositional, spatial-layout, and semantic diversity while improving reward alignment.
Method

Softmax-TB + Dynamic Tree Rollouts

For each prompt, TMPO samples a shared-prefix trajectory tree, scores terminal images, and optimizes a partition-free distribution matching objective over the observed trajectory group.

01 / Trajectory Groups

Generate K trajectories from the same prompt so reward and policy probabilities can be normalized within the group.

02 / Boltzmann Target

Convert terminal rewards into a softmax target, sharpening preference while retaining multiple valid modes.

03 / Forward-KL Advantage

Use a log-ratio advantage that penalizes under-covered positive-reward modes instead of chasing only the top sample.

04 / Prefix Sharing

Branch dynamically across denoising steps so large-scale FLUX training avoids redundant full rollouts.
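
Steps 01 to 03 above can be sketched for a single prompt as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `softmax_matching`, the temperature `beta`, and the example numbers are hypothetical. Per-trajectory log-probabilities are assumed to be already summed over denoising steps.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]

def softmax_matching(rewards, traj_logps, beta=1.0):
    """Return (target, log-ratio advantages, forward KL) for one trajectory group.

    rewards:    terminal rewards r_i for K trajectories of one prompt
    traj_logps: log pi_theta(tau_i), summed over denoising steps (assumed given)
    beta:       hypothetical temperature of the Boltzmann target
    """
    log_target = log_softmax([r / beta for r in rewards])  # step 02: Boltzmann target
    log_policy = log_softmax(traj_logps)                    # step 01: group-normalized policy
    target = [math.exp(lt) for lt in log_target]
    # Step 03: positive where the policy under-covers a trajectory relative to the target.
    advantages = [lt - lp for lt, lp in zip(log_target, log_policy)]
    forward_kl = sum(p * a for p, a in zip(target, advantages))
    return target, advantages, forward_kl

rewards = [0.9, 0.8, 0.1]        # illustrative rewards, not from the paper
traj_logps = [-2.0, -6.0, -3.0]  # illustrative log-probabilities
target, advantages, forward_kl = softmax_matching(rewards, traj_logps, beta=0.5)
```

In this example the second trajectory has a high reward but low policy probability, so it receives the largest positive log-ratio. Adding a constant to every entry of `traj_logps` leaves the result unchanged, which illustrates the partition-free property of group normalization.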

Overview of the TMPO framework
Framework overview: tree sampling produces 27 terminal trajectories, then Softmax-TB matches reward and policy distributions.
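
To see why prefix sharing saves compute, the sketch below counts denoising-step evaluations for a tree that ends in 27 leaves. The number of steps, the fixed branching schedule, and the branching factor of 3 are assumptions chosen so that 3 x 3 x 3 = 27. The source states only the leaf count, and the paper's dynamic schedule is not reproduced here.

```python
def tree_step_count(num_steps, branch_steps, branch_factor):
    """Count denoising-step evaluations for a shared-prefix tree.

    A branch at step t means every live node spawns `branch_factor` children
    before step t is executed. Returns (leaves, total step evaluations).
    """
    live = 1
    total = 0
    for t in range(num_steps):
        if t in branch_steps:
            live *= branch_factor
        total += live  # one denoising step per live trajectory prefix
    return live, total

num_steps = 10            # hypothetical number of denoising steps
branch_steps = {2, 5, 8}  # hypothetical branching schedule
leaves, tree_cost = tree_step_count(num_steps, branch_steps, branch_factor=3)
independent_cost = leaves * num_steps  # cost of 27 full rollouts without sharing
```

Under these assumed settings the tree needs 92 step evaluations for 27 leaves, compared with 270 for independent rollouts. Branching later increases the savings but leaves fewer steps for the branches to diverge.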
Results

Best Reward-Diversity-Efficiency Trade-Off

Across FLUX.1-dev alignment settings, TMPO obtains the strongest diversity metrics while remaining competitive with, or ahead of, the baselines on downstream rewards and reducing per-iteration time.

0.949 GenEval accuracy

Best compositional generation score under GenEval-only training.

24.277 PickScore

Best human-preference reward in PickScore-only alignment.

0.204 LGMD diversity

Positive latent-space diversity where reward-maximizing baselines collapse.

68.3 s iteration time

Faster than Flow-GRPO, MixGRPO, TreeGRPO, and GARDO in preference alignment.

Reward diversity and efficiency analysis
TMPO lies on the favorable Pareto frontier for reward, diversity, and iteration time.
Pareto analysis plot: reward-efficiency comparison against GRPO-style alignment methods.
GenEval training curve: compositional generation.
OCR training curve: visual text rendering.
PickScore training curve: human preference alignment.
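
The Pareto-frontier statement above can be made concrete with a small dominance check over the three axes of reward, diversity, and iteration time. The method names and values below are placeholders for illustration only; they are not results from the paper.

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one.

    Each point is (reward, diversity, iteration_time): higher reward and
    diversity are better, lower iteration time is better.
    """
    at_least = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return at_least and strictly

def pareto_front(points):
    """Return the names of points that no other point dominates."""
    return [
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]

# Placeholder values for illustration only; these are not results from the paper.
points = {
    "method_a": (0.90, 0.20, 70.0),
    "method_b": (0.92, 0.05, 120.0),
    "method_c": (0.85, 0.04, 130.0),
}
front = pareto_front(points)
```

A method sits on the frontier when no competitor is at least as good on all three axes and strictly better on one. Here `method_c` is dominated by `method_a` and drops out.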
Citation

TMPO: Trajectory Matching Policy Optimization

Please cite the paper if you build on the trajectory-level reward distribution matching objective, Dynamic Stochastic Tree Sampling, or the diffusion alignment experiments.

@article{li2026tmpo,
  title={TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment},
  author={Li, Jiaming and Zhu, Chenyu and Yi, Nanxi and Bao, Youjun and Sun, Li and Lv, Quanying and Fang, Xiang and Liu, Daizong and Li, Jianjun and He, Kun and Zhou, Bowen and Ma, Zhiyuan},
  journal={Preprint},
  year={2026}
}