TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Jiaming Li¹,²* Chenyu Zhu¹* Nanxi Yi¹ Youjun Bao² Li Sun² Quanying Lv² Xiang Fang³ Daizong Liu⁴ Jianjun Li¹ Kun He¹ Bowen Zhou⁵ Zhiyuan Ma¹⁺

¹Huazhong University of Science and Technology ²Kuaishou Technology ³Nanyang Technological University ⁴Wuhan University ⁵Tsinghua University

* Equal contribution ⁺ Corresponding author

Qualitative Comparison

Prompt fidelity without collapsing the sample set

These visual comparisons are taken directly from the paper. Each row pairs TMPO samples with the corresponding baseline outputs, making differences in diversity, spatial layout, and text rendering immediately visible.

01 / Compositional Diversity

Multiple valid layouts stay alive.

Figure: baseline qualitative comparison, samples 4 to 6.

02 / Text Rendering

Readable signs vary without drifting off prompt.

Figure: baseline qualitative comparison, samples 10 to 12.

03 / Preference Alignment

Reward improves while image families remain broad.

Figure: baseline qualitative comparison, samples 16 to 18.

Abstract

Reward Distribution Matching for Diffusion Alignment

TMPO replaces scalar reward maximization with trajectory-level reward distribution matching. Instead of concentrating probability on a few high-reward denoising paths, it matches policy probabilities over a group of trajectories to a reward-induced Boltzmann distribution.
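As a rough illustration of the idea, the sketch below builds a reward-induced Boltzmann target over one trajectory group and compares it with the policy probabilities renormalized within the same group. The function names, the temperature parameter, and the reward and log-probability values are all hypothetical and are not taken from the paper or its code; the exact TMPO objective may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_target(rewards, temperature=1.0):
    """Reward-induced Boltzmann distribution over one trajectory group.

    `temperature` is an assumed hyperparameter: lower values sharpen the target.
    """
    return softmax([r / temperature for r in rewards])

def group_policy_distribution(log_probs):
    """Policy probabilities renormalized within the sampled group."""
    return softmax(log_probs)

def forward_kl(target, policy):
    """KL(target || policy); entries with zero target mass contribute nothing."""
    return sum(p * math.log(p / q) for p, q in zip(target, policy) if p > 0)

# Hypothetical group of three trajectories for a single prompt.
rewards = [0.9, 0.8, 0.2]            # terminal rewards (made-up values)
log_probs = [-12.0, -15.0, -12.5]    # trajectory log-probabilities (made-up values)

target = boltzmann_target(rewards, temperature=0.5)
policy = group_policy_distribution(log_probs)
loss = forward_kl(target, policy)
```

Because both distributions are normalized over the sampled group only, no global partition function over all possible trajectories is needed in this sketch.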

The resulting Softmax Trajectory Balance objective inherits the mode-covering behavior of forward KL, preserving coverage over acceptable trajectories while still improving reward. Dynamic Stochastic Tree Sampling shares denoising prefixes and branches at scheduled steps, reducing redundant computation for large flow-matching models.
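The mode-covering behavior of forward KL can be checked on a toy example. The sketch below uses made-up discrete distributions (not results from the paper) to compare a policy that collapses onto one mode with a policy that covers both good modes.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical target with two good modes and one weak mode.
target = [0.45, 0.45, 0.10]
collapsed = [0.98, 0.01, 0.01]   # policy concentrated on a single mode
covering = [0.40, 0.40, 0.20]    # policy that keeps both good modes alive

# Forward KL: expectation under the target, so a dropped mode is punished heavily.
forward_collapsed = kl(target, collapsed)
forward_covering = kl(target, covering)

# Reverse KL: expectation under the policy, so a dropped mode costs comparatively little.
reverse_collapsed = kl(collapsed, target)
```

On these toy numbers the collapsed policy incurs a much larger forward KL than the covering policy, and its forward KL also exceeds its reverse KL, which is the asymmetry usually summarized as mode-covering versus mode-seeking.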

Qualitative diversity comparison between TMPO and Flow-GRPO — TMPO preserves compositional, spatial-layout, and semantic diversity while improving reward alignment.

Method

Softmax-TB + Dynamic Tree Rollouts

For each prompt, TMPO samples a shared-prefix trajectory tree, scores terminal images, and optimizes a partition-free distribution matching objective over the observed trajectory group.

Trajectory Groups

Generate K trajectories from the same prompt so reward and policy probabilities can be normalized within the group.

Boltzmann Target

Convert terminal rewards into a softmax target, sharpening preference while retaining multiple valid modes.

Forward-KL Advantage

Use a log-ratio advantage that penalizes under-coverage of positive-reward modes instead of chasing only the top sample.

Prefix Sharing

Branch dynamically across denoising steps so large-scale FLUX training avoids redundant full rollouts.
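The Boltzmann Target and Forward-KL Advantage steps above might be sketched as follows. This is one plausible reading, assuming the advantage of a trajectory is the log of its target probability minus the log of its group-normalized policy probability; the function names, the temperature, and the numbers are hypothetical, and the paper's exact definition may differ.

```python
import math

def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]

def log_ratio_advantages(rewards, log_probs, temperature=1.0):
    """Assumed form: A_i = log p_target(i) - log p_policy(i) within one group.

    A positive value means the policy under-covers trajectory i relative to
    the reward-induced target; a negative value means it over-covers it.
    """
    log_target = log_softmax([r / temperature for r in rewards])
    log_policy = log_softmax(log_probs)
    return [lt - lp for lt, lp in zip(log_target, log_policy)]

# Hypothetical group: the second trajectory has a high reward but low policy mass.
rewards = [0.9, 0.8, 0.2]
log_probs = [-12.0, -15.0, -12.5]
advantages = log_ratio_advantages(rewards, log_probs, temperature=0.5)
```

In this toy group the largest advantage goes to the second trajectory, which is high-reward but under-covered, rather than to the top-reward trajectory that the policy already favors.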

Overview of the TMPO framework: tree sampling produces 27 terminal trajectories, then Softmax-TB matches the policy distribution over them to the reward-induced target distribution.
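To see why prefix sharing saves computation, the sketch below counts denoising steps in a shared-prefix tree. It assumes the 27 terminal trajectories come from a branching factor of 3 applied at three branch points, and it uses a fixed, hypothetical 10-step schedule with branches at steps 0, 3, and 6; none of these settings are confirmed by the page, and the dynamic scheduling of branch steps is not modeled.

```python
def tree_step_count(num_steps, branch_steps, branch_factor):
    """Count single-trajectory denoising steps in a shared-prefix tree.

    branch_steps holds 0-based step indices at which every live path splits
    into branch_factor children before that step is executed.
    Returns (number of terminal trajectories, total denoising steps).
    """
    live = 1
    total = 0
    for step in range(num_steps):
        if step in branch_steps:
            live *= branch_factor
        total += live  # one denoising step per live path
    return live, total

# Hypothetical schedule: 10 denoising steps, branching 3-way at steps 0, 3, and 6.
leaves, tree_cost = tree_step_count(num_steps=10, branch_steps={0, 3, 6}, branch_factor=3)

# Cost of sampling the same number of trajectories with independent full rollouts.
independent_cost = leaves * 10
```

Under these assumed settings the tree reaches 27 terminal trajectories with 144 denoising steps, compared with 270 for independent rollouts. Branching later in the schedule shares longer prefixes and lowers the count further, at the price of less variation between siblings.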

Results

Best Reward-Diversity-Efficiency Trade-Off

Across FLUX.1-dev alignment settings, TMPO obtains the strongest diversity metrics while remaining competitive with, or better than, the baselines on downstream rewards and reducing per-iteration time.

0.949 GenEval accuracy

Best compositional generation score under GenEval-only training.

24.277 PickScore

Best human-preference reward in PickScore-only alignment.

0.204 LGMD diversity

Positive latent-space diversity where reward-maximizing baselines collapse.

68.3 s iteration time

Faster than Flow-GRPO, MixGRPO, TreeGRPO, and GARDO in preference alignment.

Reward diversity and efficiency analysis — TMPO lies on the favorable Pareto frontier for reward, diversity, and iteration time.

Pareto analysis plot — Reward-efficiency comparison against GRPO-style alignment methods.

GenEval training curve — Compositional generation.

OCR training curve — Visual text rendering.

PickScore training curve — Human preference alignment.

Qualitative Comparison

Faithful Images Without Collapsing the Sample Set

Qualitative examples show TMPO maintaining prompt fidelity while producing visibly richer variation in composition, background, viewpoint, and text layout.

Qualitative comparison grid for TMPO and baseline methods — TMPO produces diverse samples across GenEval, OCR, and PickScore protocols.

Citation