DOPO: Dense Online Preference Optimization for Cross-Dataset Motion Diffusion Adaptation

Media Integration and Communication Center (MICC), University of Florence, Florence, Italy.
"A person in a fighting stance does two kicks with their right leg."
"A person walks forward then turns around and walks back."

Figure: Qualitative comparison of generated animations before and after fine-tuning. Each row shows the same text prompt with results from the pretrained MDM on BABEL (left) and the model fine-tuned with our approach on HumanML3D (right). Our fine-tuning improves motion quality and text-motion alignment.

Abstract

Adapting text-conditioned motion diffusion models to new domains typically requires collecting additional motion capture data and retraining from scratch, a process that is expensive, time-consuming, and often impractical. In this paper, we propose DOPO (Dense Online Preference Optimization), a post-training framework that fine-tunes pretrained motion diffusion models using only textual prompts from the target domain, without directly requiring ground-truth motion data or human preference annotations. DOPO builds on two components: (1) TMR-Dense, a timestep-conditioned text-motion retrieval network that evaluates motion-text alignment at arbitrary points along the denoising trajectory, enabling dense preference signals throughout generation; and (2) an online adaptation of Step-by-step Preference Optimization that uses TMR-Dense to automatically construct preference pairs from target-domain prompts, eliminating the need for pre-collected preference datasets. Unlike prior reinforcement learning approaches that rely on sparse terminal rewards, DOPO provides step-aware supervision while avoiding the instabilities associated with policy gradient optimization. We evaluate our approach on cross-dataset adaptation scenarios spanning BABEL, HumanML3D, and MotionX, testing both latent-space and joint-space diffusion architectures. Experimental results indicate consistent improvements over zero-shot baselines and prior RL-based methods across metrics, with training times reduced by approximately an order of magnitude. DOPO also preserves performance on source distributions after adaptation, suggesting that online preference-based fine-tuning offers a practical path toward more adaptable motion generation systems.

Approach Overview
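
The abstract describes scoring partially denoised motions with a timestep-conditioned network and turning those scores into preference pairs online. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that several candidate next-step states are sampled for the same prompt and timestep, that the highest- and lowest-scoring candidates form the pair, and that the pair is trained with a DPO-style logistic loss against a frozen reference model. The names `build_preference_pair`, `step_preference_loss`, `score_fn`, and `beta` are hypothetical, and the toy scorer merely stands in for TMR-Dense.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for a partially denoised motion (assumption: a flat list of floats).
Motion = List[float]
# Hypothetical scorer signature: (noisy motion, prompt, timestep) -> alignment score.
ScoreFn = Callable[[Motion, str, int], float]


def build_preference_pair(
    candidates: Sequence[Motion], score_fn: ScoreFn, prompt: str, t: int
) -> Tuple[Motion, Motion, float]:
    """Pick the best- and worst-scoring candidates at timestep t.

    Returns (winner, loser, score_gap). All candidates are assumed to be
    sampled from the same state, prompt, and timestep.
    """
    scores = [score_fn(c, prompt, t) for c in candidates]
    win = max(range(len(candidates)), key=scores.__getitem__)
    lose = min(range(len(candidates)), key=scores.__getitem__)
    return candidates[win], candidates[lose], scores[win] - scores[lose]


def step_preference_loss(
    logp_win: float,
    logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 1.0,
) -> float:
    """DPO-style loss for one denoising step: -log(sigmoid(margin)).

    The margin compares how much more the trained model prefers the winner
    over the loser than the frozen reference model does.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def toy_score(motion: Motion, prompt: str, t: int) -> float:
    """Toy scorer: closer to 1.0 is 'better aligned' (ignores prompt and t)."""
    return -abs(motion[0] - 1.0)


candidates = [[0.0], [1.0], [3.0]]
winner, loser, gap = build_preference_pair(candidates, toy_score, "a person jogs in place", t=10)
loss_at_init = step_preference_loss(0.0, 0.0, 0.0, 0.0)
```

In this sketch the loss starts at log 2 when the trained model equals the reference and decreases as the model assigns relatively more probability to the winning candidate. Because a pair is built at every sampled timestep, the supervision is dense along the trajectory rather than tied to a single terminal reward.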

Comparison

Pretrained MDM

Prompt: "a person jogs in place"

Pretrained StableMoFusion

Prompt: "person is doing some 90's dances"

Ours: MDM fine-tuned

Prompt: "a person jogs in place"

Ours: StableMoFusion fine-tuned

Prompt: "person is doing some 90's dances"


Pretrained MDM

Prompt: "a person holds their arms up then lowers them, then bends their arms at the elbow and moves back and forth"

Pretrained StableMoFusion

Prompt: "a person is holding something in his or her hand, turns it upward, he looks down at it and puts his hand down"

Ours: MDM fine-tuned

Prompt: "a person holds their arms up then lowers them, then bends their arms at the elbow and moves back and forth"

Ours: StableMoFusion fine-tuned

Prompt: "a person is holding something in his or her hand, turns it upward, he looks down at it and puts his hand down"