TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models

1Dalian University of Technology, 2Hong Kong University of Science and Technology, 3Huawei Noah's Ark Lab, 4The Chinese University of Hong Kong, 5GAC Research and Development Center
(*Equal contribution. Corresponding authors.)
Method Overview

TrackDiffusion generates continuous video sequences from the tracklets.
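To make the conditioning input concrete, here is a minimal sketch of one way tracklets could be represented: each instance carries one bounding box per frame, with `None` marking frames where it is absent. The `Tracklet` class, the normalized `(x1, y1, x2, y2)` box format, and the `boxes_at_frame` helper are illustrative assumptions, not the data format used by TrackDiffusion.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Tracklet:
    """Hypothetical container: one instance followed across frames."""
    instance_id: int
    category: str
    boxes: List[Optional[Box]]  # one entry per frame; None = not visible


def boxes_at_frame(tracklets: List[Tracklet], t: int) -> Dict[int, Box]:
    """Return {instance_id: box} for the instances visible in frame t."""
    return {
        tr.instance_id: tr.boxes[t]
        for tr in tracklets
        if tr.boxes[t] is not None
    }


# A car moving right over three frames, and a person who appears in frame 1.
tracklets = [
    Tracklet(0, "car", [(0.1, 0.4, 0.3, 0.6),
                        (0.2, 0.4, 0.4, 0.6),
                        (0.3, 0.4, 0.5, 0.6)]),
    Tracklet(1, "person", [None,
                           (0.7, 0.5, 0.8, 0.9),
                           (0.6, 0.5, 0.7, 0.9)]),
]

frame0 = boxes_at_frame(tracklets, 0)
frame1 = boxes_at_frame(tracklets, 1)
```

Keeping the same `instance_id` across frames is what distinguishes a tracklet from an independent per-frame layout: it tells the generator which box in frame `t + 1` belongs to the same object as a box in frame `t`.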

The framework generates video frames conditioned on the provided tracklets and uses the Instance Enhancer to reinforce the temporal consistency of foreground instances. A new gated cross-attention layer is inserted to inject the instance information.
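The idea of a gated cross-attention layer can be sketched in a few lines. The version below is a simplified illustration under stated assumptions, not the TrackDiffusion implementation: it omits learned query/key/value projections, uses a single head, and assumes a `tanh(gamma)` gate on the residual branch, a common design in which `gamma = 0` leaves the pretrained features unchanged. All function and variable names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention; context acts as keys and values.

    Learned projections are omitted for brevity (assumption).
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context))
                    for j in range(d)])
    return out


def gated_cross_attention(visual: List[Vector],
                          instance: List[Vector],
                          gamma: float) -> List[Vector]:
    """Residual update: visual + tanh(gamma) * Attn(visual, instance)."""
    gate = math.tanh(gamma)
    attended = cross_attention(visual, instance)
    return [[x + gate * a for x, a in zip(x_row, a_row)]
            for x_row, a_row in zip(visual, attended)]


# Two visual tokens attend to two instance tokens (toy values).
visual = [[1.0, 0.0], [0.0, 1.0]]
instance = [[0.5, 0.5], [1.0, -1.0]]

closed = gated_cross_attention(visual, instance, gamma=0.0)  # gate closed
opened = gated_cross_attention(visual, instance, gamma=1.0)  # gate open
```

With the gate closed, the layer is an identity map on the visual tokens, which is why such layers can be added to a pretrained model without disturbing it at the start of fine-tuning.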

Video comparison: ModelScope, AnimateDiff, Ours. Prompts:
The motorcyclist crouched into a streamlined position, speeding left down the country road.
An SUV is driving to the left on a rugged mountain road, with the camera panning along with its movement.
A deer is swiftly running through the forest.
A rocket rapidly ascends, camera shakily tracking its path.
Two parrots playfully interact in a cage, touching each other.
A large truck is winding its way along a twisting mountain road.


Despite remarkable achievements in video synthesis, granular control over complex dynamics, such as nuanced movement among multiple interacting objects, remains a significant hurdle for dynamic world modeling. The difficulty is compounded by the need to handle object appearance and disappearance and drastic scale changes, and to keep instances consistent across frames. These challenges hinder the development of video generation that can faithfully mimic real-world complexity, limiting its utility for applications that require high realism and controllability, including advanced scene simulation and the training of perception systems.

To address this, we propose TrackDiffusion, a novel video generation framework that provides fine-grained, trajectory-conditioned motion control via diffusion models. It enables precise manipulation of object trajectories and interactions, overcoming the prevalent limitations of scale and continuity disruptions. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly enforces inter-frame consistency of multiple objects, a critical factor overlooked in the current literature. Moreover, we demonstrate that video sequences generated by TrackDiffusion can be used as training data for visual perception models. To the best of our knowledge, this is the first work to apply video diffusion models with tracklet conditions and to demonstrate that the generated frames can improve the performance of object trackers.

Examples on Driving Scenes

Downstream Support

Videos generated by TrackDiffusion can be used as data augmentation for multi-object tracking. The table below compares trainability on the YTVIS dataset; the 480x320 TrackDiffusion variant is used for this evaluation.

Method          TrackAP        TrackAP50
Real only       45.4           64.1
Vanilla         45.2 (-0.2)    60.7 (-3.4)
TrackDiffusion  46.7 (+1.3)    65.6 (+1.5)
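The parenthesized values are differences from the "Real only" row. A short sketch that recomputes them from the reported scores (the dictionary layout and function name are illustrative, and the numbers are copied from the table):

```python
# (TrackAP, TrackAP50) per training setup, as reported in the table.
results = {
    "Real only": (45.4, 64.1),
    "Vanilla": (45.2, 60.7),
    "TrackDiffusion": (46.7, 65.6),
}


def deltas_vs_baseline(scores, baseline="Real only"):
    """Return per-metric differences from the baseline row, rounded to 0.1."""
    base = scores[baseline]
    return {
        name: tuple(round(v - b, 1) for v, b in zip(vals, base))
        for name, vals in scores.items()
        if name != baseline
    }


deltas = deltas_vs_baseline(results)
for name, (d_ap, d_ap50) in deltas.items():
    print(f"{name}: TrackAP {d_ap:+.1f}, TrackAP50 {d_ap50:+.1f}")
```

Only the TrackDiffusion row improves on both metrics relative to training on real data alone.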