ChaoBro

NVIDIA AnyFlow: A "Step-Agnostic" Experiment in Video Diffusion Models—Can On-Policy Distillation End Inference Step Anxiety?

Anyone who has used a video generation model has probably run into this: you just want a quick preview, but the model insists on 50 inference steps and is agonizingly slow. And when you do want high-quality output, running 50 steps is not necessarily much better than running 25.

NVIDIA's AnyFlow paper, released on May 13 (garnering 81 upvotes on Hugging Face Papers Trending), attempts to solve this problem. Its core idea is straightforward: let a single model learn to work well across different step counts, instead of training a separate distilled model for each step count, as is done today.

The Inference Step Dilemma

Current video diffusion models face a structural issue:

The number of inference steps is effectively baked in at training time. If a model is trained and tuned for 50-step sampling, 50 steps is also where it performs best at inference. Want to speed it up? You can use Consistency Distillation or LCM (Latent Consistency Model) techniques to compress sampling down to 4-8 steps. The trade-off is a drop in quality, and every target step count needs its own separately trained distilled model.

This means that deploying a video generation service might require maintaining multiple models—a high-precision version (50 steps), a fast version (4 steps), and a medium-speed version (10 steps). Each version consumes VRAM and requires separate maintenance.

AnyFlow's ambition is simple: one model to cover the entire range of step counts.

On-Policy Flow Map Distillation

The paper's core method is called On-Policy Flow Map Distillation. Understanding it requires breaking down three concepts:

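Flow Map: In flow-based diffusion models, the generation process from noise to data is modeled as a continuous flow. The flow map is the function that carries a sample from one point in time on this flow directly to another, rather than describing only the instantaneous direction of travel.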
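In symbols, under the standard flow-based setup, this can be written as follows. The notation ($v$ for the velocity field, $\Phi_{t \to s}$ for the flow map) is my own choice for illustration and is not taken from the paper.

```latex
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v(x_t, t), \qquad
\Phi_{t \to s}(x_t) = x_t + \int_t^s v(x_u, u)\,\mathrm{d}u = x_s, \qquad
\Phi_{u \to s} \circ \Phi_{t \to u} = \Phi_{t \to s}.
```

The last identity says that one large jump equals any chain of smaller jumps covering the same interval, which is what makes the step count a free choice in principle.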

On-Policy: During distillation, the model uses its own outputs as training signals instead of relying on a fixed teacher model. This means the model continuously calibrates itself using the outputs from its own current version during training.

Arbitrary-Step Training: The key trick is to randomly sample a step count k during training and teach the model to complete generation in exactly k steps, whatever k happens to be. At inference time, the step count then becomes an ordinary input parameter.
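A minimal sketch of how random step-count sampling and on-policy rollouts could fit together, on a 1-D toy problem. Everything here is an illustrative assumption rather than the paper's algorithm: the names `time_grid`, `make_student`, `rollout_loss` and `training_losses` are hypothetical, the student is a two-parameter formula instead of a network, and the self-consistency loss (one jump should match two half-jumps made by the same model) is just one plausible way of letting a model supervise itself.

```python
import random


def time_grid(k):
    """Return k + 1 evenly spaced times from 0.0 (noise) to 1.0 (data)."""
    return [i / k for i in range(k + 1)]


def make_student(a, b):
    """Hypothetical student: a jump from time t to s, conditioned on h = s - t."""
    def step(x, t, s):
        h = s - t
        return x * (1.0 - a * h + b * h * h)
    return step


def self_consistency_loss(student, x, t, s):
    """Compare one big jump with two half-jumps made by the same student."""
    mid = 0.5 * (t + s)
    one_jump = student(x, t, s)
    # Target built purely from the student's own outputs (no fixed teacher).
    two_jumps = student(student(x, t, mid), mid, s)
    return (one_jump - two_jumps) ** 2


def rollout_loss(student, x0, k):
    """Roll the student out for k steps and score every jump along the way."""
    grid = time_grid(k)
    x, total = x0, 0.0
    for t, s in zip(grid[:-1], grid[1:]):
        total += self_consistency_loss(student, x, t, s)
        x = student(x, t, s)  # on-policy: continue from the student's own state
    return total / k


def training_losses(student, num_iters, k_min, k_max, seed=0):
    """Draw a fresh step count k each iteration and record (k, loss)."""
    rng = random.Random(seed)
    losses = []
    for _ in range(num_iters):
        k = rng.randint(k_min, k_max)  # random step count, inclusive bounds
        x0 = rng.gauss(0.0, 1.0)       # fresh noise sample
        losses.append((k, rollout_loss(student, x0, k)))
    return losses


euler_like = make_student(a=1.0, b=0.0)
losses = training_losses(euler_like, num_iters=5, k_min=1, k_max=25)
```

In a real training setup the two-jump target would typically be treated as a constant (stop-gradient) and the loss backpropagated only through the one-jump prediction; the sketch stops at computing the loss.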

In practice, the model learns a continuous spectrum of step-count-to-quality trade-offs, rather than optimizing for a single fixed step count. One step yields a rough draft, 10 steps show significant improvement, and 25 steps approach optimal quality—all achieved with the exact same model.
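To make "step count as a runtime argument" concrete, here is a toy illustration. It is not AnyFlow: it is a plain Euler sampler on a hypothetical 1-D velocity field, and the names `velocity`, `sample` and `endpoint_error` are made up for this sketch. It only shows the shape of the trade-off, with the same "model" called at 1, 10 and 25 steps and the error against the exact endpoint shrinking as the step count grows.

```python
import math


def velocity(x, t):
    """Toy velocity field; starting from x0, the exact endpoint at t=1 is x0 * exp(-1)."""
    return -x


def sample(x0, num_steps):
    """Integrate from t=0 to t=1 using num_steps Euler steps of the same model."""
    h = 1.0 / num_steps
    x = x0
    for i in range(num_steps):
        x = x + h * velocity(x, i * h)
    return x


def endpoint_error(num_steps, x0=1.0):
    """Absolute gap between the num_steps-step sample and the exact endpoint."""
    return abs(sample(x0, num_steps) - x0 * math.exp(-1.0))


# One model, three step budgets chosen at call time.
errors = {k: endpoint_error(k) for k in (1, 10, 25)}
```

With a naive sampler like this, the 1-step result is very coarse; the point of step-agnostic training is to make the low-step end of that curve far more usable without a separate model.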

Reception on HF Papers

AnyFlow has garnered 81 upvotes on Hugging Face Papers Trending, and its accompanying GitHub repository has reached 202 stars. Given that the paper was released less than two days ago, this level of attention indicates the community's strong interest in the "arbitrary step count" direction.

The paper's authors are from the NVIDIA research team. Given NVIDIA's investments in the video generation space (projects like Cosmos and Video LDM), AnyFlow is probably more than an academic exploration; it looks like groundwork for the company's product lineup.

Comparison with Existing Solutions

In the "reduce inference steps" race, several major directions already exist:

| Method | Core Idea | Limitations |
|---|---|---|
| Consistency Models | Train the model to maintain consistent outputs across different timesteps | Unstable training, noticeable quality degradation |
| LCM | Latent Consistency Models, distilled to reduce steps | Requires separate training for different step counts |
| Progressive Distillation | Iterative distillation, halving the step count each round | Still limited to a few discrete step counts |
| AnyFlow | Random step sampling during training, on-policy distillation | New method, requires further validation |
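The "few discrete step counts" limitation of Progressive Distillation is easy to see with a little arithmetic. The helper below is a hypothetical illustration (the name `halving_schedule` and the starting value of 64 are my assumptions, not values from any paper): it lists the step counts reachable by repeated halving, and a budget such as 10 steps never appears.

```python
def halving_schedule(start_steps):
    """List the step counts reachable by halving until the count turns odd."""
    steps = [start_steps]
    while steps[-1] % 2 == 0:
        steps.append(steps[-1] // 2)
    return steps


schedule = halving_schedule(64)   # [64, 32, 16, 8, 4, 2, 1]
has_ten_step_model = 10 in schedule  # False: 10 steps is not on the ladder
```

Each entry on that ladder is also its own distillation round, which is the cost an arbitrary-step model is meant to avoid.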

What makes AnyFlow unique is that it doesn't chase the "fewest steps possible," but rather "arbitrary steps." This represents a different design philosophy—it doesn't assume users only need one fixed step count. Instead, it acknowledges that different scenarios require different step counts and enables a single model to adapt to all of them.

My Thoughts

This direction holds practical value, but it also warrants a measured perspective.

Points Worth Noting:

  • The simplification at the deployment level is tangible—one model replacing multiple versions
  • The on-policy distillation approach is more flexible than fixed teacher distillation, as the "teacher" itself continues to evolve
  • If NVIDIA integrates this into its video generation product line (e.g., Cosmos), its impact will be even greater

Points Requiring Validation:

  • How does the claimed "arbitrary step" quality curve perform on actual videos? Image and video generation are fundamentally different—videos require temporal consistency
  • The GitHub repository, with 202 stars, is still in a very early stage; reproducibility remains to be tested by the community
  • Stability issues with on-policy distillation have been discussed in the literature: could the model gradually degrade during self-training?

The next phase of competition in video generation is not just about "who can generate more realistic videos," but "who can generate usable videos at a reasonable cost." AnyFlow makes a valuable attempt in this direction—but before reaching production-grade application, it will need at least another round of community reproduction and stress testing.


Primary Sources: