Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

(*: Equal contribution)
KAIST AI

[Video comparison: Vanilla REPA* vs. CREPA]

PROMPT: "A black and white animated scene unfolds with a steamboat on a serene river or canal, surrounded by a dock-like structure and rocky shores. The boat emits dark smoke from two tall smokestacks as it moves, leaving a trail behind. As the steamboat accelerates, the smoke grows denser. It eventually disappears from view, and a character emerges from a nearby house-like structure, standing on a small pier, observing the surroundings in a simplistic, classic animation style."

Abstract

Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning their internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns the hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and HunyuanVideo, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability.

Results

Crush

Method Overview

Background

Existing REPA aligns the hidden states of Image Diffusion Transformers with pretrained visual features, improving convergence and generation quality. When REPA is applied to Video Diffusion Models (VDMs) as REPA*, it performs frame-wise alignment but fails to maintain semantic consistency across frames. Due to the inherent nature of denoising autoencoders (DAEs), hidden states extracted from noisy inputs can vary stochastically across frames, leading to semantic misalignment between adjacent frames.

Motivation from Empirical Observation

The motivation for Cross-frame Representation Alignment (CREPA) stems from empirical observations of how hidden states behave in Video Diffusion Models (VDMs). As shown in the figure below, we visualize the learned hidden states using CKNN-A (Continuous K-Nearest Neighbors Alignment) on pretrained feature manifolds. Under REPA*, the hidden states exhibit stochastic drift and fail to follow a smooth trajectory across frames. In contrast, with CREPA, the hidden states align more faithfully with the pretrained feature manifold, leading to improved temporal semantic consistency and smoother cross-frame representations. This empirical observation supports our design choice to explicitly align hidden states with neighboring frames during fine-tuning.

CREPA Model Diagram

Core Idea of CREPA

CREPA explicitly aligns the hidden state of each frame with both its own pretrained feature and the features of adjacent frames. This encourages the hidden states to remain smooth and consistent along a temporal semantic manifold, improving cross-frame semantic consistency in generated videos.

CREPA Model Diagram

Implementation

REPA*

$$ \mathcal{L}_{align} = - \mathbb{E}_{x_0, \epsilon, t} \left[ \sum_f \text{sim}(\bar{y}^f, h_\phi(h^f_t)) \right] $$

Final objective:

$$ \mathcal{L} = \mathcal{L}_{score} + \lambda \mathcal{L}_{align} $$
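The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the tensor shapes, the patch-averaged cosine similarity, and the helper names (`repa_star_align_loss`, `total_loss`) are our assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity along the last (feature) axis.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def repa_star_align_loss(y_bar, h_proj):
    """Frame-wise REPA* alignment: -sum_f sim(y_bar^f, h_phi(h_t^f)).

    y_bar:  (F, P, D) pretrained visual features (F frames, P patches, D dims).
    h_proj: (F, P, D) projected diffusion hidden states h_phi(h_t^f)
            (assumed shape; the projector h_phi is applied upstream).
    """
    # Average similarity over patches, then sum (negated) over frames.
    return -np.sum(cosine_sim(y_bar, h_proj).mean(axis=-1))

def total_loss(score_loss, align_loss, lam=0.5):
    # L = L_score + lambda * L_align (lambda value is illustrative).
    return score_loss + lam * align_loss
```

When the projected hidden states already match the pretrained features exactly, each frame contributes a similarity of 1, so the alignment loss reaches its minimum of `-F`.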

CREPA

$$ \mathcal{L}_{align} = - \mathbb{E}_{x_0, \epsilon, t} \left[ \sum_f \left( \text{sim}(\bar{y}^f, h_\phi(h^f_t)) + \sum_{k \in K} e^{-\frac{|k-f|}{\tau}} \cdot \text{sim}(\bar{y}^k, h_\phi(h^f_t)) \right) \right] $$
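The CREPA objective adds exponentially decayed neighbor terms to the frame-wise loss above. The sketch below follows the equation directly; the neighbor set K (here, frames within a fixed `radius` of f), the default `tau`, and the patch-averaged similarity are our illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity along the last (feature) axis.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def crepa_align_loss(y_bar, h_proj, radius=1, tau=1.0):
    """CREPA alignment: each frame's projected hidden state h_phi(h_t^f) is
    pulled toward its own pretrained feature y_bar^f and toward neighboring
    frames' features y_bar^k, weighted by exp(-|k - f| / tau).

    y_bar, h_proj: (F, P, D) arrays (assumed shapes, as in the REPA* sketch).
    """
    F = y_bar.shape[0]
    loss = 0.0
    for f in range(F):
        # Self term: sim(y_bar^f, h_phi(h_t^f)).
        loss -= cosine_sim(y_bar[f], h_proj[f]).mean()
        # Neighbor terms: k in K, weighted by temporal distance.
        for k in range(max(0, f - radius), min(F, f + radius + 1)):
            if k == f:
                continue
            w = np.exp(-abs(k - f) / tau)
            loss -= w * cosine_sim(y_bar[k], h_proj[f]).mean()
    return loss
```

Setting `radius=0` recovers the frame-wise REPA* loss; increasing `tau` flattens the decay so distant frames contribute more strongly to each frame's alignment target.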


BibTeX

@misc{hwang2025crepa,
      title={Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models}, 
      author={Sungwon Hwang and Hyojin Jang and Kinam Kim and Minho Park and Jaegul Choo},
      year={2025},
      eprint={2506.09229},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.09229}, 
}