Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure emerges early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Using SEA-filtered input features to estimate redundancy yields dynamic schedules that adapt to content while respecting the spectral priors underlying the diffusion model. Extensive experiments across diverse visual generative models and strong baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.
In this paper, we incorporate this spectral evolution, or equivalently the evolution of the signal-to-noise ratio, into cache scheduling. SeaCache applies a timestep-dependent Spectral-Evolution-Aware (SEA) filter to raw diffusion features so that the distance measure better captures content-relevant spectral residuals between timesteps.
Rather than treating all spectral components of diffusion features equally, we design a cache metric that emphasizes the signal component while downweighting the noise component. By grounding reuse decisions in discrepancies in the synthesized content, the resulting metric becomes less sensitive to high-frequency noise and encourages cache gating to respond to meaningful signal alignment rather than stochastic variation.
To validate this idea, we conduct an oracle experiment that compares cache schedules derived from raw feature distances with those derived from distances in a signal-emphasized space. In standard caching schemes, the decision to skip or compute is based on the distance between input features at consecutive timesteps. In our oracle analysis, we instead compare consecutive output features, thereby removing input-to-output approximation error and isolating the effect of spectral filtering. Specifically, we compare two criteria: one that measures distances after applying the SEA (Spectral-Evolution-Aware) filter, which downweights the noise component, and another that uses unfiltered raw outputs, as shown in the figure below. The filtered criterion yields cache decisions that more closely track the full-compute trajectory, as evidenced by consistently higher PSNR. This suggests that spectrum-aware scheduling better preserves the behavior of the original model.
Latency-quality trade-off on FLUX
Latency-quality trade-off on Wan2.1 1.3B
To design a filter that reflects spectral evolution, we formalize the change of the effective frequency band across timesteps. By deriving an optimal linear denoiser, we propose a Spectral-Evolution-Aware (SEA) filter that effectively preserves content-relevant components while suppressing noise. The visualization of the SEA filter is presented below.
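As a sketch of how such a filter can arise, consider the standard Wiener-style derivation. The symbols below follow common diffusion notation and are illustrative, not the paper's exact derivation: \(\alpha_t, \sigma_t\) are the forward-process signal and noise scales, and \(S(\omega)\) is the power spectrum of the clean signal. Minimizing the per-frequency mean squared error of a linear denoiser gives a filter whose pass-band widens as noise decreases:

```latex
% Forward process in the frequency domain, with white noise E(\omega):
%   X_t(\omega) = \alpha_t X_0(\omega) + \sigma_t E(\omega)
% Minimizing E|G_t(\omega) X_t(\omega) - X_0(\omega)|^2 over linear G_t yields
\[
  G_t(\omega) \;=\; \frac{\alpha_t\, S(\omega)}{\alpha_t^2\, S(\omega) + \sigma_t^2},
  \qquad
  G_t^{\mathrm{norm}}(\omega) \;=\; \frac{G_t(\omega)}{\max_{\omega'} G_t(\omega')}.
\]
% As t decreases, \sigma_t shrinks and G_t passes progressively higher
% frequencies, matching the coarse-to-fine spectral evolution of diffusion.
```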
Optimal linear denoising filter for different \(t\)
Normalized SEA filter for different \(t\)
Directly using SEA-filtered outputs in the cache metric is not practical, since the output is only available after a full denoiser run and thus offers no speedup. We therefore seek an input-side proxy that matches the SEA-filtered output distance as closely as possible. We compare several candidate features: the raw input, the raw output, the polynomial-fitted input used in TeaCache, and their SEA-filtered counterparts obtained by applying the proposed SEA filter.
The figure below reports the relative \(\ell_1\) distance between consecutive timesteps for these feature choices, averaged over ten samples on FLUX and Wan2.1 1.3B. The SEA-filtered input distances closely follow the SEA-filtered output distances along the entire trajectory, while the raw input and the polynomial-fitted input show weaker alignment, especially at early timesteps.
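As a toy illustration of why filtered distances track content rather than noise, the sketch below compares the relative \(\ell_1\) distance between two noisy copies of the same low-frequency content, before and after filtering. A Gaussian low-pass filter stands in for the SEA filter, and the names `sea_filter` and `rel_l1` are ours, not from the paper's code:

```python
import numpy as np

def sea_filter(feat, g):
    # FFT -> multiply by a frequency-domain filter g -> iFFT, keep the real part
    return np.fft.ifft2(np.fft.fft2(feat) * g).real

def rel_l1(a, b):
    # Relative L1 distance between consecutive features
    return np.abs(a - b).mean() / (np.abs(b).mean() + 1e-8)

h = w = 32
fy = np.fft.fftfreq(h)[:, None]
fx = np.fft.fftfreq(w)[None, :]
g = np.exp(-(fy**2 + fx**2) / (2 * 0.05**2))  # Gaussian low-pass (illustrative)

yy, xx = np.mgrid[0:h, 0:w]
content = np.sin(2 * np.pi * yy / h) + np.cos(2 * np.pi * xx / w)  # low-freq "signal"
rng = np.random.default_rng(0)
feat_t = content + 0.5 * rng.standard_normal((h, w))   # step t: content + noise
feat_t1 = content + 0.5 * rng.standard_normal((h, w))  # step t+1: same content, fresh noise

d_raw = rel_l1(feat_t, feat_t1)
d_sea = rel_l1(sea_filter(feat_t, g), sea_filter(feat_t1, g))
print(d_raw, d_sea)  # the raw distance is dominated by noise; the filtered one is far smaller
```

The content term cancels in both numerators, so the raw distance is almost entirely noise; after low-pass filtering, the noise is suppressed while the content (the denominator) survives, shrinking the relative distance.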
Relative \(\ell_1\) distance across the generation process on FLUX
Relative \(\ell_1\) distance across the generation process on Wan2.1 1.3B
Given input features \(I_t\) and \(I_{t+1}\), SeaCache first applies FFT, multiplies by the timestep-dependent SEA filters \(G_t^{\mathrm{norm}}\) and \(G_{t+1}^{\mathrm{norm}}\), and then applies iFFT to obtain spectral-evolution-aware features \(\mathcal{P}(G_t^{\mathrm{norm}}, I_t)\) and \(\mathcal{P}(G_{t+1}^{\mathrm{norm}}, I_{t+1})\). A spectrum-aware dynamic caching module measures the relative distance \(\widetilde{\Delta}_t\) between consecutive filtered features, accumulates it over timesteps, and either reuses the cached output or refreshes the denoiser when the threshold \(\delta\) is exceeded. The underlying diffusion model remains unchanged, so SeaCache acts as a plug-and-play cache policy that replaces only the distance metric.
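A minimal sketch of this cache policy follows. Function and variable names are ours, `toy_denoiser` stands in for the expensive model call, and the identity filter is used only to keep the example self-contained:

```python
import numpy as np

def sea_features(feat, g_norm):
    # FFT -> multiply by the normalized SEA filter -> iFFT (real part)
    return np.fft.ifft2(np.fft.fft2(feat) * g_norm).real

def run_with_seacache(inputs, filters, denoiser, delta=0.3):
    """Schematic SeaCache loop: reuse the cached denoiser output while the
    accumulated SEA-filtered input distance stays below the threshold delta."""
    outputs, acc, cached, prev = [], 0.0, None, None
    for t, (x, g) in enumerate(zip(inputs, filters)):
        p = sea_features(x, g)
        if prev is not None:
            # relative L1 distance between consecutive filtered features
            acc += np.abs(p - prev).mean() / (np.abs(prev).mean() + 1e-8)
        if cached is None or acc >= delta:
            cached = denoiser(x, t)  # refresh: run the full denoiser
            acc = 0.0
        outputs.append(cached)       # otherwise reuse the cached output
        prev = p
    return outputs

calls = {"n": 0}
def toy_denoiser(x, t):
    calls["n"] += 1
    return 0.9 * x

steps = 10
xs = [np.ones((8, 8)) for _ in range(steps)]  # identical inputs: fully redundant
gs = [np.ones((8, 8)) for _ in range(steps)]  # identity "filter" for the toy example
outs = run_with_seacache(xs, gs, toy_denoiser, delta=0.3)
print(calls["n"])  # only the first step runs the denoiser; the rest reuse the cache
```

Because the diffusion model itself is untouched, swapping `toy_denoiser` for a real denoiser and `gs` for the per-timestep SEA filters is the only change needed to apply the policy.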
For all experiments, generated images and videos are saved as PNG and MP4 files, respectively. For text-to-image (T2I) generation, we evaluate 200 prompts from DrawBench and generate images at 1024\(\times\)1024 resolution with FLUX. For text-to-video (T2V) generation, we use 944 prompts from VBench and generate 480p videos with 65 frames per prompt with Wan2.1.
For each configuration, we use the full-timestep output of the original model as the reference. We compute PSNR, LPIPS, and SSIM between each cached sample and its reference, and then report the average over all samples. The initial random seed is shared across our method and all baselines. We consider two cache budgets, approximately 50% and 30%.
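For reference, PSNR against the full-compute output can be computed as below; this is the standard definition, where `max_val` is the image dynamic range (an assumption, since the exact evaluation code is not shown here):

```python
import numpy as np

def psnr(ref, img, max_val=1.0):
    """PSNR in dB between a cached sample and its full-compute reference."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val**2 / mse)

ref = np.zeros((4, 4))
img = np.full((4, 4), 0.1)
print(psnr(ref, img))  # MSE = 0.01 -> 20 dB
```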
Quantitative Comparison in FLUX
| Method | Latency (s) | TFLOPs | PSNR ↑ | LPIPS ↓ | SSIM ↑ |
|---|---|---|---|---|---|
| Original (50 steps) | 20.9 | 2976 | - | - | - |
| Vanilla 25 steps | 10.5 | 1487 | 15.55 | 0.409 | 0.668 |
| Vanilla 15 steps | 6.4 | 892 | 17.84 | 0.305 | 0.740 |
| TeaCache (δ=0.3) | 11.4 | 1547 | 20.76 | 0.211 | 0.810 |
| TaylorSeer (S=3) | 9.8 | 1191 | 22.78 | 0.163 | 0.828 |
| SeaCache (δ=0.3) | 9.4 | 1098 | 26.28 | 0.106 | 0.893 |
| △-Dit | 15.5 | 1984 | 17.40 | 0.336 | 0.710 |
| ToCa | 15.9 | 1263 | 18.39 | 0.324 | 0.700 |
| TeaCache (δ=0.6) | 7.1 | 892 | 17.21 | 0.348 | 0.714 |
| TaylorSeer (S=5) | 7.5 | 834 | 19.97 | 0.236 | 0.762 |
| SeaCache (δ=0.6) | 6.4 | 773 | 21.33 | 0.226 | 0.798 |
Quantitative Comparison in Wan2.1 1.3B
| Method | Latency (s) | TFLOPs | PSNR ↑ | LPIPS ↓ | SSIM ↑ |
|---|---|---|---|---|---|
| Original (50 steps) | 176.3 | 8214 | - | - | - |
| TeaCache (δ=0.09) | 86.6 | 4107 | 20.84 | 0.171 | 0.721 |
| TaylorSeer (S=2) | 93.1 | 4189 | 16.15 | 0.336 | 0.543 |
| SeaCache (δ=0.2) | 83.9 | 3942 | 26.60 | 0.075 | 0.873 |
| TeaCache (δ=0.15) | 63.6 | 2957 | 18.88 | 0.245 | 0.645 |
| TaylorSeer (S=3) | 67.1 | 2956 | 14.18 | 0.455 | 0.453 |
| SeaCache (δ=0.35) | 56.6 | 2793 | 21.78 | 0.170 | 0.740 |
"A skateboard on the top of a surfboard front view"
"A person is running on treadmill"
"An orange and a clock"
"A person is tai chi"
"A person is sweeping floor"
"A panda cooking in the kitchen"
"Vampire makeup face of beautiful girl red contact lenses"
"A truck and a bicycle"
@article{chung2026seacache,
title={SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models},
author={Chung, Jiwoo and Hyun, Sangeek and Lee, MinKyu and Han, Byeongju and Cha, Geonho and Wee, Dongyoon and Hong, Youngjun and Heo, Jae-Pil},
journal={arXiv preprint arXiv:2602.18993},
year={2026}
}