Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure emerges early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Using SEA-filtered input features to estimate redundancy yields dynamic schedules that adapt to content while respecting the spectral priors underlying the diffusion model. Extensive experiments across diverse visual generative models and strong baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.
In this paper, we incorporate this spectral evolution, or equivalently the evolution of the signal-to-noise ratio, into cache scheduling. SeaCache applies a timestep-dependent Spectral-Evolution-Aware (SEA) filter to raw diffusion features so that the distance measure better captures content-relevant spectral residuals between timesteps.
Rather than treating all spectral components of diffusion features equally, we design a cache metric that emphasizes the signal component while downweighting the noise component. By grounding reuse decisions in discrepancies in the synthesized content, the resulting metric becomes less sensitive to high-frequency noise and encourages cache gating to respond to meaningful signal alignment rather than stochastic variation.
To validate this idea, we conduct an oracle experiment that compares cache schedules derived from raw feature distances with those derived from distances in a signal-emphasized space. In standard caching schemes, the decision to skip or compute is based on the distance between input features at consecutive timesteps. In our oracle analysis, we instead compare consecutive output features, thereby removing input-to-output approximation error and isolating the effect of spectral filtering. Specifically, we compare two criteria: one that measures distances after applying the SEA (Spectral-Evolution-Aware) filter, which downweights the noise component, and another that uses unfiltered raw outputs, as shown in the figure below. The filtered criterion yields cache decisions that more closely track the full-compute trajectory, as evidenced by consistently higher PSNR. This suggests that spectrum-aware scheduling better preserves the behavior of the original model.
Latency-quality trade-off on FLUX
Latency-quality trade-off on Wan2.1 1.3B
To design a filter that reflects spectral evolution, we formalize the change of the effective frequency band across timesteps. By deriving an optimal linear denoiser, we propose a Spectral-Evolution-Aware (SEA) filter that effectively preserves content-relevant components while suppressing noise. The visualization of the SEA filter is presented below.
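As a sketch of how such a filter can arise, consider the standard Wiener-style derivation. The symbols below follow common diffusion notation and are illustrative, not the paper's exact derivation: \(\alpha_t, \sigma_t\) are the forward-process signal and noise scales, and \(S(\omega)\) is the power spectrum of the clean signal. Minimizing the per-frequency mean squared error of a linear denoiser gives a filter whose pass-band widens as noise decreases:

```latex
% Forward process in the frequency domain, with white noise E(\omega):
%   X_t(\omega) = \alpha_t X_0(\omega) + \sigma_t E(\omega)
% Minimizing E|G_t(\omega) X_t(\omega) - X_0(\omega)|^2 over linear G_t yields
\[
  G_t(\omega) \;=\; \frac{\alpha_t\, S(\omega)}{\alpha_t^2\, S(\omega) + \sigma_t^2},
  \qquad
  G_t^{\mathrm{norm}}(\omega) \;=\; \frac{G_t(\omega)}{\max_{\omega'} G_t(\omega')}.
\]
% As t decreases, \sigma_t shrinks and G_t passes progressively higher
% frequencies, matching the coarse-to-fine spectral evolution of diffusion.
```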
Optimal linear denoising filter for different \(t\)
Normalized SEA filter for different \(t\)
Directly using SEA-filtered outputs in the cache metric is not practical, since the output is only available after a full denoiser run and thus offers no speedup. We therefore seek an input-side proxy that matches the SEA-filtered output distance as closely as possible. We compare several candidate features: the raw input, the raw output, the polynomial-fitted input used in TeaCache, and their SEA-filtered counterparts obtained by applying the proposed SEA filter.
The figure below reports the relative \(\ell_1\) distance between consecutive timesteps for these feature choices, averaged over ten samples on FLUX and Wan2.1 1.3B. The SEA-filtered input distances closely follow the SEA-filtered output distances along the entire trajectory, while the raw input and the polynomial-fitted input show weaker alignment, especially at early timesteps.
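As a toy illustration of why filtered distances track content rather than noise, the sketch below compares the relative \(\ell_1\) distance between two noisy copies of the same low-frequency content, before and after filtering. A Gaussian low-pass filter stands in for the SEA filter, and the names `sea_filter` and `rel_l1` are ours, not from the paper's code:

```python
import numpy as np

def sea_filter(feat, g):
    # FFT -> multiply by a frequency-domain filter g -> iFFT, keep the real part
    return np.fft.ifft2(np.fft.fft2(feat) * g).real

def rel_l1(a, b):
    # Relative L1 distance between consecutive features
    return np.abs(a - b).mean() / (np.abs(b).mean() + 1e-8)

h = w = 32
fy = np.fft.fftfreq(h)[:, None]
fx = np.fft.fftfreq(w)[None, :]
g = np.exp(-(fy**2 + fx**2) / (2 * 0.05**2))  # Gaussian low-pass (illustrative)

yy, xx = np.mgrid[0:h, 0:w]
content = np.sin(2 * np.pi * yy / h) + np.cos(2 * np.pi * xx / w)  # low-freq "signal"
rng = np.random.default_rng(0)
feat_t = content + 0.5 * rng.standard_normal((h, w))   # step t: content + noise
feat_t1 = content + 0.5 * rng.standard_normal((h, w))  # step t+1: same content, fresh noise

d_raw = rel_l1(feat_t, feat_t1)
d_sea = rel_l1(sea_filter(feat_t, g), sea_filter(feat_t1, g))
print(d_raw, d_sea)  # the raw distance is dominated by noise; the filtered one is far smaller
```

The content term cancels in both numerators, so the raw distance is almost entirely noise; after low-pass filtering, the noise is suppressed while the content (the denominator) survives, shrinking the relative distance.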
Relative \(\ell_1\) distance across the generation process on FLUX
Relative \(\ell_1\) distance across the generation process on Wan2.1 1.3B
Given input features \(I_t\) and \(I_{t+1}\), SeaCache first applies FFT, multiplies by the timestep-dependent SEA filters \(G_t^{\mathrm{norm}}\) and \(G_{t+1}^{\mathrm{norm}}\), and then applies iFFT to obtain spectral-evolution-aware features \(\mathcal{P}(G_t^{\mathrm{norm}}, I_t)\) and \(\mathcal{P}(G_{t+1}^{\mathrm{norm}}, I_{t+1})\). A spectrum-aware dynamic caching module measures the relative distance \(\widetilde{\Delta}_t\) between consecutive filtered features, accumulates it over timesteps, and either reuses the cached output or refreshes the denoiser when the threshold \(\delta\) is exceeded. The underlying diffusion model remains unchanged, so SeaCache acts as a plug-and-play cache policy that replaces only the distance metric.
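A minimal sketch of this cache policy follows. Function and variable names are ours, `toy_denoiser` stands in for the expensive model call, and the identity filter is used only to keep the example self-contained:

```python
import numpy as np

def sea_features(feat, g_norm):
    # FFT -> multiply by the normalized SEA filter -> iFFT (real part)
    return np.fft.ifft2(np.fft.fft2(feat) * g_norm).real

def run_with_seacache(inputs, filters, denoiser, delta=0.3):
    """Schematic SeaCache loop: reuse the cached denoiser output while the
    accumulated SEA-filtered input distance stays below the threshold delta."""
    outputs, acc, cached, prev = [], 0.0, None, None
    for t, (x, g) in enumerate(zip(inputs, filters)):
        p = sea_features(x, g)
        if prev is not None:
            # relative L1 distance between consecutive filtered features
            acc += np.abs(p - prev).mean() / (np.abs(prev).mean() + 1e-8)
        if cached is None or acc >= delta:
            cached = denoiser(x, t)  # refresh: run the full denoiser
            acc = 0.0
        outputs.append(cached)       # otherwise reuse the cached output
        prev = p
    return outputs

calls = {"n": 0}
def toy_denoiser(x, t):
    calls["n"] += 1
    return 0.9 * x

steps = 10
xs = [np.ones((8, 8)) for _ in range(steps)]  # identical inputs: fully redundant
gs = [np.ones((8, 8)) for _ in range(steps)]  # identity "filter" for the toy example
outs = run_with_seacache(xs, gs, toy_denoiser, delta=0.3)
print(calls["n"])  # only the first step runs the denoiser; the rest reuse the cache
```

Because the diffusion model itself is untouched, swapping `toy_denoiser` for a real denoiser and `gs` for the per-timestep SEA filters is the only change needed to apply the policy.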
For all experiments, generated images and videos are saved as PNG and MP4 files, respectively. For text-to-image (T2I) generation, we evaluate 200 prompts from DrawBench and generate images at 1024\(\times\)1024 resolution with FLUX. For text-to-video (T2V) generation, we use 944 prompts from VBench and generate 480p videos with 65 frames per prompt with Wan2.1.
For each configuration, we use the full-timestep output of the original model as the reference. We compute PSNR, LPIPS, and SSIM between each cached sample and its reference, and then report the average over all samples. The initial random seed is shared across our method and all baselines. We consider two cache budgets, approximately 50% and 30%.
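For reference, PSNR against the full-compute output can be computed as below; this is the standard definition, where `max_val` is the image dynamic range (an assumption, since the exact evaluation code is not shown here):

```python
import numpy as np

def psnr(ref, img, max_val=1.0):
    """PSNR in dB between a cached sample and its full-compute reference."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val**2 / mse)

ref = np.zeros((4, 4))
img = np.full((4, 4), 0.1)
print(psnr(ref, img))  # MSE = 0.01 -> 20 dB
```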
Quantitative Comparison in FLUX
| Method | Latency (s) | TFLOPs | PSNR ↑ | LPIPS ↓ | SSIM ↑ |
|---|---|---|---|---|---|
| Original (50 steps) | 20.9 | 2976 | - | - | - |
| Vanilla 25 steps | 10.5 | 1487 | 15.55 | 0.409 | 0.668 |
| Vanilla 15 steps | 6.4 | 892 | 17.84 | 0.305 | 0.740 |
| TeaCache (δ=0.3) | 11.4 | 1547 | 20.76 | 0.211 | 0.810 |
| TaylorSeer (S=3) | 9.8 | 1191 | 22.78 | 0.163 | 0.828 |
| SeaCache (δ=0.3) | 9.4 | 1098 | 26.28 | 0.106 | 0.893 |
| △-Dit | 15.5 | 1984 | 17.40 | 0.336 | 0.710 |
| ToCa | 15.9 | 1263 | 18.39 | 0.324 | 0.700 |
| TeaCache (δ=0.6) | 7.1 | 892 | 17.21 | 0.348 | 0.714 |
| TaylorSeer (S=5) | 7.5 | 834 | 19.97 | 0.236 | 0.762 |
| SeaCache (δ=0.6) | 6.4 | 773 | 21.33 | 0.226 | 0.798 |
Quantitative Comparison in Wan2.1 1.3B
| Method | Latency (s) | TFLOPs | PSNR ↑ | LPIPS ↓ | SSIM ↑ |
|---|---|---|---|---|---|
| Original (50 steps) | 176.3 | 8214 | - | - | - |
| TeaCache (δ=0.09) | 86.6 | 4107 | 20.84 | 0.171 | 0.721 |
| TaylorSeer (S=2) | 93.1 | 4189 | 16.15 | 0.336 | 0.543 |
| SeaCache (δ=0.2) | 83.9 | 3942 | 26.60 | 0.075 | 0.873 |
| TeaCache (δ=0.15) | 63.6 | 2957 | 18.88 | 0.245 | 0.645 |
| TaylorSeer (S=3) | 67.1 | 2956 | 14.18 | 0.455 | 0.453 |
| SeaCache (δ=0.35) | 56.6 | 2793 | 21.78 | 0.170 | 0.740 |
"A skateboard on the top of a surfboard front view"
"A person is running on treadmill"
"An orange and a clock"
"A person is tai chi"
"A person is sweeping floor"
"A panda cooking in the kitchen"
"Vampire makeup face of beautiful girl red contact lenses"
"A truck and a bicycle"
@article{chung2026seacache,
title={SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models},
author={Chung, Jiwoo and Hyun, Sangeek and Lee, MinKyu and Han, Byeongju and Cha, Geonho and Wee, Dongyoon and Hong, Youngjun and Heo, Jae-Pil},
journal={arXiv preprint arXiv:2602.18993},
year={2026}
}