Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
(CVPR 2024 Highlight)

Jiwoo Chung*, Sangeek Hyun*, Jae-Pil Heo†
*: equal contribution, †: corresponding author
Sungkyunkwan University
Main Results

Abstract

Despite the impressive generative capabilities of diffusion models, existing diffusion-based style transfer methods either require inference-stage optimization (e.g., fine-tuning or textual inversion of the style), which is time-consuming, or fail to leverage the generative ability of large-scale diffusion models.

To address these issues, we introduce a novel artistic style transfer method based on a pre-trained large-scale diffusion model, without any optimization. Specifically, we manipulate the features of the self-attention layers in the manner of the cross-attention mechanism: during the generation process, we substitute the key and value of the content with those of the style image. This approach provides several desirable characteristics for style transfer, including 1) preservation of content by transferring similar styles onto similar image patches and 2) transfer of style based on the similarity of local textures (e.g., edges) between the content and style images.

Furthermore, we introduce query preservation and attention temperature scaling to mitigate the disruption of the original content, and initial latent AdaIN to deal with disharmonious colors. Our experimental results demonstrate that the proposed method surpasses state-of-the-art baselines, both conventional and diffusion-based.

Method Overview

(Left) We first invert the content image \(z_0^c\) and the style image \(z_0^s\) into the latent noise space, obtaining \(z_T^c\) and \(z_T^s\), respectively. Then, we initialize the noise of the stylized image, \(z_T^{cs}\), via initial latent AdaIN, which combines the content and style noises \(z_T^c\) and \(z_T^s\). While performing the reverse diffusion process from \(z_T^{cs}\), we inject content and style information through attention-based style injection and attention temperature scaling.
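To make the initial latent AdaIN step concrete, here is a minimal PyTorch sketch (our own illustration, not the official code): it matches the per-channel statistics of the content noise to those of the style noise, following the standard AdaIN formulation. The function name and tensor shapes are assumptions.

import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Align the per-channel mean/std of `content` with those of `style`.

    Both inputs are latent noises shaped (B, C, H, W).
    """
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

# Initial latent AdaIN: the stylized latent starts from content noise
# whose channel statistics are matched to the style noise.
# z_cs_T = adain(z_c_T, z_s_T)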

(Right) Style injection is a manipulation of the self-attention (SA) layers during the reverse diffusion process. Specifically, at time step \(t\), we substitute the key and value in the SA of the stylized image with those of the style features. We then scale the magnitude of the attention map to compensate for the magnitude decrease that this feature substitution causes. Initial latent AdaIN produces the initial noise \(z_T^{cs}\) by combining the style noise \(z_T^s\) and the content noise \(z_T^c\).
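As a sketch of this manipulated self-attention in PyTorch (illustrative code of ours, not the official implementation; the default values of \(\gamma\) and \(\tau\) below are assumptions), all three components fit in a few lines:

import torch

def style_injected_attention(q_cs, q_c, k_s, v_s, gamma=0.75, tau=1.5):
    """Self-attention of the stylized image with style injection.

    q_cs, q_c: queries of the stylized/content features, (B, heads, N, d).
    k_s, v_s:  key/value of the style features,          (B, heads, M, d).
    gamma:     query preservation ratio.
    tau:       attention temperature scale (> 1 sharpens the map).
    """
    # Query preservation: blend the content query into the stylized one
    # so the attention map keeps following the content structure.
    q = gamma * q_c + (1.0 - gamma) * q_cs

    # Cross-attention-like substitution: key and value come from the style.
    d = q.shape[-1]
    logits = q @ k_s.transpose(-2, -1) / d**0.5

    # Temperature scaling compensates for the magnitude drop caused by
    # substituting the key with style features.
    attn = (tau * logits).softmax(dim=-1)
    return attn @ v_s

This single function covers key/value substitution, query preservation, and attention temperature scaling as described above.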

Overview of our method

Results

Qualitative results of our method

Query Preservation (0.0~1.0)

(Interactive viewer: stylized outputs as the query preservation ratio is swept from 0.0 to 1.0, with the style and content reference images shown at the endpoints.)

Attention Scaling (1.0~2.0)

(Interactive viewer: stylized outputs as the attention temperature scale is swept from 1.0 to 2.0, with the style and content reference images shown at the endpoints.)

BibTeX

@article{chung2023style,
  title={Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer},
  author={Chung, Jiwoo and Hyun, Sangeek and Heo, Jae-Pil},
  journal={arXiv preprint arXiv:2312.09008},
  year={2023}
}