TL;DR: A simple yet effective method for improving fine-tuned diffusion models by leveraging the unconditional priors of their base models.

Fine-tuned conditional diffusion models often show drastic degradation of their unconditional priors, which adversely affects conditional generation when using techniques such as CFG. We demonstrate that taking a diffusion model with a richer unconditional prior and combining its unconditional noise prediction with the conditional noise prediction of the fine-tuned model leads to substantial improvements in conditional generation quality. We show this across diverse conditional diffusion models, including Zero-1-to-3, Versatile Diffusion, InstructPix2Pix, and DynamiCrafter.

Abstract

Classifier-Free Guidance (CFG) is a fundamental technique in training conditional diffusion models. The common practice for CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, with a small dropout rate for conditioning. However, we observe that the joint learning of unconditional noise with limited bandwidth in training results in poor priors for the unconditional case. More importantly, these poor unconditional noise predictions become a serious cause of degraded conditional generation quality. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was derived from can be used for unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.

💡 Main Idea

Poor Unconditional Priors Affect Conditional Generation

Classifier-Free Guidance (CFG) jointly models the unconditional and conditional noise predictions with a single noise prediction network by randomly dropping the condition at a small rate (typically 5-20%) during training. At inference time, CFG samples using a linear combination of the unconditional and conditional noise predictions. CFG training is also the standard way to fine-tune existing large-scale diffusion models to incorporate new types of input conditions. However, this fine-tuning significantly degrades the quality of the model's unconditional generation, which in turn degrades the quality of its conditional generation: while the base model (the model from which the fine-tuned model was derived) produces high-quality unconditional samples, the unconditional samples from the fine-tuned model are low quality and lack semantic information.
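For reference, the standard CFG sampling rule can be written as follows; the notation here is the common parameterization (guidance scale w, null condition ∅) rather than anything taken verbatim from the paper:

\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)

A single network \epsilon_\theta supplies both terms, so a degraded unconditional prediction \epsilon_\theta(x_t, \varnothing) directly drags down the quality of the guided prediction.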

Finding Richer Unconditional Priors

Can we improve the quality of conditional generation by incorporating better unconditional priors during sampling? We already have access to a diffusion model with reliable unconditional priors: the base model that the conditional model was fine-tuned from. We propose a simple yet effective fix: combine the unconditional noise prediction of the base model with the conditional noise prediction of the fine-tuned model during CFG sampling. This simple modification yields significant improvements in conditional generation quality.
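A minimal sketch of this modified CFG step, assuming two callables, finetuned_eps and base_eps, that wrap the noise-prediction networks of the fine-tuned and base models (these names and call signatures are hypothetical, not taken from the paper's code):

import torch

def cfg_with_base_uncond(
    finetuned_eps,            # hypothetical: (x_t, t, cond) -> noise prediction of the fine-tuned model
    base_eps,                 # hypothetical: (x_t, t, cond) -> noise prediction of the base model
    x_t: torch.Tensor,        # current noisy latent
    t: torch.Tensor,          # diffusion timestep
    cond,                     # conditioning input (e.g., text, image, camera pose)
    null_cond,                # the "dropped" condition used for the unconditional branch
    guidance_scale: float = 7.5,
) -> torch.Tensor:
    # Conditional branch: fine-tuned model, as in standard CFG.
    eps_cond = finetuned_eps(x_t, t, cond)
    # Unconditional branch: replaced with the base model's prediction,
    # which retains a much richer unconditional prior.
    eps_uncond = base_eps(x_t, t, null_cond)
    # Standard CFG linear combination with the swapped-in unconditional term.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

The guided noise is then passed to the sampler exactly as in standard CFG; only the source of the unconditional branch changes.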

Choice of Unconditional Prior

Should the unconditional noise come from the base model or from another unconditional model? Notably, we find that it does not need to come from the true base model that the conditional model was fine-tuned from; another diffusion model with good unconditional priors also works. In our experiments, even for models fine-tuned from SD1.x, using the unconditional predictions of SD2.1 or even PixArt-α (a DiT rather than a UNet) also yields improvements.

🖼️ Qualitative Results

Our method improves the quality of various fine-tuned diffusion models across different tasks.

👁️ Novel View Synthesis: Zero-1-to-3 [1]

🏞️ Image Variation: Versatile Diffusion [2]

📊 Class Conditional Generation: Fine-tuned DiT [3]

🎥 Video Generation: DynamiCrafter [4]

Prompts (videos compared: DynamiCrafter vs. Ours):
"A woman carrying a bundle of plants over their head."
"A group of people riding bikes down a street."
"A man in a black suit and a sombrero, shouting loudly."

✂️ Image Editing: InstructPix2Pix [5]

✨ Additional Qualitative Results

Zero-1-to-3

Versatile Diffusion

DynamiCrafter

Prompts (videos compared: DynamiCrafter vs. Ours):
"A couple of horses are running in the dirt."
"A man sitting on steps playing an acoustic guitar."
"Two women eating pizza at a restaurant."

InstructPix2Pix

Acknowledgement

We thank Juil Koo for providing constructive feedback on our manuscript.

References

[1] Zero-1-to-3: Zero-shot One Image to 3D Object, Liu et al., ICCV 2023
[2] Versatile Diffusion: Text, Images and Variations All in One Diffusion Model, Xu et al., ICCV 2023
[3] Scalable Diffusion Models with Transformers, Peebles and Xie, ICCV 2023 (Oral)
[4] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors, Xing et al., ECCV 2024 (Oral)
[5] InstructPix2Pix: Learning to Follow Image Editing Instructions, Brooks et al., CVPR 2023 (Highlight)

BibTeX

@misc{phunyaphibarn2025unconditionalpriorsmatterimproving,
  title={Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models},
  author={Prin Phunyaphibarn and Phillip Y. Lee and Jaihoon Kim and Minhyuk Sung},
  year={2025},
  eprint={2503.20240},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.20240},
}