TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models

Anonymous Authors

TL;DR: We enable text-based image editing using 3-4 step diffusion models.



Teaser.

Abstract

Diffusion models have opened the path to a wide range of text-based image editing frameworks. However, these typically build on the multi-step nature of the backward diffusion process, and adapting them to distilled, fast-sampling methods has proven surprisingly challenging. Here, we focus on a popular line of text-based editing frameworks - the “edit-friendly” DDPM-noise inversion approach. We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength. We trace the artifacts to mismatched noise statistics between inverted noises and the expected noise schedule, and suggest a shifted noise schedule which corrects for this offset. To increase editing strength, we propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts. All in all, our method enables text-based image editing with as few as three diffusion steps, while providing novel insights into the mechanisms behind popular text-based editing approaches.

What is this about?

We apply text-based editing methods like Edit-Friendly DDPM inversion to SDXL-Turbo and discover two shortcomings: (1) the appearance of visual artifacts, and (2) insufficient editing strength.

We trace the artifacts to misaligned noise statistics, and propose a time-shifting method to correct them. To improve editing strength, we analyze the Edit-Friendly equations and show that they can be broken into two components - one responsible for shifting the image between prompts, and one for shifting it between diffusion trajectories. We rescale the cross-prompt term and demonstrate that this increases editability without introducing new artifacts. For additional details, please see the paper.

Fixing Visual Artifacts

Following Edit-Friendly, we observe that the noise statistics of the inverted noise maps deviate significantly from the expected values at each step. In many-step diffusion models, these statistics tend to converge towards the end of the diffusion process, and the model can deal with any artifacts introduced along the way. With SDXL-Turbo, these later steps are entirely skipped, and the artifacts remain.

We observe that the misaligned statistics are roughly time-shifted: the inverted noise statistics at each step match the values expected roughly 200 steps earlier. Hence, we simply provide both the scheduler and the model with a timestep parameter that is also shifted by 200 steps, eliminating the domain gap.
Edit-Friendly inversion with SDXL-Turbo leads to noise statistics (red) which are misaligned with the expected values (green). We propose a simple time-shift approach to re-align them (blue, purple), greatly reducing artifacts.
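To make the mechanism concrete, below is a minimal sketch of a shifted denoising step written against a diffusers-style interface. The constant `SHIFT = 200`, the helper name `shifted_denoise_step`, and the exact call signatures are illustrative assumptions, not the released implementation.

```python
SHIFT = 200  # inverted-noise statistics match the schedule roughly 200 steps earlier

def shifted_denoise_step(unet, scheduler, latents, t, prompt_embeds, added_cond_kwargs):
    """One denoising step in which both the UNet and the scheduler are queried
    at a timestep shifted by SHIFT, so that the noise level they assume matches
    the statistics of the edit-friendly inverted noise maps."""
    t_shifted = max(int(t) - SHIFT, 0)

    # Predict the noise at the shifted timestep.
    noise_pred = unet(latents, t_shifted,
                      encoder_hidden_states=prompt_embeds,
                      added_cond_kwargs=added_cond_kwargs).sample

    # Advance the latents, again telling the scheduler we are at t_shifted.
    # In the edit-friendly formulation, the stochastic part of this step is
    # replaced by the pre-computed inverted noise map (omitted in this sketch).
    return scheduler.step(noise_pred, t_shifted, latents).prev_sample
```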

Pseudo-guidance

We analyze the Edit-Friendly equations and demonstrate that they can be decomposed into two terms - one which controls the strength of the prompt, and another which shifts the original image onto a new trajectory. We propose to apply a CFG-style rescaling only to the prompt term, and demonstrate that it indeed improves editing strength without introducing new artifacts. Please see the paper for more details.
Editing results when scaling the cross-prompt ($w_p$, columns) and the cross-trajectory ($w_t$, rows) terms. Scaling only the cross-prompt term leads to improved editing results, without artifacts.
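The rescaling itself amounts to a one-line change. Below is a schematic sketch assuming a function posterior_mean(x, t, prompt) that returns the model's predicted posterior mean at step $t$; the decomposition shown here follows the description above rather than the paper's exact notation.

```python
def pseudo_guided_mean(posterior_mean, x_edit, x_src, t,
                       src_prompt, tgt_prompt, w_p=1.5, w_t=1.0):
    """Recombine the edit-friendly prediction from its two components:
    a cross-prompt term (scaled by w_p) and a cross-trajectory term
    (scaled by w_t)."""
    base = posterior_mean(x_src, t, src_prompt)

    # Cross-prompt term: the change from swapping the source prompt for the
    # target prompt on the edited trajectory.
    delta_prompt = (posterior_mean(x_edit, t, tgt_prompt)
                    - posterior_mean(x_edit, t, src_prompt))

    # Cross-trajectory term: the change from moving off the original
    # trajectory onto the edited one, under the source prompt.
    delta_traj = (posterior_mean(x_edit, t, src_prompt)
                  - posterior_mean(x_src, t, src_prompt))

    return base + w_p * delta_prompt + w_t * delta_traj
```

With $w_p = w_t = 1$ this reduces to the standard edit-friendly prediction under the target prompt; increasing only $w_p$ (the columns in the figure above) is what strengthens the edit.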

Edit Friendly and Delta Denoising Score equivalence

Our investigation into the Edit-Friendly DDPM process reveals that it shares a similar form with the corrections employed by Delta Denoising Score (DDS). Surprisingly, we prove that under an appropriate choice of learning rates and time-step sampling, the two methods are functionally equivalent and produce exactly the same results. This finding also extends to the recent Posterior Distillation Sampling (PDS) method when it is applied to image editing.
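For context, the standard Delta Denoising Score correction (written here in our own notation, and simplified to omit classifier-free guidance) is a difference of two noise predictions, which is the structural similarity the equivalence builds on:

```python
def dds_correction(eps_model, x_edit, x_src, t, tgt_prompt, src_prompt,
                   alpha_bar_t, noise):
    """Delta Denoising Score update direction: forward-diffuse the edited image
    and the reference image with the *same* noise sample, then take the
    difference between the target-prompt and source-prompt noise predictions.
    `alpha_bar_t` is the cumulative noise-schedule coefficient at step t."""
    x_edit_t = alpha_bar_t.sqrt() * x_edit + (1 - alpha_bar_t).sqrt() * noise
    x_src_t  = alpha_bar_t.sqrt() * x_src  + (1 - alpha_bar_t).sqrt() * noise

    return eps_model(x_edit_t, t, tgt_prompt) - eps_model(x_src_t, t, src_prompt)
```

DDS updates the edited image with this correction using some learning rate; the equivalence result states that, for a specific choice of that learning rate and of the time-step sampling, the iterates coincide exactly with the Edit-Friendly output.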


Results

Text-based editing results generated using our method with SDXL-Turbo and 4 steps.

Comparisons to Prior Work (Multi-step)

We compare our 4-step results to existing state-of-the-art editing approaches in the many-step regime. Our method achieves quality better than or comparable to the state of the art, while running 6× faster than the fastest baseline and up to 630× faster than the top-scoring method.

Comparisons to Prior Work (Few-step)

We further compare our method to few-step alternatives. Our method better preserves the original image content while matching the semantic intent of the edit, and it avoids the visual artifacts that appear in the baseline Edit-Friendly approach.

More results

Additional text-based editing results generated using our method with SDXL-Turbo and 4 steps.

BibTeX

If you find our work useful, please cite our paper:

Coming soon