The naive DDIM inversion process usually suffers from a trajectory deviation issue, i.e., the latent trajectory during reconstruction deviates from the one during inversion. To alleviate this issue, previous methods either learn to mitigate the deviation or design cumbersome compensation strategies to reduce the mismatch error, both incurring substantial time and computation costs. In this work, we present a nearly free-lunch method (named FreeInv) that addresses the issue more effectively and efficiently. In FreeInv, we randomly transform the latent representation and keep the transformation the same between each inversion time-step and its corresponding reconstruction time-step. FreeInv is motivated by a statistical observation: an ensemble of DDIM inversion processes over multiple trajectories yields a smaller trajectory mismatch error in expectation. Moreover, through theoretical analysis and empirical study, we show that FreeInv performs an efficient ensemble of multiple trajectories. FreeInv can be freely integrated into existing inversion-based image and video editing techniques. For inverting video sequences in particular, it brings even more significant fidelity and efficiency gains. Comprehensive quantitative and qualitative evaluation on the PIE benchmark and the DAVIS dataset shows that FreeInv remarkably outperforms conventional DDIM inversion and is competitive with previous state-of-the-art inversion methods, while offering superior computational efficiency.
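To make the mechanism concrete, below is a minimal PyTorch sketch of the core idea as stated above: sample a random invertible transform at each inversion step, record it, and replay the same transform at the matching reconstruction step. The noise-prediction model `eps_model`, the cumulative schedule `alphas`, and the flip/rotation transform family are illustrative assumptions, not the paper's exact implementation.

```python
import random
import torch

def random_transform():
    """Sample an invertible latent transform T and its inverse T^{-1}.

    Illustration only: 90-degree rotations and horizontal flips
    (assumes square latents); the paper's transform family may differ.
    """
    k = random.randrange(4)            # number of 90-degree rotations
    flip = random.random() < 0.5       # whether to flip horizontally
    def fwd(z):
        z = torch.rot90(z, k, dims=(-2, -1))
        return torch.flip(z, dims=(-1,)) if flip else z
    def inv(z):
        z = torch.flip(z, dims=(-1,)) if flip else z
        return torch.rot90(z, -k, dims=(-2, -1))
    return fwd, inv

def ddim_step(z, t_from, t_to, eps_model, alphas):
    """One deterministic DDIM step (works in both directions)."""
    a_from, a_to = alphas[t_from], alphas[t_to]
    eps = eps_model(z, t_from)                              # predicted noise
    z0 = (z - (1 - a_from).sqrt() * eps) / a_from.sqrt()    # predicted clean latent
    return a_to.sqrt() * z0 + (1 - a_to).sqrt() * eps

def freeinv_invert(z0, timesteps, eps_model, alphas):
    """Inversion over increasing timesteps; record the transform per step."""
    z, transforms = z0, []
    for t_prev, t_next in zip(timesteps[:-1], timesteps[1:]):
        fwd, inv = random_transform()
        transforms.append((fwd, inv))
        z = inv(ddim_step(fwd(z), t_prev, t_next, eps_model, alphas))
    return z, transforms

def freeinv_reconstruct(zT, timesteps, transforms, eps_model, alphas):
    """Reconstruction: replay the SAME transform at the matching step."""
    z = zT
    for (fwd, inv), t_next, t_prev in zip(reversed(transforms),
                                          reversed(timesteps[1:]),
                                          reversed(timesteps[:-1])):
        z = inv(ddim_step(fwd(z), t_next, t_prev, eps_model, alphas))
    return z
```

Each sampled transform defines a different latent trajectory, so over the course of sampling the procedure behaves like the trajectory ensemble described above, while requiring no extra network evaluations.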
We conduct a comparison with state-of-the-art inversion-enhancing techniques, covering Null-text Inversion [1], EDICT [2], the edit-friendly DDPM noise space [3], inversion-free editing [4], BELM [5], and PnP Inversion [6], using Prompt-to-Prompt (P2P) [7] as the editing framework.
Thanks to its operational simplicity, FreeInv can be readily plugged into existing inversion-based image editing frameworks. Besides P2P [7], we compare the image editing results of PnP [8] and MasaCtrl [9] with and without FreeInv.
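As a hypothetical illustration of what such a plug-in looks like, the only change to a host editing pipeline is swapping its inversion call for `freeinv_invert` (from the sketch above) and threading the recorded transforms through to the sampler. `encode_image` and `run_editing` below stand in for the host framework's own routines and are not real APIs:

```python
def edit_with_freeinv(image, target_prompt, timesteps, eps_model, alphas,
                      encode_image, run_editing):
    """Hypothetical glue code; encode_image / run_editing are placeholders
    for the host editing framework's routines (e.g., P2P, PnP, MasaCtrl)."""
    z0 = encode_image(image)  # VAE-encode the image into the latent space
    zT, transforms = freeinv_invert(z0, timesteps, eps_model, alphas)
    # The editor denoises zT as usual; the only extra requirement is replaying
    # `transforms` at the matching timesteps (cf. freeinv_reconstruct above).
    return run_editing(zT, target_prompt, transforms)
```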
We integrate FreeInv into a representative inversion-based video editing method, TokenFlow [10]. The video editing results are presented as follows.
Input | "Lionel Messi" | "LeBron James" | "Will Smith" |
---|---|---|---|
Input | "Pixar Animation" | "A Tiger" | "An Orange Cat" |
---|---|---|---|
Input | "8-bit pixel art" | "A marble sculpture" | "Pixar animation" |
---|---|---|---|
Input | "In the forest" | "Brown trousers" | "A silver robot" |
---|---|---|---|
Input | "A car drifting on the ice" |
---|---|
We compare video editing results among our method, TokenFlow [10], and STEM-Inv [11].
"Lionel Messi" | Ours | TokenFlow ([10]) | STEM-Inv ([11]) |
---|---|---|---|
"Pixar animation" | Ours | TokenFlow ([10]) | STEM-Inv ([11]) |
---|---|---|---|
"A black SUV" | Ours | TokenFlow ([10]) | STEM-Inv ([11]) |
---|---|---|---|
We compare reconstruction results between DDIM inversion and FreeInv. In addition, we show the editing results obtained from the inverted latents, denoted as DDIM editing and FreeInv editing, respectively. The visualization demonstrates that FreeInv boosts reconstruction fidelity and, in turn, benefits editing quality.
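For readers who want a number rather than a visual check, the fidelity gap described above can be quantified with a standard PSNR measurement between the source image and each reconstruction. A minimal sketch; the reconstruction tensors are placeholders produced by the respective pipelines:

```python
import torch

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio between two images scaled to [0, max_val]."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Usage sketch: `source`, `recon_ddim`, and `recon_freeinv` are image tensors
# produced by the respective reconstruction pipelines (placeholders here).
# print(psnr(source, recon_ddim), psnr(source, recon_freeinv))
```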
```bibtex
@article{bao2025freeinv,
  title={FreeInv: Free Lunch for Improving DDIM Inversion},
  author={Bao, Yuxiang and Liu, Huijie and Gao, Xun and Fu, Huan and Kang, Guoliang},
  journal={arXiv preprint arXiv:2503.23035},
  year={2025}
}
```
[1] Ron Mokady, et al. "Null-text inversion for editing real images using guided diffusion models." In CVPR, 2023.
[2] Bram Wallace, et al. "EDICT: Exact diffusion inversion via coupled transformations." In CVPR, 2023.
[3] Inbar Huberman-Spiegelglas, et al. "An edit friendly DDPM noise space: Inversion and manipulations." In CVPR, 2024.
[4] Sihan Xu, et al. "Inversion-free image editing with natural language." In CVPR, 2024.
[5] Fangyikang Wang, et al. "BELM: Bidirectional explicit linear multi-step sampler for exact inversion in diffusion models." In NeurIPS, 2024.
[6] Xuan Ju, et al. "PnP Inversion: Boosting diffusion-based editing with 3 lines of code." In ICLR, 2024.
[7] Amir Hertz, et al. "Prompt-to-prompt image editing with cross attention control." In ICLR, 2023.
[8] Narek Tumanyan, et al. "Plug-and-play diffusion features for text-driven image-to-image translation." In CVPR, 2023.
[9] Mingdeng Cao, et al. "MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing." In ICCV, 2023.
[10] Michal Geyer, et al. "TokenFlow: Consistent diffusion features for consistent video editing." In ICLR, 2024.
[11] Maomao Li, et al. "A video is worth 256 bases: Spatial-temporal expectation-maximization inversion for zero-shot video editing." In CVPR, 2024.