
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Theodoros Kouzelis (Archimedes, Athena RC; National Technical University of Athens), Efstathios Karypidis (Archimedes, Athena RC; National Technical University of Athens), Ioannis Kakogeorgiou (Archimedes, Athena RC; IIT, NCSR "Demokritos"), Spyros Gidaris (valeo.ai), Nikos Komodakis (Archimedes, Athena RC; University of Crete; IACM-Forth)
Figure 1: ReDi: Our generative image modeling framework bridges the gap between generative modeling and representation learning by leveraging a diffusion model that jointly captures low-level image details (via VAE latents) and high-level semantic features (via DINOv2). Trained to generate coherent image–feature pairs from pure noise, this unified latent-semantic dual-space diffusion approach significantly boosts both generative quality and training convergence speed.
Abstract
Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a novel generative image modeling framework that seamlessly bridges this gap by leveraging a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder like DINO). Our latent-semantic diffusion approach learns to generate coherent image–feature pairs from pure noise, significantly enhancing both generative quality and training efficiency, all while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance, which leverages learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling. Code available at https://github.com/zelaki/ReDi.
Figure 2: Accelerated Training. Training curves (without Classifier-Free Guidance) for DiT-XL/2, SiT-XL/2, and SiT-XL/2+REPA, showing that our ReDi accelerates convergence by ×23 and ×6 (compared to DiT-XL/2 and SiT-XL/2+REPA, respectively).
1 Introduction
Latent diffusion models (LDMs) (Rombach et al., 2022) have emerged as a leading approach for high-quality image synthesis, achieving state-of-the-art results (Rombach et al., 2022; Yao et al., 2024; Ma et al., 2024). These models operate in two stages: first, a variational autoencoder (VAE) compresses images into a compact latent representation (Rombach et al., 2022; Kouzelis et al., 2025); second, a diffusion model learns the distribution of these latents, capturing their underlying structure.
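For readers unfamiliar with this two-stage setup, the following sketch illustrates it in code. It is a minimal illustration under common assumptions (epsilon-prediction objective, cosine noise schedule, image-shaped latents); the `vae` and `denoiser` objects are hypothetical placeholders, not the interface of any specific library.

```python
import torch
import torch.nn.functional as F

def alpha_bar(t, num_timesteps):
    # Cosine cumulative noise schedule: one common choice, not necessarily the paper's.
    s = 0.008
    return torch.cos(((t / num_timesteps) + s) / (1 + s) * torch.pi / 2) ** 2

def ldm_training_step(vae, denoiser, images, num_timesteps=1000):
    # Stage 1: a frozen, pretrained VAE compresses images into compact latents (B, C, H, W).
    with torch.no_grad():
        latents = vae.encode(images)
    # Stage 2: a diffusion model learns the distribution of these latents by denoising them.
    t = torch.randint(0, num_timesteps, (latents.shape[0],), device=latents.device)
    a = alpha_bar(t.float(), num_timesteps).view(-1, 1, 1, 1)
    noise = torch.randn_like(latents)
    noisy_latents = a.sqrt() * latents + (1 - a).sqrt() * noise
    pred_noise = denoiser(noisy_latents, t)   # epsilon-prediction objective
    return F.mse_loss(pred_noise, noise)
```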
Leveraging their intermediate features, pretrained LDMs have shown promise for various scene understanding tasks, including classification (Mukhopadhyay et al., 2023), pose estimation (Gong et al., 2023), and segmentation (Li et al., 2023b; Liu et al., 2023; Delatolas et al., 2025). However, their discriminative capabilities typically underperform specialized (self-supervised) representation learning approaches such as masking-based (He et al., 2022), contrastive (Chen et al., 2020), self-distillation (Caron et al., 2021), or vision-language contrastive (Radford et al., 2021a) methods. This limitation stems from an inherent tension in LDM training: the need to maintain precise low-level reconstruction while simultaneously developing semantically meaningful representations.
This observation raises a fundamental question: How can we leverage representation learning to enhance generative modeling? Recent work by Yu et al. (2025) (REPA) demonstrates that improving the semantic quality of diffusion features through distillation of pretrained self-supervised representations leads to better generation quality and faster convergence. Their results establish a clear connection between representation learning and generative performance.
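To make the contrast with our joint-modeling formulation concrete, a distillation-style alignment objective of this kind can be sketched as below. This is a simplified illustration, not REPA's exact implementation; the projection head, feature shapes, and loss form are assumptions.

```python
import torch.nn.functional as F

def alignment_loss(diffusion_hidden, teacher_feats, proj_head):
    """
    Distillation-style alignment (simplified): project intermediate diffusion-transformer
    tokens and maximize their patchwise cosine similarity with frozen features from a
    pretrained self-supervised encoder (e.g., DINOv2).
      diffusion_hidden: (B, N, D_model) tokens from an intermediate DiT/SiT layer
      teacher_feats:    (B, N, D_teacher) frozen encoder patch features
      proj_head:        small trainable MLP mapping D_model -> D_teacher
    """
    projected = proj_head(diffusion_hidden)
    cos = F.cosine_similarity(projected, teacher_feats, dim=-1)  # (B, N)
    return -cos.mean()  # added to the standard denoising loss during training
```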
Motivated by these insights, we investigate whether a more effective approach to leveraging representation learning can further enhance image generation performance. In this work, we contend that the answer is yes: rather than aligning diffusion features with external representations via distillation, we propose to jointly model both images (specifically their VAE latents) and their high-level semantic features extracted from a pretrained vision encoder (e.g., DINOv2 (Oquab et al., 2024)) within the same diffusion process. Formally, as shown in Figure 1, we define the forward diffusion process as \(q(\mathbf{x}_{t},\mathbf{z}_{t}|\mathbf{x}_{t-1},\mathbf{z}_{t-1})\) for \(t=1,...,T\), where \(\mathbf{x}_{0}=\mathbf{x}\) and \(\mathbf{z}_{0}\,=\,\mathbf{z}\) are the clean VAE latents and semantic features, respectively. The reverse process \(p_{\theta}(\mathbf{x}_{t-1},\mathbf{z}_{t-1}|\mathbf{x}_{t},\mathbf{z}_{t})\) learns to gradually denoise both modalities from Gaussian noise.
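For concreteness, under the standard DDPM-style Gaussian assumption with the two modalities noised independently at each step (an illustrative instantiation; SiT-style flow-matching formulations would use an interpolant instead), the forward process factorizes as

\[
q(\mathbf{x}_{t},\mathbf{z}_{t}|\mathbf{x}_{t-1},\mathbf{z}_{t-1}) = q(\mathbf{x}_{t}|\mathbf{x}_{t-1})\, q(\mathbf{z}_{t}|\mathbf{z}_{t-1}), \qquad
q(\mathbf{x}_{t}|\mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_{t};\, \sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1},\, \beta_{t}\mathbf{I}\right), \qquad
q(\mathbf{z}_{t}|\mathbf{z}_{t-1}) = \mathcal{N}\!\left(\mathbf{z}_{t};\, \sqrt{1-\beta_{t}}\,\mathbf{z}_{t-1},\, \beta_{t}\mathbf{I}\right),
\]

with the reverse model \(p_{\theta}(\mathbf{x}_{t-1},\mathbf{z}_{t-1}|\mathbf{x}_{t},\mathbf{z}_{t})\) trained to denoise both modalities jointly.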
This joint modeling approach forces the diffusion model to explicitly learn the joint distribution of both precise low-level (VAE) and high-level semantic (DINOv2) features. We implement this approach, called ReDi, within the DiT (Peebles & Xie, 2023) and SiT (Ma et al., 2024) frameworks with minimal modifications to their transformer architecture: we apply standard diffusion noise to both representations, combine them into a single set of tokens, and train the standard diffusion transformer architecture to denoise both components simultaneously, as sketched below.
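The sketch below illustrates this recipe. It is a minimal, hypothetical implementation: the projection layers, token ordering, epsilon-prediction loss, and the `dit` backbone interface are assumptions and do not reproduce the released ReDi code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDenoisingStep(nn.Module):
    """Illustrative ReDi-style training step: noise VAE latents and DINOv2 features
    with the same diffusion schedule, merge them into one token sequence, and
    denoise both with a single diffusion transformer."""

    def __init__(self, dit, d_model, d_latent, d_feat):
        super().__init__()
        self.dit = dit                                  # standard DiT/SiT backbone (placeholder)
        self.latent_in = nn.Linear(d_latent, d_model)   # tokenize noisy VAE latents
        self.feat_in = nn.Linear(d_feat, d_model)       # tokenize noisy semantic features
        self.latent_out = nn.Linear(d_model, d_latent)
        self.feat_out = nn.Linear(d_model, d_feat)

    def forward(self, x0, z0, t, alpha_bar):
        # x0: (B, Nx, d_latent) clean VAE latent tokens
        # z0: (B, Nz, d_feat)   clean DINOv2 feature tokens
        # alpha_bar: (B,) cumulative noise level at timestep t
        a = alpha_bar.view(-1, 1, 1)
        eps_x, eps_z = torch.randn_like(x0), torch.randn_like(z0)
        xt = a.sqrt() * x0 + (1 - a).sqrt() * eps_x     # same forward noise on both modalities
        zt = a.sqrt() * z0 + (1 - a).sqrt() * eps_z
        tokens = torch.cat([self.latent_in(xt), self.feat_in(zt)], dim=1)
        hidden = self.dit(tokens, t)                    # one backbone denoises the joint sequence
        pred_x = self.latent_out(hidden[:, : x0.shape[1]])
        pred_z = self.feat_out(hidden[:, x0.shape[1]:])
        return F.mse_loss(pred_x, eps_x) + F.mse_loss(pred_z, eps_z)
```

Because both modalities share one token sequence and one backbone, the only architectural additions in this sketch are the input and output projections for the semantic features, consistent with the paper's claim of minimal modifications to the standard diffusion transformer.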