Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Sumedha Arya
DOI Link: https://doi.org/10.22214/ijraset.2026.77372
Text-to-image generation is the task of creating images from text descriptions. A large number of research publications have appeared in this area in recent years, reflecting its popularity. In this work, we review autoregressive models, non-autoregressive models, GANs, energy-based models, multimodal methods, and diffusion models used for text-to-image generation. We also discuss important techniques commonly used in these models, such as autoencoders, attention mechanisms, and classifier-free guidance. As an application, we perform a comparative analysis of a diffusion model and an autoencoder for text-to-image generation on the Flowers-HD5 dataset. The results show that the autoencoder converges rapidly and achieves significantly lower reconstruction loss (~0.01 range), producing sharp and faithful reconstructions, while the diffusion model, despite higher loss (~0.1–0.25), generates images with greater diversity.
Summary:
This paper reviews and compares major text-to-image generation techniques, focusing on modern generative AI models such as Autoregressive (AR), Non-Autoregressive (NAR), GAN, and Diffusion models. With rapid advancements in deep learning, especially since 2016, text-to-image generation has significantly improved in quality, realism, and scalability.
Text-to-image generation is conditional, meaning images are created based on textual input. Early models relied on captions, but modern systems use advanced architectures such as Transformers, attention mechanisms, and latent diffusion models.
Autoregressive (AR) Models
AR models generate images sequentially, predicting one token (pixel or patch) at a time using the chain rule of probability. Transformers greatly improved AR performance (e.g., DALL·E, CogView, Parti). Although capable of producing high-quality images, AR models are slow due to their sequential nature.
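To make the chain-rule factorization p(x) = ∏ p(x_i | x_<i) concrete, the following is a minimal PyTorch sketch of autoregressive token sampling with a causal mask; the tiny transformer and vocabulary are toy placeholders, not the architecture of DALL·E, CogView, or Parti.

```python
import torch
import torch.nn as nn

# Minimal sketch: sample image tokens one at a time, conditioning each
# prediction on all previously generated tokens (chain rule of probability).
# Model sizes and vocabulary are toy placeholders, not a real T2I system.
vocab_size, d_model, seq_len = 256, 64, 16

class TinyARModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h = self.embed(tokens)
        # Causal mask: position i only attends to positions <= i.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.backbone(h, mask=mask))

model = TinyARModel().eval()
tokens = torch.zeros(1, 1, dtype=torch.long)  # start token
with torch.no_grad():
    for _ in range(seq_len - 1):
        logits = model(tokens)[:, -1]                # p(x_i | x_<i)
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
print(tokens.shape)  # one token per forward pass -> slow decoding
```

Because every new token requires a full forward pass conditioned on all previous tokens, decoding cost grows with sequence length, which is the source of the slowness noted above.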
Non-Autoregressive (NAR) Models
NAR models generate multiple image components in parallel, significantly speeding up inference. Examples include MaskGIT and Muse. However, image quality can sometimes lag behind AR or diffusion approaches.
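A minimal sketch of the MaskGIT-style parallel decoding idea: all masked positions are predicted in one forward pass, and only the most confident predictions are kept across a few refinement steps. The bidirectional transformer, mask token, and unmasking schedule here are illustrative assumptions, not the actual MaskGIT or Muse models.

```python
import torch
import torch.nn as nn

# Toy sketch of MaskGIT-style parallel decoding: predict every masked token
# in one forward pass, keep the most confident predictions, repeat.
vocab_size, d_model, seq_len, mask_id = 256, 64, 16, 255

class TinyBidirectional(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.backbone(self.embed(tokens)))  # no causal mask

model = TinyBidirectional().eval()
tokens = torch.full((1, seq_len), mask_id)           # start fully masked
steps = 4
with torch.no_grad():
    for step in range(steps):
        logits = model(tokens)
        logits[:, :, mask_id] = float("-inf")         # never predict the mask token
        probs, preds = logits.softmax(-1).max(-1)     # parallel predictions
        still_masked = tokens == mask_id
        # Unmask a growing fraction of the remaining positions each step.
        k = max(1, int(still_masked.sum() * (step + 1) / steps))
        conf = probs.masked_fill(~still_masked, -1.0)
        keep = conf.topk(k, dim=-1).indices
        tokens[0, keep[0]] = preds[0, keep[0]]
print(tokens)  # all positions filled in a handful of parallel steps
```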
Generative Adversarial Networks (GANs)
GANs use a generator–discriminator framework trained adversarially. They produce sharp and realistic images but often suffer from unstable training and mode collapse. Many improvements (e.g., StyleGAN-T, GigaGAN) address alignment and stability issues.
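A minimal sketch of the adversarial objective on toy data: the discriminator learns to separate real from generated samples, while the generator learns to fool it. The flat 64-dimensional "images" and network sizes are placeholders; text-to-image GANs such as StyleGAN-T or GigaGAN additionally condition both networks on text embeddings.

```python
import torch
import torch.nn as nn

# Toy generator-discriminator pair to illustrate the adversarial loop;
# not a real text-to-image GAN.
z_dim, x_dim = 16, 64
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    real = torch.randn(32, x_dim)            # stand-in for a real image batch
    fake = G(torch.randn(32, z_dim))

    # Discriminator step: real -> 1, fake -> 0.
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator output 1 on fakes.
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```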
Diffusion Models
Diffusion models are currently the most dominant approach. They gradually add noise to images (forward process) and learn to remove noise step-by-step (reverse process). Innovations such as DDPM, Latent Diffusion Models (LDM), SDXL, and distillation techniques improved efficiency and stability. Diffusion models are preferred due to high-quality output and stable training compared to GANs.
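A minimal sketch of DDPM-style training: a clean sample is noised to a random timestep with the closed-form forward process x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, and the network is trained to predict the added noise ε with a mean-squared error. The flat toy denoiser is a stand-in for the usual U-Net with attention.

```python
import torch
import torch.nn as nn

# DDPM-style training sketch: noise a clean sample to a random timestep and
# train the model to predict the noise (simple MSE objective).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Toy denoiser over flat 64-dim "images"; a real model would be a U-Net
# with attention and a proper timestep embedding.
denoiser = nn.Sequential(nn.Linear(64 + 1, 128), nn.SiLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

for step in range(100):
    x0 = torch.randn(32, 64)                        # stand-in for an image batch
    t = torch.randint(0, T, (32,))
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise  # forward (noising) process
    t_feat = (t.float() / T).unsqueeze(1)           # crude timestep conditioning
    pred = denoiser(torch.cat([x_t, t_feat], dim=1))
    loss = ((pred - noise) ** 2).mean()             # predict the added noise
    opt.zero_grad(); loss.backward(); opt.step()
```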
Autoencoders (AE, VAE, VQ-VAE, VQ-GAN): Compress images into latent representations for efficient training and generation.
Text Encoding: Tokenization methods (BPE, WordPiece, SentencePiece) combined with encoders like BERT, T5, and CLIP.
Attention Mechanisms: Self-attention and cross-attention align text tokens with image regions, improving semantic consistency.
Classifier-Free Guidance (CFG): Enhances alignment between text and generated images by combining conditional and unconditional predictions during inference.
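As an illustration of the cross-attention mechanism listed above, the sketch below lets image-feature queries attend to text-token keys and values using PyTorch's built-in multi-head attention; the dimensions and tensors are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Cross-attention sketch: queries come from image features, keys/values from
# text token embeddings, so each image region can attend to relevant words.
d_model = 64
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

image_feats = torch.randn(1, 256, d_model)   # e.g. 16x16 latent patches
text_feats = torch.randn(1, 12, d_model)     # e.g. 12 encoded text tokens

out, weights = attn(query=image_feats, key=text_feats, value=text_feats)
print(out.shape, weights.shape)  # (1, 256, 64) and (1, 256, 12)
```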
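Classifier-free guidance can be written as ε̂ = ε_uncond + s·(ε_cond − ε_uncond), where s is the guidance scale. A minimal sketch follows; the `model` callable and the embeddings are hypothetical stand-ins for a real denoiser and text encoder.

```python
import torch

def cfg_noise_prediction(model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: blend conditional and unconditional predictions.

    `model(x_t, t, cond)` is a hypothetical denoiser returning a noise estimate;
    `null_emb` is the embedding of an empty prompt.
    """
    eps_cond = model(x_t, t, text_emb)
    eps_uncond = model(x_t, t, null_emb)
    # eps_hat = eps_uncond + s * (eps_cond - eps_uncond); s > 1 strengthens
    # alignment with the text at some cost in diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with a dummy "model" so the sketch runs standalone.
dummy = lambda x, t, c: x * 0.1 + c.mean()
x = torch.randn(1, 64)
print(cfg_noise_prediction(dummy, x, t=10,
                           text_emb=torch.randn(1, 16),
                           null_emb=torch.zeros(1, 16)).shape)
```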
The study compares two models using a Flowers dataset:
1. Attention-Based Diffusion Model
Uses a U-Net architecture with attention layers.
Images are normalized to [-1,1]; text is converted to embeddings.
Follows diffusion training (noise addition and denoising).
Trained for 10 epochs using the Adam optimizer.
Focused on generative image synthesis.
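For the generative synthesis step, sampling starts from pure Gaussian noise and denoises iteratively while conditioning on the text embedding. The sketch below is a hedged, simplified version of this reverse process; the flat toy denoiser and dimensions are placeholders for the paper's attention U-Net, which is not reproduced here.

```python
import torch
import torch.nn as nn

# Simplified reverse (sampling) process: iterative denoising from pure noise,
# conditioned on a text embedding. Toy denoiser, not the paper's U-Net.
T, img_dim, txt_dim = 1000, 64, 32
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

denoiser = nn.Sequential(nn.Linear(img_dim + txt_dim + 1, 128),
                         nn.SiLU(), nn.Linear(128, img_dim)).eval()
text_emb = torch.randn(1, txt_dim)           # stand-in for a caption embedding

x = torch.randn(1, img_dim)                  # start from pure Gaussian noise
with torch.no_grad():
    for t in reversed(range(T)):
        t_feat = torch.full((1, 1), t / T)
        eps = denoiser(torch.cat([x, text_emb, t_feat], dim=1))
        # DDPM posterior mean: remove the predicted noise contribution.
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
print(x.shape)  # the result would then be rescaled from [-1, 1] to pixel values
```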
2. Conditional Autoencoder
Uses a convolutional encoder-decoder structure.
Text embeddings are projected and concatenated with image latent features.
Reconstructs images rather than generating from noise.
Trained for 10 epochs with stable convergence.
Simpler architecture without attention or residual connections.
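A minimal sketch of the conditional autoencoder described above: a convolutional encoder compresses the image, a projected text embedding is broadcast and concatenated with the latent features, and a decoder reconstructs the image under an MSE objective. Layer sizes, the text-embedding dimension, and the random stand-in data are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Conditional autoencoder sketch: encode image -> latent, concatenate a
# projected text embedding, decode back to an image, train with MSE.
class CondAutoencoder(nn.Module):
    def __init__(self, txt_dim=32, latent_ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_ch, 4, stride=2, padding=1), nn.ReLU())
        self.text_proj = nn.Linear(txt_dim, latent_ch)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * latent_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, img, text_emb):
        z = self.encoder(img)                              # (B, C, H/4, W/4)
        t = self.text_proj(text_emb)[:, :, None, None]     # (B, C, 1, 1)
        t = t.expand(-1, -1, z.size(2), z.size(3))         # broadcast over space
        return self.decoder(torch.cat([z, t], dim=1))

model = CondAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
img = torch.rand(8, 3, 64, 64) * 2 - 1     # stand-in for Flowers images in [-1, 1]
txt = torch.randn(8, 32)                   # stand-in for text embeddings
for epoch in range(10):
    recon = model(img, txt)
    loss = ((recon - img) ** 2).mean()     # reconstruction objective
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```

Because the target is always the input image, this objective converges quickly but, as noted above, it rewards faithful reconstruction rather than novel synthesis.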
Diffusion Model:
Starts with high loss due to noise learning.
Loss decreases significantly but fluctuates due to stochastic noise addition.
Slower training but better suited for high-quality image generation.
Autoencoder:
Starts with lower loss and converges quickly.
Stable and faster training.
Limited generative diversity and may plateau early.
This study compared a conditional autoencoder and an attention-based diffusion model for text-conditioned flower image generation. The autoencoder learned quickly and reconstructed images accurately with low error, but the images it produced were not novel or diverse. The diffusion model trained more slowly and had higher loss, but it was able to generate entirely new images that matched the text descriptions. Overall, the autoencoder is fast and efficient for reconstruction tasks, while the diffusion model is more powerful for text-to-image generation.
Copyright © 2026 Sumedha Arya. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET77372
Publish Date : 2026-02-09
ISSN : 2321-9653
Publisher Name : IJRASET
