IJRASET Journal for Research in Applied Science and Engineering Technology
Authors: Saurav ., Dr. Naveen Kumar
DOI Link: https://doi.org/10.22214/ijraset.2025.71539
Text-to-image generation, driven by recent advances in large language models (LLMs), has rapidly progressed from simple pixel synthesis to the generation of high-resolution, contextually accurate images from natural language prompts. This paper presents a comprehensive survey of the fundamental principles, architectural components, applications, and experimental results associated with LLM-driven text-to-image systems. We examine the integration of transformers with generative adversarial networks (GANs) [1], diffusion models [2], and multi-modal encoders such as CLIP (Contrastive Language-Image Pretraining) [3], alongside newer systems such as DALL·E [6] and Midjourney. The study also examines the limitations, ethical concerns, and scalability challenges inherent in these systems. An experimental test comparing the performance of different models across prompt types is presented, highlighting their strengths and failure modes. The paper concludes with insights into future trends, including real-time generation, augmentation of artistic creativity, and interactive design tools. A timeline of major breakthroughs is also included to trace the field's advancement.
Recent advancements in Natural Language Processing (NLP) and Computer Vision (CV) have enabled Large Language Models (LLMs) to generate images from text with remarkable realism and creativity. By integrating language understanding with visual synthesis, tools like DALL·E, Midjourney, Stable Diffusion, and Imagen are pushing the boundaries of multimodal AI. This research paper explores the architecture, training, and applications of these systems, highlighting their strengths, limitations, and future potential.
Text-to-image generation converts written prompts into visual images using AI models that combine:
LLMs (for understanding language semantics)
Vision models (for image synthesis)
The process involves:
Analyzing the prompt (e.g., "a cat sitting on a windowsill")
Converting it into image embeddings
Using diffusion models or GANs to generate visuals
Enhancing results through attention mechanisms, prompt engineering, and denoising steps
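As a concrete illustration of these stages, the following minimal sketch uses the open-source diffusers library with a publicly hosted Stable Diffusion checkpoint; the model id, sampler settings, and file names are illustrative assumptions rather than the exact setup evaluated in this paper.

```python
# Minimal text-to-image sketch using the open-source `diffusers` library.
# The checkpoint id, step count, and guidance scale below are illustrative
# assumptions, not the specific configuration used in this study.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # CLIP text encoder + U-Net + VAE decoder
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a cat sitting on a windowsill"
image = pipe(
    prompt,
    num_inference_steps=30,  # iterative denoising steps
    guidance_scale=7.5,      # strength of prompt (classifier-free) guidance
).images[0]
image.save("cat_windowsill.png")
```

Internally, the prompt is tokenized and embedded by a CLIP text encoder, and that embedding conditions each denoising step of the U-Net through cross-attention, mirroring the prompt-analysis, embedding, and denoising stages listed above.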
Applications: Art, education, marketing, accessibility, entertainment
Challenges: Ethical concerns, spatial reasoning issues, training data bias
Generative AI systems create new content (text, images, music) using models like:
GANs (Generative Adversarial Networks)
VAEs (Variational Autoencoders)
Transformers
Examples:
ChatGPT for text
DALL·E for images
These tools support creativity in fields like drug discovery, music, art, and design, but raise issues like misinformation, copyright, and environmental impact.
Early models like StackGAN and AttnGAN had poor text-image coherence.
CLIP + LLMs improved multimodal learning by aligning text and images (a scoring sketch follows this list).
Diffusion models replaced GANs due to their superior image quality and diversity.
Modern comparisons:
DALL·E 2: a diffusion decoder guided by CLIP text and image embeddings.
Stable Diffusion: Open-source and tunable for developers.
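To make the text-image alignment idea concrete, the sketch below scores how well candidate captions describe an image using the publicly released CLIP model through the transformers library; the model id and image file name are illustrative assumptions.

```python
# Scoring text-image alignment with CLIP; model id and image file are
# illustrative assumptions, not artifacts of this study.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")  # e.g. an image produced by a generator
captions = ["a cat on a windowsill", "a dog running on a beach"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image   # shape: (1, num_captions)
probs = logits.softmax(dim=1)[0]            # higher = better text match
for caption, p in zip(captions, probs.tolist()):
    print(f"{caption}: {p:.3f}")
```

This similarity signal is what CLIP-guided generators optimize, which is why aligned text-image embeddings improved coherence over earlier GAN-only approaches.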
LLMs have dramatically improved semantic alignment and context understanding in image generation. They can translate:
Adjectives, relational phrases, and abstract ideas into visual forms.
Example prompts like “a cat under a tree beside a red bicycle” show good object inclusion but lack accurate spatial arrangement.
Persistent limitations:
Poor handling of spatial reasoning and text rendering
Ethical concerns: bias, misinformation, IP infringement
Emerging capabilities:
Use of layout hints and sketches (a conditioning example follows this list)
Integration in marketing, education, medicine, gaming, and e-commerce
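One widely used way to supply the layout hints and sketches mentioned above is ControlNet-style conditioning, in which an auxiliary network injects spatial structure into the diffusion process. The sketch below is a minimal illustration with diffusers and publicly hosted checkpoints; the model ids and the edge-map input file are assumptions, not the specific tools surveyed here.

```python
# Layout/sketch-conditioned generation via a ControlNet attached to a
# Stable Diffusion pipeline. Model ids and the input file are illustrative.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

layout = Image.open("scene_layout_canny.png")  # rough sketch / Canny edge map
image = pipe(
    "a cat under a tree beside a red bicycle",
    image=layout,               # spatial arrangement comes from the edge map
    num_inference_steps=30,
).images[0]
image.save("layout_guided.png")
```

Because the spatial arrangement is supplied explicitly, this kind of conditioning sidesteps the prompt-only spatial-reasoning failures noted earlier.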
The study involved theoretical analysis and empirical testing using prompts categorized as:
Descriptive: “A lion on a cloud throne”
Abstract: “Time as a melting clock”
Instructional: “Logo for a space travel company”
Performance was evaluated based on realism, coherence, and creativity.
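As a rough, automatable proxy for the coherence criterion, the hypothetical loop below generates one image per prompt category and records a CLIP similarity score; the model ids are assumptions, and the score stands in for only one of the three criteria above.

```python
# Hypothetical scoring loop: one image per prompt category, scored with a
# CLIP-based coherence proxy. Model ids are illustrative assumptions; this
# is not the evaluation protocol used in this study.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = {
    "descriptive":   "A lion on a cloud throne",
    "abstract":      "Time as a melting clock",
    "instructional": "Logo for a space travel company",
}

for category, prompt in prompts.items():
    image = pipe(prompt, num_inference_steps=30).images[0]
    inputs = proc(text=[prompt], images=image, return_tensors="pt", padding=True)
    score = clip(**inputs).logits_per_image.item()  # unnormalized CLIP similarity
    print(f"{category:13s} CLIP score: {score:.2f}")
```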
Year | Milestone
---|---
2014 | GANs introduced
2017 | Transformers in NLP
2018 | StackGAN, AttnGAN
2020 | CLIP & VQ-VAE
2021 | DALL·E, GLIDE
2022 | Imagen, Midjourney benchmarks
2023 | Diffusion models go mainstream
2024 | Real-time content generation
Content Creation: Mood boards, concept art
Education: Visual aids for abstract ideas
Medical Imaging: Synthetic visuals for training
Marketing/E-commerce: Product visualization
Accessibility: Aiding the visually impaired with generated illustrations
Bias in training data
IP issues from using scraped content
Misuse risks: Deepfakes, fake news
Environmental cost: High energy use in training large models
The use of language models in generating images is a significant advancement in artificial intelligence, as it combines the ability to comprehend language with the capability to create visual representations. This research examines the rapid growth of this field, covering its technological advances, dominant models, and real-world applications. While the outcomes are groundbreaking, important matters remain to be addressed regarding ethics, accessibility, and fairness. In the future, multimodal AI systems will become more interactive, controllable, and collaborative, transforming the way humans create and consume visual media.
[1] Ramesh, A., Pavlov, M., Goh, G., Gray, S., et al. (2021). Zero-Shot Text-to-Image Generation. OpenAI.
[2] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., et al. (2021). Learning Transferable Visual Models from Natural Language Supervision (CLIP). OpenAI.
[3] Dhariwal, P., & Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. arXiv preprint.
[4] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
[5] Saharia, C., et al. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen). Google Research.
[6] Goodfellow, I., et al. (2014). Generative Adversarial Networks. NeurIPS.
[7] Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
[8] Wang, X., et al. (2018). Fine-Grained Text-to-Image Generation with Attentional Generative Adversarial Networks.
[9] Patashnik, O., et al. (2021). StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. ICCV.
[10] OpenAI (2022). DALL·E 2 Technical Report.
[11] Xu, T., Zhang, P., Huang, Q., Zhang, H., et al. (2018). AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. CVPR.
[12] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR.
[13] Nichol, A., Dhariwal, P., Ramesh, A., et al. (2021). GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv preprint.
[14] Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. NeurIPS.
[15] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
[16] Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR.
[17] Van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning. NeurIPS.
[18] Zhang, H., Xu, T., Li, H., et al. (2017). StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. ICCV.
[19] Karras, T., Laine, S., & Aila, T. (2019). A Style-Based Generator Architecture for Generative Adversarial Networks. CVPR.
[20] Brock, A., Donahue, J., & Simonyan, K. (2019). Large Scale GAN Training for High Fidelity Natural Image Synthesis. ICLR.
[21] Sohl-Dickstein, J., Weiss, E., et al. (2015). Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. ICML.
[22] Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS.
[23] Chen, M., Radford, A., Child, R., et al. (2020). Generative Pretraining from Pixels. ICML.
[24] Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR.
[25] Liu, Z., Lin, Y., Cao, Y., et al. (2021). Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. ICCV.
[26] Bommasani, R., Hudson, D. A., et al. (2021). On the Opportunities and Risks of Foundation Models. arXiv preprint.
[27] Weidinger, L., Mellor, J., Rauh, M., et al. (2021). Ethical and Social Risks of Harm from Language Models. arXiv preprint.
[28] Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT.
[29] Crawford, K. (2021). Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. Yale University Press.
[30] Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL.
[31] Patterson, D., Gonzalez, J., Le, Q., et al. (2021). Carbon Emissions and Large Neural Network Training. arXiv preprint.
[32] Lacoste, A., Luccioni, A., Schmidt, V., & Dandres, T. (2019). Quantifying the Carbon Emissions of Machine Learning. arXiv preprint.
[33] Schwartz, R., Dodge, J., Smith, N. A., & Etzioni, O. (2020). Green AI. Communications of the ACM.
[34] Hao, K. (2020). The Environmental Impact of Training AI Models. MIT Technology Review.
[35] Parcollet, T., & Ravanelli, M. (2021). The Energy and Carbon Footprint of Training End-to-End Speech Recognizers. Interspeech.
[36] Gebru, T., Morgenstern, J., Vecchione, B., et al. (2021). Datasheets for Datasets. Communications of the ACM.
[37] Mitchell, M., Wu, S., Zaldivar, A., et al. (2019). Model Cards for Model Reporting. FAccT.
[38] Raji, I. D., & Buolamwini, J. (2019). Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products. AIES.
[39] Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. FAccT.
[40] Carlini, N., Tramèr, F., Wallace, E., et al. (2021). Extracting Training Data from Large Language Models. USENIX Security Symposium.
[41] Brown, H., Lee, K., Mireshghallah, F., et al. (2022). What Does It Mean for a Language Model to Preserve Privacy? FAccT.
[42] Tomsett, R., Harborne, D., Chakraborty, S., et al. (2020). Ethics of Artificial Intelligence: An Analysis of Principles and Frameworks. AI and Ethics.
[43] Jobin, A., Ienca, M., & Vayena, E. (2019). The Global Landscape of AI Ethics Guidelines. Nature Machine Intelligence.
[44] Floridi, L., & Cowls, J. (2019). A Unified Framework of Five Principles for AI in Society. Harvard Data Science Review.
[45] Morley, J., Floridi, L., Kinsey, L., & Elhalal, A. (2020). From What to How: An Initial Review of Publicly Available AI Ethics Tools. Science and Engineering Ethics.
[46] Park, J., Shin, J., & Fung, P. (2022). Text-to-Image Translation Across Languages: Issues and Prospects. ACL.
[47] Zeng, Z., Liu, Z., & Wang, Y. (2023). Generating Cross-Cultural Images from Models. CVPR.
[48] Poole, D. L., & Mackworth, A. K. (2023). Artificial Intelligence: Foundations of Computational Agents (3rd ed.). Cambridge University Press.
[49] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
[50] Mildenhall, B., Srinivasan, P. P., Tancik, M., et al. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV.
[51] Schwarz, K., Liao, Y., & Geiger, A. (2020). GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis. NeurIPS.
[52] Poole, B., Jain, A., Barron, J. T., & Mildenhall, B. (2022). DreamFusion: Text-to-3D Using 2D Diffusion. arXiv preprint.
[53] Singer, U., Polyak, A., Hayes, T., et al. (2022). Make-A-Video: Text-to-Video Generation without Text-Video Data. arXiv preprint.
[54] Chen, T., Saxena, S., & Zhang, L. (2023). Generating Customized Images with Context-Sensitive Diffusion. ICCV.
[55] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
[56] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR.
[57] Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR.
[58] Szegedy, C., Liu, W., Jia, Y., et al. (2015). Going Deeper with Convolutions. CVPR.
[59] Russakovsky, O., Deng, J., Su, H., et al. (2015). ImageNet Large Scale Visual Recognition Challenge. IJCV.
[60] Lin, T.-Y., Maire, M., Belongie, S., et al. (2014). Microsoft COCO: Common Objects in Context. ECCV.
[61] Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual Captions: A Cleaned, Hypernymed, Image Alt-Text Dataset for Automatic Image Captioning. ACL.
[62] Changpinyo, S., Sharma, P., Ding, N., & Soricut, R. (2021). Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts. CVPR.
[63] Gafni, O., Polyak, A., Ashual, O., et al. (2022). Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. ECCV.
[64] Yu, J., Xu, Y., Koh, J. Y., et al. (2022). Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. TMLR.
[65] Gu, J., Meng, G., Xiang, S., & Pan, C. (2021). Generating Images from Text with Attention-Based Generative Adversarial Networks.
[66] Zhu, M., Pan, P., Chen, W., & Yang, Y. (2019). DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. CVPR.
[67] Tao, M., Tang, H., Wu, F., et al. (2020). DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis. ECCV.
[68] Crowson, K., Biderman, S., Kornis, D., et al. (2022). VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. ECCV.
[69] Jia, C., Yang, Y., Xia, Y., et al. (2021). Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. ICML.
[70] Xu, T., Zhang, P., Huang, Q., et al. (2018). AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. CVPR.
[71] Reed, S., Akata, Z., Yan, X., et al. (2016). Generative Adversarial Text to Image Synthesis. ICML.
[72] Hinz, T., Heinrich, S., & Wermter, S. (2020). Semantic Object Accuracy for Generative Text-to-Image Synthesis. IEEE TPAMI.
[73] Li, W., Xu, P., Zhao, X., et al. (2021). LayoutGAN: Creating Visual Designs with Neural Networks.
[74] Koh, J. Y., Baldridge, J., Lee, H., & Yang, Y. (2021). Text-to-Image Generation Grounded by Fine-Grained User Attention. WACV.
[75] Gal, R., Alaluf, Y., Atzmon, Y., et al. (2022). An Image Is Worth One Word: Personalizing Text-to-Image Generation Using Textual Inversion. ICLR.
[76] Avrahami, O., Lischinski, D., & Fried, O. (2022). Blended Diffusion for Text-Driven Editing of Natural Images. CVPR.
[77] Lugmayr, A., Danelljan, M., Romero, A., et al. (2022). RePaint: Inpainting Using Denoising Diffusion Probabilistic Models. CVPR.
[78] Hertz, A., Mokady, R., Tenenbaum, J., et al. (2022). Prompt-to-Prompt Image Editing with Cross Attention Control. ICLR.
[79] Kwon, G., & Ye, J. C. (2022). CLIPstyler: Image Style Transfer with a Single Text Condition. CVPR.
[80] Rombach, R., Esser, P., & Ommer, B. (2023). Stable Diffusion XL: Scaling Up Diffusion Models for High-Resolution Synthesis. CVPR.
[81] Saharia, C., Chan, W., Saxena, S., et al. (2022). Palette: Image-to-Image Diffusion Models. SIGGRAPH.
[82] Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. ICCV.
[83] Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-Image Translation with Conditional Adversarial Networks. CVPR.
[84] Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., et al. (2018). High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. CVPR.
[85] Liu, X., Zhang, C., & Liu, Z. (2023). Real-Time Text-to-Image Generation on Mobile Devices. MobiCom.
[86] Chen, X., Fang, H., Lin, T.-Y., et al. (2015). Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv preprint.
[87] Ordonez, V., Kulkarni, G., & Berg, T. L. (2011). Im2Text: Describing Images Using 1 Million Captioned Photographs. NeurIPS.
[88] Krishna, R., Zhu, Y., Groth, O., et al. (2017). Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV.
[89] Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. TACL.
[90] Xie, S., Girshick, R., Dollár, P., et al. (2017). Aggregated Residual Transformations for Deep Neural Networks. CVPR.
[91] Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML.
[92] Han, X., Zhang, Z., Ding, N., et al. (2021). Pre-Trained Models: Past, Present and Future. AI Open.
[93] Bommasani, R., & Liang, P. (2022). Holistic Evaluation of Language Models. arXiv preprint.
[94] Wei, J., Tay, Y., Bommasani, R., et al. (2022). Emergent Abilities of Large Language Models. TMLR.
[95] Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training Compute-Optimal Large Language Models. arXiv preprint.
Copyright © 2025 Saurav ., Dr. Naveen Kumar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET71539
Publish Date : 2025-05-23
ISSN : 2321-9653
Publisher Name : IJRASET