In recent years, computer vision has advanced rapidly in the field of image processing. Image captioning, which involves automatically generating one or more sentences that describe an image's visual content, has benefited directly from these advances in image recognition. In this paper we aim to show that a combination of existing methods can efficiently improve image captioning performance. The approach generates meaningful captions by combining computer vision and natural language processing: Convolutional Neural Networks (CNNs) extract image features, and Long Short-Term Memory (LSTM) networks with attention mechanisms produce coherent sentences. A model trained on the Flickr8k and Flickr30k datasets produces the captions, and the system is implemented in Python using TensorFlow and related deep learning frameworks. The main use case of this research is to help visually impaired people understand their surroundings; it can also be used in hospitals to support patients with neurological conditions. The paper reviews prior research, enhances an existing model, and discusses advantages, disadvantages, and future scope.
Introduction
Image caption generation enables computers to automatically generate descriptive sentences for images by combining computer vision and natural language processing (NLP). Computer vision identifies the objects and scenes in an image, while NLP produces a grammatically correct caption. Deep learning techniques are commonly used: Convolutional Neural Networks (CNNs) for feature extraction and Long Short-Term Memory (LSTM) networks for sequential text generation. Models often build on pretrained architectures such as VGG16 and are trained on datasets such as Flickr8k.
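As a concrete illustration of the feature-extraction step, the snippet below is a minimal sketch that loads a pretrained VGG16 in Keras, removes its classification layer, and keeps the 4096-dimensional fc2 output for each image. The file paths and the feature-dictionary layout are illustrative assumptions rather than the paper's exact pipeline; attention-based variants instead keep a spatial feature map from the last convolutional block.

```python
# Minimal sketch: extract a 4096-dimensional VGG16 feature vector per image.
# Paths and the feature-dictionary layout are illustrative assumptions.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Drop the final softmax layer and keep the fc2 layer as the image descriptor.
base = VGG16(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

def extract_features(image_path):
    img = load_img(image_path, target_size=(224, 224))   # VGG16 input size
    x = img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))       # add batch dim, normalise
    return encoder.predict(x, verbose=0)[0]               # shape (4096,)

# Hypothetical usage over a local Flickr8k copy:
# features = {name: extract_features(f"Flickr8k/Images/{name}") for name in image_names}
```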
Literature review highlights:
Canonical CNN-LSTM models for basic captioning.
Attention mechanisms to focus on relevant image regions.
Training objectives using maximum likelihood or reinforcement learning (e.g., Self-Critical Sequence Training).
Datasets such as Flickr8k and MS COCO for evaluation.
Evolution toward transformer-based models replacing LSTM and CNN for improved performance.
Comparison of supervised learning vs reinforcement/GAN approaches.
Variants of encoder-decoder and compositional architectures for generating captions.
The research objectives focus on designing a system that generates human-like, accurate, and relevant captions; analyzing existing models; extracting image features with CNNs; integrating an attention mechanism; and evaluating performance with metrics such as BLEU.
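To make the BLEU objective concrete, the following is a small scoring sketch using NLTK's corpus_bleu; NLTK and the toy tokenized captions are assumptions, not part of the paper's stated toolchain.

```python
# Hedged sketch of BLEU scoring for generated captions using NLTK
# (NLTK and the toy captions are assumptions, not the paper's toolchain).
from nltk.translate.bleu_score import corpus_bleu

# Each image has several human reference captions and one generated caption.
references = [[
    ["a", "dog", "runs", "across", "the", "grass"],
    ["a", "brown", "dog", "is", "running", "outside"],
]]
candidates = [["a", "dog", "is", "running", "on", "the", "grass"]]

# BLEU-1 to BLEU-4, the weightings commonly reported for Flickr8k models.
print("BLEU-1:", corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print("BLEU-2:", corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
print("BLEU-3:", corpus_bleu(references, candidates, weights=(1/3, 1/3, 1/3, 0)))
print("BLEU-4:", corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))
```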
The proposed methodology uses an Encoder-Attention-Decoder framework (a minimal training sketch follows this list):
Encoder: Pretrained CNN extracts visual features.
Attention: Focuses on important regions during decoding.
Decoder: LSTM or Transformer generates captions sequentially.
Training: Cross-entropy loss with teacher forcing, optionally reinforced with BLEU reward.
Evaluation: Automatic metrics (BLEU, METEOR) and human validation.
Deployment: Model served via an API for real-time caption generation (a minimal serving sketch appears at the end of this section).
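The sketch below illustrates one possible realisation of this framework in tf.keras: a Bahdanau-style attention layer over spatial CNN features, an LSTM-cell decoder, and a teacher-forced training step with cross-entropy loss. The hyperparameters, the (batch, 49, 512) feature shape, and all class and function names are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative tf.keras sketch of the encoder-attention-decoder framework.
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, UNITS = 5000, 256, 512   # assumed hyperparameters

class BahdanauAttention(tf.keras.layers.Layer):
    """Scores each spatial image region against the current LSTM state."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):            # features: (B, 49, 512)
        hidden = tf.expand_dims(hidden, 1)       # (B, 1, units)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden)))
        weights = tf.nn.softmax(scores, axis=1)  # attention over 49 regions
        context = tf.reduce_sum(weights * features, axis=1)   # (B, 512)
        return context, weights

class Decoder(tf.keras.Model):
    """One decoding step: previous word + attended image context -> next word."""
    def __init__(self):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.cell = tf.keras.layers.LSTMCell(UNITS)
        self.attention = BahdanauAttention(UNITS)
        self.fc = tf.keras.layers.Dense(VOCAB_SIZE)

    def call(self, word_ids, states, features):
        context, _ = self.attention(features, states[0])
        x = tf.concat([self.embedding(word_ids), context], axis=-1)
        output, states = self.cell(x, states)
        return self.fc(output), states

def train_step(decoder, optimizer, features, captions):
    """Teacher forcing: feed the ground-truth word at each step and apply
    cross-entropy loss to the prediction of the following word."""
    batch = captions.shape[0]
    states = [tf.zeros((batch, UNITS)), tf.zeros((batch, UNITS))]
    loss = 0.0
    with tf.GradientTape() as tape:
        for t in range(captions.shape[1] - 1):
            logits, states = decoder(captions[:, t], states, features)
            loss += tf.reduce_mean(
                tf.keras.losses.sparse_categorical_crossentropy(
                    captions[:, t + 1], logits, from_logits=True))
    grads = tape.gradient(loss, decoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, decoder.trainable_variables))
    return loss
```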
Model components:
CNN: Extracts image features efficiently.
LSTM: Decodes features into sequential text while retaining context.
VGG16: Popular CNN for detailed feature extraction.
Encoder-Decoder: Encodes the input image (and, during training, the partial caption) into a fixed-length vector and decodes it into a sentence (a greedy decoding sketch follows this list).
Libraries used: Keras, TensorFlow, Pillow, NumPy, Tqdm for image processing, model training, and evaluation.
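At inference time the decoder is run one word at a time, feeding each predicted word back in until an end token is produced. The greedy-decoding sketch below assumes the Decoder, UNITS, tokenizer, and "startseq"/"endseq" markers from the training sketch above; beam search is a common alternative to the greedy choice shown here.

```python
# Greedy decoding sketch: generate a caption one word at a time.
# `decoder`, UNITS, the tokenizer, and the "startseq"/"endseq" markers are
# assumptions carried over from the training sketch above.
import tensorflow as tf

def generate_caption(decoder, tokenizer, features, max_len=30):
    states = [tf.zeros((1, UNITS)), tf.zeros((1, UNITS))]
    word_id = tf.constant([tokenizer.word_index["startseq"]])   # assumed start token
    words = []
    for _ in range(max_len):
        logits, states = decoder(word_id, states, features)
        next_id = int(tf.argmax(logits, axis=-1)[0])             # greedy choice
        word = tokenizer.index_word.get(next_id, "")
        if word == "endseq":                                     # assumed end token
            break
        words.append(word)
        word_id = tf.constant([next_id])
    return " ".join(words)

# Hypothetical usage: caption = generate_caption(decoder, tokenizer, image_features)
```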
The system is designed to provide meaningful captions for a wide range of images, supporting applications in accessibility, content organization, SEO, digital libraries, and education.
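For the deployment step listed in the methodology, a minimal serving sketch with Flask is shown below; Flask, the /caption endpoint, and the two helper functions are assumptions, and the extracted feature shape must match whatever the trained decoder expects.

```python
# Minimal Flask sketch of the serving API mentioned in the methodology.
# Flask, the /caption endpoint, and the helpers are assumptions; the feature
# shape returned by extract_features must match the trained decoder's input.
import tempfile
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/caption", methods=["POST"])
def caption():
    upload = request.files["image"]                     # multipart file upload
    with tempfile.NamedTemporaryFile(suffix=".jpg") as tmp:
        upload.save(tmp.name)
        feats = extract_features(tmp.name)              # CNN encoder (sketch above)
    text = generate_caption(decoder, tokenizer, feats)  # LSTM decoder (sketch above)
    return jsonify({"caption": text})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```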
Conclusion
In this paper, we have discussed the design and implementation of an image caption generator built from Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks. The CNN was used to extract visual features from the input images, while the LSTM generated grammatically correct captions from those features.
The model bridges the gap between computer vision and natural language processing by converting visual information into meaningful sentences. Adding the attention mechanism improved the model's ability to focus on relevant image regions, which in turn improved caption accuracy. In the future, the model can be further improved by training on larger datasets to increase caption accuracy and contextual understanding, and the generated captions can make visual content accessible to a wider audience.
Stronger visual backbones such as EfficientNet or Vision Transformers, together with Transformer-based language models such as BERT or GPT, could further improve caption quality and efficiency. The model can also be integrated with IoT devices for the visually impaired and with real-time surveillance systems, making it a valuable tool for accessibility and smart automation. Running image captioning in real time on device would make the system more efficient and practical for real-world applications.