Abstract
Image captioning is a complex interdisciplinary task that merges computer vision and natural language processing to produce coherent and contextually meaningful descriptions of visual content. This research focuses on the development of a custom transformer-based model aimed at addressing the limitations of traditional captioning approaches, particularly in terms of semantic accuracy and contextual relevance. The proposed architecture incorporates a pre-trained convolutional neural network (CNN) for effective image feature extraction, followed by a transformer-based decoder for generating natural language descriptions. To assess the effectiveness of the model, a comparative evaluation is conducted against a widely used LSTM-based captioning framework. Experiments are carried out on the Flickr8k dataset, with performance measured using BLEU scores. Results indicate that the transformer-based approach offers notable improvements in the quality and relevance of generated captions, demonstrating its potential for practical applications in areas such as media content analysis, e-commerce, and assistive technologies.
Introduction
Overview
This research explores the development of an advanced image captioning framework aimed at converting visual data into meaningful textual descriptions. Such systems have applications in accessibility (e.g., for visually impaired users) and content recommendation. The framework integrates modern image processing techniques with natural language generation, using Transformer and LSTM-based models enhanced with attention mechanisms.
Motivation & Background
Generating descriptive captions is a complex multimodal task combining computer vision and natural language processing (NLP).
Early methods relied on CNN-LSTM encoder-decoder models but struggled to capture fine-grained image details.
Attention mechanisms (e.g., Xu et al.) allowed models to focus on relevant image regions.
Transformer-based architectures, such as the Meshed-Memory Transformer and Vision Transformers (ViT), improved contextual modeling and scalability.
Models like CLIP have enabled powerful vision-language representations, enhancing generalization and reducing dependence on large labeled datasets.
Proposed Methodology
1. Transformer-Based Model (Implemented in PyTorch)
Architecture: Uses an encoder-decoder structure.
Encoder: A pre-trained CNN (e.g., InceptionV3 or ResNet) extracts visual features.
Decoder: Multi-layer Transformer decoder with self-attention and encoder-decoder attention.
Caption Generation: The decoder generates captions one word at a time, conditioning on the previously generated words and the image features (see the architecture sketch below).
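Below is a minimal PyTorch sketch of this encoder-decoder design, assuming a ResNet-50 backbone, a 512-dimensional model, a 4-layer decoder, and greedy decoding; these choices, the module names, and the special-token ids are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Pre-trained CNN that maps an image to a grid of region features."""

    def __init__(self, d_model=512):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the average pool and classification head; keep spatial features.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.project = nn.Linear(backbone.fc.in_features, d_model)

    def forward(self, images):                       # (B, 3, H, W)
        feats = self.features(images)                # (B, 2048, h, w)
        feats = feats.flatten(2).permute(0, 2, 1)    # (B, h*w, 2048)
        return self.project(feats)                   # (B, h*w, d_model)


class TransformerCaptioner(nn.Module):
    """Transformer decoder with self-attention and encoder-decoder attention."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4, max_len=40):
        super().__init__()
        self.encoder = CNNEncoder(d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):             # captions: (B, T) token ids
        memory = self.encoder(images)
        tgt = self.embed(captions) + self.pos[:, :captions.size(1)]
        # Causal mask: each position may attend only to earlier words.
        T = captions.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf"),
                                     device=captions.device), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)                      # (B, T, vocab_size)

    @torch.no_grad()
    def generate(self, image, bos_id=1, eos_id=2, max_len=40):
        """Greedy decoding: emit one word at a time from the previous words."""
        memory = self.encoder(image.unsqueeze(0))
        ids = torch.tensor([[bos_id]], device=image.device)
        for _ in range(max_len - 1):
            tgt = self.embed(ids) + self.pos[:, :ids.size(1)]
            hidden = self.decoder(tgt, memory)
            next_id = self.out(hidden[:, -1]).argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
            if next_id.item() == eos_id:
                break
        return ids.squeeze(0)
```

The causal mask mirrors the word-by-word generation described above: during training the full ground-truth caption is processed in parallel, but each position can only attend to earlier tokens.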
Training:
Dataset: Flickr8k (8,000 images with multiple captions).
Preprocessing: Resizing images, normalizing pixels, tokenizing captions, building a vocabulary.
Training Strategy: Used cross-entropy loss, Adam optimizer, and teacher forcing.
Evaluation: Caption quality measured using BLEU scores (a training and evaluation sketch follows below).
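As a rough illustration of the training strategy and evaluation above, the sketch below assumes the TransformerCaptioner from the previous listing, a DataLoader named train_loader yielding (images, caption_ids) batches, and NLTK's corpus_bleu; the vocabulary size, learning rate, and padding-token id are illustrative assumptions.

```python
import torch
import torch.nn as nn
from nltk.translate.bleu_score import corpus_bleu

PAD = 0                                              # assumed padding-token id
model = TransformerCaptioner(vocab_size=8000)
criterion = nn.CrossEntropyLoss(ignore_index=PAD)    # padding tokens are ignored
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)


def train_epoch(train_loader):
    model.train()
    for images, captions in train_loader:
        # Teacher forcing: the ground-truth caption shifted by one position
        # serves as both the decoder input and the prediction target.
        inputs, targets = captions[:, :-1], captions[:, 1:]
        logits = model(images, inputs)               # (B, T, vocab)
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def bleu(references, hypotheses):
    # references: one list of tokenised reference captions per image;
    # hypotheses: one tokenised generated caption per image.
    return corpus_bleu(references, hypotheses)
```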
Results:
Achieved BLEU score of 0.463.
Generated longer and more contextually accurate captions.
Sample Captions (a decoding example follows this list):
“Two men are posing for picture”
“A boy in a red shirt is standing on a rock overlooking a stream”
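For illustration, a caption string like the samples above could be produced from the sketched model roughly as follows; image_tensor and the idx2word vocabulary mapping are hypothetical names for artefacts of the preprocessing step.

```python
# Map generated token ids back through the vocabulary built during preprocessing.
ids = model.generate(image_tensor)                          # (T,) token ids
caption = " ".join(idx2word[i.item()] for i in ids[1:-1])   # strip <bos>/<eos>
print(caption)   # e.g. "a boy in a red shirt is standing on a rock ..."
```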
2. LSTM-Based Model with Attention (Implemented in TensorFlow)
Architecture: Encoder-decoder with an attention mechanism.
Attention Layer: Computes context vectors to focus on relevant image regions.
Decoder: An LSTM consumes the context vector and the previously generated words to produce captions (see the decoder sketch at the end of this subsection).
Training:
Preprocessing and dataset setup similar to those used for the Transformer model.
Used cross-entropy loss and Adam optimizer.
Challenges:
Computational resource limitations during training.
Results:
Produced concise and accurate captions, particularly for simpler images.
Examples:
“Man in red shirt is climbing rock”
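A minimal Keras/TensorFlow sketch of this attention-based decoder is given below, assuming additive (Bahdanau-style) attention over a grid of CNN region features, e.g. InceptionV3's 8×8×2048 output reshaped to 64 regions; the layer sizes and names are assumptions for illustration, not the exact settings used in training.

```python
import tensorflow as tf


class BahdanauAttention(tf.keras.layers.Layer):
    """Scores each image region against the decoder state and returns a
    weighted context vector."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):                 # features: (B, 64, F)
        hidden = tf.expand_dims(hidden, 1)            # (B, 1, H)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden)))
        weights = tf.nn.softmax(scores, axis=1)       # (B, 64, 1)
        context = tf.reduce_sum(weights * features, axis=1)   # (B, F)
        return context, weights


class LSTMDecoder(tf.keras.Model):
    """Predicts the next word from the attention context and the previous word."""

    def __init__(self, vocab_size, embed_dim=256, units=512):
        super().__init__()
        self.attention = BahdanauAttention(units)
        self.embed = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.lstm = tf.keras.layers.LSTM(units, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, word_ids, features, hidden, cell):
        # word_ids: (B, 1) previous word; hidden/cell: LSTM state (zeros at step 0).
        context, weights = self.attention(features, hidden)
        x = self.embed(word_ids)                                   # (B, 1, E)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)    # (B, 1, F+E)
        output, hidden, cell = self.lstm(x, initial_state=[hidden, cell])
        return self.fc(output), hidden, cell, weights
```

At each decoding step the attention weights re-focus on different image regions, which is what allows the model to ground words such as "red shirt" or "rock" in specific parts of the image.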
Key Observations
Transformer Model:
Excelled at handling long-range dependencies.
Better at generating detailed and coherent captions.
More scalable and suitable for real-world deployment.
LSTM with Attention:
Simpler to implement and interpret.
Effective in low-resource or educational settings.
Performs well on simpler images but is less accurate for complex scenes.
Conclusion
This research work investigates the development of two distinct image captioning models: one utilizing the transformer architecture and the other incorporating an LSTM with an attention mechanism. Both models successfully generated meaningful captions for images, highlighting the potential of deep learning in interpreting and describing visual content. The transformer-based model, leveraging its self-attention mechanism, offered an efficient method for sequence generation, while the LSTM model with attention effectively concentrated on relevant image features, enhancing the overall quality of the captions.
Looking forward, there is considerable potential to enhance the current models. One promising area for improvement is the integration of the Vision Transformer (ViT) model. Specifically designed for visual tasks, the ViT leverages a transformer-based architecture known for its robustness and effectiveness in handling complex image data. While our project concluded in the early stages of experimenting with the ViT, future research will aim to explore its capabilities further. The ViT has demonstrated potential in capturing detailed image features, which could significantly improve captioning accuracy, presenting an exciting direction for future advancements in image captioning.
Additionally, addressing the limitations of the current dataset and improving the model's generalization capabilities by expanding dataset diversity and enhancing computational resources are crucial for advancing the field. Future work may also involve exploring advanced image captioning evaluation metrics.
In conclusion, this research work successfully demonstrated the use of transformer and LSTM-based models for image captioning. The exploration of more robust models, such as ViT, coupled with enhanced computational resources, promises to significantly improve caption quality and generalization across diverse datasets.