Image captioning requires an effective combination of visual feature extraction and natural language generation. This study compares four pre-trained vision models, Vision Transformer (ViT-B/16), ResNet-18, VGG-16, and DenseNet-121, used as frozen feature extractors in a prefix-based captioning framework with a partially trainable BERT-base-uncased text encoder. Experiments were conducted on a 32,000-image subset of the MS COCO 2017 captions dataset (28,000 training, 4,000 validation) under a limited training budget. Performance was evaluated using training cross-entropy loss. DenseNet-121 achieved the lowest final loss (0.2894), followed by VGG-16 (0.3198), ResNet-18 (0.3935), and ViT-B/16 (0.7002). DenseNet-121 produced the richest frozen features and adapted fastest, while ViT-B/16 converged slowest. These findings suggest that, in resource-constrained scenarios with frozen backbones, DenseNet-121 is the most effective choice among the evaluated architectures.
Introduction
This study investigates the performance of four pre-trained vision models—ViT-B/16, ResNet-18, VGG-16, and DenseNet-121—in an image captioning task under strict training constraints. With the rapid growth of online image data, automatic image captioning has become essential for reducing manual labeling efforts. However, challenges such as noisy data, large datasets, high computational demands, and long training times remain significant.
Unlike prior work that often fine-tunes full models or uses complex architectures (e.g., GANs or reinforcement learning approaches), this research focuses on a lightweight prefix-based captioning framework. Each vision model is used as a fully frozen feature extractor, and only a projection layer and a partially trainable BERT text encoder are trained. The objective is to determine which frozen vision backbone adapts fastest and achieves the lowest training loss under limited training conditions (3 epochs).
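As a rough illustration of this setup (a sketch only, not the authors' released code; unfreezing exactly the last two BERT encoder layers is an assumption, since the paper only states that BERT is partially trainable), the vision backbone can be frozen and a few top BERT layers left trainable like this:

```python
from torchvision import models
from transformers import BertModel

# Load a pre-trained CNN backbone and freeze every parameter.
backbone = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False

# Load BERT-base-uncased and keep only the last encoder layers trainable.
# Unfreezing two layers is an illustrative assumption.
bert = BertModel.from_pretrained("bert-base-uncased")
for p in bert.parameters():
    p.requires_grad = False
for layer in bert.encoder.layer[-2:]:
    for p in layer.parameters():
        p.requires_grad = True
```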
Literature Background
Traditional image captioning methods rely on RNNs trained with Maximum Likelihood Estimation (MLE), which suffer from exposure bias. GAN-based and reinforcement learning approaches were introduced to improve caption realism and alignment, but they bring training instability and reward-sparsity issues of their own. Despite these advances, image captioning still demands large computational resources and extensive fine-tuning.
This study instead evaluates the efficiency of frozen vision models within a simplified architecture, addressing a less explored question: Which pre-trained backbone performs best when kept completely frozen?
Methodology
Dataset: MS COCO 2017 (32,000 images; 28,000 train, 4,000 validation)
Preprocessing: Image resizing (224×224), normalization, no augmentation; captions cleaned and normalized
Model Framework (a minimal code sketch follows this list):
Frozen image encoder (ViT, ResNet, VGG, DenseNet)
Feature projection into text embedding space
Prefix token added to BERT-base-uncased
Linear classification head for next-token prediction
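The framework above can be wired together roughly as follows. This is a minimal sketch under stated assumptions: the prefix length, the 1024-dimensional DenseNet feature size, and all module names are illustrative, and partial unfreezing of BERT is handled as in the earlier snippet.

```python
import torch
import torch.nn as nn
from torchvision import models
from transformers import BertModel

class PrefixCaptioner(nn.Module):
    """Frozen CNN features are projected into BERT's embedding space and
    prepended as prefix tokens; a linear head scores the next token."""

    def __init__(self, prefix_len=1, hidden=768):
        super().__init__()
        # Frozen image encoder (DenseNet-121 shown; ResNet/VGG/ViT are drop-in swaps).
        cnn = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
        self.encoder = nn.Sequential(
            cnn.features, nn.ReLU(inplace=True), nn.AdaptiveAvgPool2d(1), nn.Flatten()
        )
        for p in self.encoder.parameters():
            p.requires_grad = False

        self.prefix_len = prefix_len
        # Project pooled image features (1024-d for DenseNet-121) into the text space.
        self.project = nn.Linear(1024, prefix_len * hidden)
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Linear classification head over BERT's vocabulary for next-token prediction.
        self.head = nn.Linear(hidden, self.bert.config.vocab_size)

    def forward(self, images, input_ids, attention_mask):
        b = images.size(0)
        prefix = self.project(self.encoder(images)).view(b, self.prefix_len, -1)
        # Prepend the image prefix to the caption token embeddings.
        tokens = self.bert.embeddings.word_embeddings(input_ids)
        embeds = torch.cat([prefix, tokens], dim=1)
        mask = torch.cat(
            [torch.ones(b, self.prefix_len, dtype=attention_mask.dtype,
                        device=attention_mask.device), attention_mask], dim=1
        )
        hidden = self.bert(inputs_embeds=embeds, attention_mask=mask).last_hidden_state
        # Return vocabulary logits for the caption positions only.
        return self.head(hidden[:, self.prefix_len:, :])
```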
Training Setup (see the training-loop sketch at the end of this section):
3 epochs
AdamW optimizer
Learning rate: 4 × 10??
Batch size: 24
Cross-entropy loss
Same hyperparameters and random seed for fairness
Performance is evaluated based on training loss after 3 epochs, with lower loss indicating faster adaptation.
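Under these settings, training reduces to standard next-token cross-entropy, sketched below. The data loader format (batches of images, token ids, and attention masks), padding id, and device handling are assumptions; the learning rate is left as a parameter to be set to the value listed above.

```python
import torch
import torch.nn as nn

def train(model, loader, lr, epochs=3, device="cuda"):
    """Train only the parameters left trainable (projection, unfrozen BERT
    layers, classification head) with AdamW and cross-entropy loss."""
    model.to(device).train()
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)  # 0 is BERT's [PAD] token id

    for epoch in range(epochs):
        running = 0.0
        for images, input_ids, attention_mask in loader:
            images = images.to(device)
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)

            logits = model(images, input_ids, attention_mask)
            # Shift by one position so that step t predicts caption token t + 1.
            loss = loss_fn(logits[:, :-1].reshape(-1, logits.size(-1)),
                           input_ids[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: mean cross-entropy {running / len(loader):.4f}")
```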
Results
After three epochs:
DenseNet-121: 0.2894 (lowest final loss, 1st place)
VGG-16: 0.3198 (2nd place)
ResNet-18: 0.3935 (3rd place)
ViT-B/16: 0.7002 (highest final loss, 4th place)
Key findings:
DenseNet-121 performed best, suggesting that its dense connectivity structure produces highly informative and transferable features, even when frozen.
VGG-16 outperformed the newer architectures, indicating that strong feature diversity can matter more than architectural complexity under limited training.
ViT-B/16 showed slow convergence, possibly due to its need for more fine-tuning and larger datasets to fully leverage transformer-based representations.
ResNet-18 had the fastest training time, making it a practical option when computational efficiency is the priority.
Conclusion
Under extremely limited training conditions, DenseNet-121 emerged as the most effective architecture for the prefix-based image captioning task. The overall ranking observed in this experimental setting is as follows:
DenseNet-121 > VGG-16 > ResNet-18 > ViT-B/16.
From a practical perspective, when computational resources or training time are severely constrained (for example, rapid prototyping, educational experiments, or low-budget environments), DenseNet-121 appears to be the most suitable frozen backbone among those evaluated. It is worth noting, however, that with longer training, larger datasets, and fine-tuning, more advanced architectures such as vision transformers are likely to outperform traditional CNN-based models.