Research Article

When Better Eyes Lead to Blindness: A Diagnostic Study of the Information Bottleneck in CNN-LSTM Image Captioning Models

by Hitesh Kumar Gupta
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 31
Published: August 2025
DOI: 10.5120/ijca2025925560

Hitesh Kumar Gupta. When Better Eyes Lead to Blindness: A Diagnostic Study of the Information Bottleneck in CNN-LSTM Image Captioning Models. International Journal of Computer Applications 187, 31 (August 2025), 1-9. DOI=10.5120/ijca2025925560

@article{10.5120/ijca2025925560,
  author    = {Hitesh Kumar Gupta},
  title     = {When Better Eyes Lead to Blindness: A Diagnostic Study of the Information Bottleneck in CNN-LSTM Image Captioning Models},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {31},
  pages     = {1-9},
  doi       = {10.5120/ijca2025925560},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}

%0 Journal Article
%D 2025
%A Hitesh Kumar Gupta
%T When Better Eyes Lead to Blindness: A Diagnostic Study of the Information Bottleneck in CNN-LSTM Image Captioning Models
%J International Journal of Computer Applications
%V 187
%N 31
%P 1-9
%R 10.5120/ijca2025925560
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Image captioning, situated at the intersection of computer vision and natural language processing, requires a sophisticated understanding of both visual scenes and linguistic structure. While modern approaches are dominated by large-scale Transformer architectures, this paper documents the systematic, iterative development of foundational image captioning models, progressing from a simple CNN-LSTM encoder-decoder to a competitive attention-based system. This paper presents a series of five models, beginning with Genesis and concluding with Nexus, an advanced model featuring an EfficientNetV2B3 backbone and a dynamic attention mechanism. The experiments chart the impact of architectural enhancements and demonstrate a key finding within the classic CNN-LSTM paradigm: merely upgrading the visual backbone without a corresponding attention mechanism can degrade performance, as the single-vector bottleneck cannot transmit the richer visual detail. This insight validates the architectural shift to attention. Trained on the MS COCO 2017 dataset, the final model, Nexus, achieves a BLEU-4 score of 31.4, surpassing several foundational benchmarks and validating the iterative design process. This work provides a clear, replicable blueprint for understanding the core architectural principles that underpin modern vision-language tasks.
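The abstract's core finding is that a richer CNN backbone helps only once the decoder can attend over a spatial grid of features rather than compress the image into one vector. The paper's exact attention formulation is not reproduced here; the sketch below illustrates the general idea with additive (Bahdanau-style) attention in NumPy. All names, weight matrices, and dimensions (`W_f`, `W_h`, `v`, a 7x7 = 49-location grid) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(features, hidden, W_f, W_h, v):
    """One decoding step of additive attention.

    features: (L, D) spatial grid from the CNN encoder (L locations).
    hidden:   (H,)   current LSTM decoder state.
    Returns a per-step context vector (D,) and the attention weights (L,).
    """
    # Score each spatial location against the current decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (L,)
    alpha = softmax(scores)                               # weights sum to 1
    context = alpha @ features                            # weighted mean of features
    return context, alpha

# Illustrative shapes: 49 locations, 64-d features, 32-d state, 16-d attention space.
rng = np.random.default_rng(0)
L, D, H, A = 49, 64, 32, 16
features = rng.standard_normal((L, D))
hidden = rng.standard_normal(H)
W_f = rng.standard_normal((D, A))
W_h = rng.standard_normal((H, A))
v = rng.standard_normal(A)

context, alpha = attention_context(features, hidden, W_f, W_h, v)
```

Because `context` is recomputed at every decoding step from all L locations, upgrading the encoder enlarges the information the decoder can draw on, rather than forcing it through a single fixed-size summary vector.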

References
  • S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, "Sequence level training with recurrent neural networks," arXiv preprint arXiv:1511.06732, 2015.
  • O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • K. Xu et al., "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
  • P. Anderson et al., "Bottom-up and top-down attention for image captioning and visual question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • M. Tan and Q. V. Le, "EfficientNetV2: Smaller models and faster training," in Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
  • M. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
  • P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations," Transactions of the Association for Computational Linguistics, 2014.
  • T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in European Conference on Computer Vision (ECCV), 2014.
  • K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002.
  • S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
  • R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
  • R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Multimodal neural language models," in Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
  • A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations (ICLR), 2021.
  • X. Li et al., "Oscar: Object-semantics aligned pre-training for vision-language tasks," in European Conference on Computer Vision (ECCV), 2020.
  • J. Li, D. Li, C. Xiong, and S. Hoi, "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation," in Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
  • J.-B. Alayrac et al., "Flamingo: a visual language model for few-shot learning," in Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • B. Peng et al., "RWKV: Reinventing RNNs for the transformer era," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
Index Terms
Computer Science
Information Sciences
Keywords

Image Captioning, Attention Mechanism, Information Bottleneck, Encoder-Decoder, CNN, RNN, LSTM, Spatial Encoder, MS COCO
