Research Article

A COMPARATIVE STUDY OF BEAM AND GREEDY DECODING STRATEGIES FOR IMAGE CAPTIONING USING HYBRID VIT-LSTM AND LIGHTNING SEARCH ALGORITHM

by Chandra Sekhar Sanaboina, Girija Sankar Rotta
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 31
Published: August 2025
DOI: 10.5120/ijca2025925536

Chandra Sekhar Sanaboina, Girija Sankar Rotta. A Comparative Study of Beam and Greedy Decoding Strategies for Image Captioning Using Hybrid ViT-LSTM and Lightning Search Algorithm. International Journal of Computer Applications. 187, 31 (August 2025), 10-19. DOI=10.5120/ijca2025925536

@article{10.5120/ijca2025925536,
  author    = {Chandra Sekhar Sanaboina and Girija Sankar Rotta},
  title     = {A Comparative Study of Beam and Greedy Decoding Strategies for Image Captioning Using Hybrid ViT-LSTM and Lightning Search Algorithm},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {31},
  pages     = {10-19},
  doi       = {10.5120/ijca2025925536},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2025
%A Chandra Sekhar Sanaboina
%A Girija Sankar Rotta
%T A Comparative Study of Beam and Greedy Decoding Strategies for Image Captioning Using Hybrid ViT-LSTM and Lightning Search Algorithm
%J International Journal of Computer Applications
%V 187
%N 31
%P 10-19
%R 10.5120/ijca2025925536
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Image captioning, a task at the intersection of computer vision and natural language processing, generates descriptive textual captions for given images. This paper presents an optimized deep learning-based Image Captioning System (ICS) that employs a Vision Transformer (ViT) as the image feature extractor and a Long Short-Term Memory (LSTM) network as the language decoder. To further enhance model performance, the system incorporates the Lightning Search Algorithm (LSA), a nature-inspired metaheuristic, to automatically tune critical hyperparameters, including the learning rate, dropout rate, and number of LSTM units. This automated optimization strategy improves both the quality of the generated captions and the training performance. The proposed system is trained and evaluated on the Flickr30k dataset, achieving competitive performance on standard metrics such as BLEU, METEOR, and ROUGE. The results demonstrate that combining transformer-based vision encoders with recurrent language decoders, together with dynamic hyperparameter tuning, leads to more accurate and fluent image descriptions. This work contributes to the advancement of hybrid deep learning frameworks for image captioning tasks.
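The beam-versus-greedy comparison in the title can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: a toy probability table stands in for the LSTM decoder's per-step softmax output, and all names are illustrative.

```python
import math

# Toy next-token model: given a prefix (tuple of tokens), return
# {token: probability}. Stands in for the LSTM decoder's softmax.
def next_token_probs(prefix):
    table = {
        (): {"a": 0.5, "the": 0.5},
        ("a",): {"dog": 0.4, "cat": 0.6},
        ("the",): {"dog": 0.9, "cat": 0.1},
        ("a", "dog"): {"<eos>": 1.0},
        ("a", "cat"): {"<eos>": 1.0},
        ("the", "dog"): {"<eos>": 1.0},
        ("the", "cat"): {"<eos>": 1.0},
    }
    return table[prefix]

def greedy_decode(max_len=5):
    # At every step, commit to the single most likely next token.
    seq = ()
    for _ in range(max_len):
        probs = next_token_probs(seq)
        tok = max(probs, key=probs.get)  # ties resolve to the first key
        if tok == "<eos>":
            break
        seq += (tok,)
    return seq

def beam_decode(beam_width=2, max_len=5):
    # Keep the `beam_width` highest log-probability partial captions.
    # Each beam entry: (cumulative log-prob, sequence, finished?)
    beams = [(0.0, (), False)]
    for _ in range(max_len):
        candidates = []
        for logp, seq, done in beams:
            if done:
                candidates.append((logp, seq, True))
                continue
            for tok, p in next_token_probs(seq).items():
                if tok == "<eos>":
                    candidates.append((logp + math.log(p), seq, True))
                else:
                    candidates.append((logp + math.log(p), seq + (tok,), False))
        beams = sorted(candidates, key=lambda b: -b[0])[:beam_width]
        if all(done for _, _, done in beams):
            break
    return beams[0][1]  # best finished sequence
```

With these toy probabilities, greedy decoding commits to the tied first token "a" and ends with "a cat" (total probability 0.30), while width-2 beam search keeps both prefixes alive and recovers the higher-probability caption "the dog" (0.45), which is the behavioral gap the paper's comparison measures.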

Index Terms

Computer Science, Information Sciences
Keywords

Image captioning, Vision Transformer (ViT), Long Short-Term Memory (LSTM), Lightning Search Algorithm (LSA), Deep learning, Hyperparameter optimization, Natural language processing, Beam Search, Greedy Search
