International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 31
Published: August 2025
Authors: Chandra Sekhar Sanaboina, Girija Sankar Rotta
Chandra Sekhar Sanaboina, Girija Sankar Rotta . A COMPARATIVE STUDY OF BEAM AND GREEDY DECODING STRATEGIES FOR IMAGE CAPTIONING USING HYBRID VIT-LSTM AND LIGHTNING SEARCH ALGORITHM. International Journal of Computer Applications. 187, 31 (August 2025), 10-19. DOI=10.5120/ijca2025925536
@article{ 10.5120/ijca2025925536, author = { Chandra Sekhar Sanaboina and Girija Sankar Rotta }, title = { A COMPARATIVE STUDY OF BEAM AND GREEDY DECODING STRATEGIES FOR IMAGE CAPTIONING USING HYBRID VIT-LSTM AND LIGHTNING SEARCH ALGORITHM }, journal = { International Journal of Computer Applications }, year = { 2025 }, volume = { 187 }, number = { 31 }, pages = { 10-19 }, doi = { 10.5120/ijca2025925536 }, publisher = { Foundation of Computer Science (FCS), NY, USA } }
%0 Journal Article %D 2025 %A Chandra Sekhar Sanaboina %A Girija Sankar Rotta %T A COMPARATIVE STUDY OF BEAM AND GREEDY DECODING STRATEGIES FOR IMAGE CAPTIONING USING HYBRID VIT-LSTM AND LIGHTNING SEARCH ALGORITHM %J International Journal of Computer Applications %V 187 %N 31 %P 10-19 %R 10.5120/ijca2025925536 %I Foundation of Computer Science (FCS), NY, USA
Image captioning, a task at the intersection of computer vision and natural language processing, generates descriptive textual captions for given images. This paper presents an optimized deep learning-based Image Captioning System (ICS) that employs a Vision Transformer (ViT) as the image feature extractor and a Long Short-Term Memory (LSTM) neural network as the language decoder. To further enhance model performance, the system incorporates the Lightning Search Algorithm (LSA), a nature-inspired metaheuristic, to automatically tune critical hyperparameters, including the learning rate, dropout rate, and number of LSTM units. This automated optimization strategy improves both the quality of generated captions and the training performance. The proposed system is trained and evaluated on the Flickr30k dataset, achieving competitive performance on standard metrics such as BLEU, METEOR, and ROUGE. The results demonstrate that combining transformer-based vision encoders with recurrent language decoders, along with dynamic hyperparameter tuning algorithms, leads to more accurate and fluent image descriptions. This work contributes to the advancement of hybrid deep learning frameworks for image captioning tasks.
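The decoding comparison in the title can be illustrated with a minimal, self-contained sketch. The toy next-token table below merely stands in for the LSTM decoder's softmax output (the real model conditions on ViT image features); all vocabulary and probability values here are illustrative assumptions, not the paper's data. The example shows the typical behavior difference: greedy decoding commits to the locally best token at each step, while beam search keeps several hypotheses and can recover a globally more probable caption.

```python
import math

# Toy next-token distribution standing in for the LSTM decoder's softmax.
# Keys are partial captions (tuples of tokens); values map each candidate
# next token to its probability. Purely illustrative numbers.
TOY_LM = {
    (): {"a": 0.5, "the": 0.4, "<eos>": 0.1},
    ("a",): {"dog": 0.4, "cat": 0.3, "<eos>": 0.3},
    ("the",): {"dog": 0.9, "<eos>": 0.1},
    ("a", "dog"): {"<eos>": 1.0},
    ("a", "cat"): {"<eos>": 1.0},
    ("the", "dog"): {"<eos>": 1.0},
}

def greedy_decode(max_len=5):
    """Pick the single most probable token at each step."""
    seq = ()
    for _ in range(max_len):
        token = max(TOY_LM[seq], key=TOY_LM[seq].get)
        if token == "<eos>":
            break
        seq += (token,)
    return seq

def beam_decode(beam_width=2, max_len=5):
    """Keep the beam_width best partial captions by cumulative log-prob."""
    beams = [((), 0.0)]          # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, p in TOY_LM[seq].items():
                new_score = score + math.log(p)
                if token == "<eos>":
                    finished.append((seq, new_score))
                else:
                    candidates.append((seq + (token,), new_score))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    finished.extend(beams)
    return max(finished, key=lambda b: b[1])[0]

print(greedy_decode())   # -> ('a', 'dog'), total probability 0.5 * 0.4 = 0.20
print(beam_decode())     # -> ('the', 'dog'), total probability 0.4 * 0.9 = 0.36
```

On this toy distribution, greedy search is misled by the high-probability first token "a", whereas a beam of width 2 retains "the" long enough to find the overall more probable caption, which is why beam decoding often scores higher on metrics like BLEU at extra compute cost.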