Research Article

A Hybrid LSTM-CNN Approach for Multimodal Sentiment Analysis: Combining Text and Image Features

by Zannirah Muhammed Sammani, Mohammed Abo Rizka

International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 30
Published: August 2025
DOI: 10.5120/ijca2025925526

Zannirah Muhammed Sammani, Mohammed Abo Rizka. A Hybrid LSTM-CNN Approach for Multimodal Sentiment Analysis: Combining Text and Image Features. International Journal of Computer Applications 187, 30 (August 2025), 34-42. DOI=10.5120/ijca2025925526

                        @article{10.5120/ijca2025925526,
                          author    = { Zannirah Muhammed Sammani and Mohammed Abo Rizka },
                          title     = { A Hybrid LSTM-CNN Approach for Multimodal Sentiment Analysis: Combining Text and Image Features },
                          journal   = { International Journal of Computer Applications },
                          year      = { 2025 },
                          volume    = { 187 },
                          number    = { 30 },
                          pages     = { 34-42 },
                          doi       = { 10.5120/ijca2025925526 },
                          publisher = { Foundation of Computer Science (FCS), NY, USA }
                        }
                        %0 Journal Article
                        %D 2025
                        %A Zannirah Muhammed Sammani
                        %A Mohammed Abo Rizka
                        %T A Hybrid LSTM-CNN Approach for Multimodal Sentiment Analysis: Combining Text and Image Features
                        %J International Journal of Computer Applications
                        %V 187
                        %N 30
                        %P 34-42
                        %R 10.5120/ijca2025925526
                        %I Foundation of Computer Science (FCS), NY, USA
Abstract

An efficient deep learning framework is proposed for sentiment analysis that leverages both textual and visual modalities. The architecture integrates Long Short-Term Memory (LSTM) networks for capturing sequential dependencies in textual data with Convolutional Neural Networks (CNNs) for analyzing visual content. This multimodal fusion enhances sentiment classification accuracy. The model is assessed on two benchmark datasets—Memes and MVSA—and its performance is compared to traditional machine learning models such as Support Vector Machines and Logistic Regression, as well as the transformer-based VisualBERT. Although VisualBERT achieves slightly higher accuracy (83.18% on Memes and 81.29% on MVSA), the proposed approach delivers comparable results (77.70% and 80.42%, respectively) while maintaining a much lower computational footprint. This balance between performance and efficiency highlights the model’s practical value for applications where computational resources are limited or real-time analysis is required.
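The architecture described in the abstract (an LSTM text branch fused with a CNN image branch ahead of a shared classifier) can be sketched as follows. This is a minimal illustrative model, not the authors' implementation: the layer sizes, input dimensions, late-fusion-by-concatenation design, and all hyperparameters are assumptions made for the example.

```python
# Hedged sketch of an LSTM-CNN late-fusion sentiment classifier.
# All dimensions below are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class HybridLstmCnn(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, lstm_hidden=64,
                 cnn_channels=16, num_classes=2):
        super().__init__()
        # Text branch: embedding + LSTM over the token sequence.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, batch_first=True)
        # Image branch: two conv blocks + global average pooling.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, cnn_channels, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(cnn_channels, cnn_channels * 2, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fusion head: concatenated text + image features -> class logits.
        self.classifier = nn.Linear(lstm_hidden + cnn_channels * 2, num_classes)

    def forward(self, tokens, images):
        _, (h_n, _) = self.lstm(self.embed(tokens))  # final LSTM hidden state
        text_feat = h_n[-1]                          # (batch, lstm_hidden)
        img_feat = self.cnn(images).flatten(1)       # (batch, 2 * cnn_channels)
        return self.classifier(torch.cat([text_feat, img_feat], dim=1))

model = HybridLstmCnn()
tokens = torch.randint(0, 5000, (4, 20))  # batch of 4 texts, 20 token ids each
images = torch.randn(4, 3, 32, 32)        # batch of 4 RGB 32x32 images
logits = model(tokens, images)
print(logits.shape)  # torch.Size([4, 2])
```

Concatenating the two branch outputs before a single linear head is the simplest fusion scheme consistent with the abstract's description; attention-based fusion (as in several of the cited works) would replace the `torch.cat` step.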

References
  • Xu, C., Cetintas, S., Lee, K.-C., and Li, L.-J. 2014. Visual sentiment prediction with deep convolutional neural networks. arXiv:1411.5731v1.
  • Qiu, K., Zhang, Y., Zhao, J., Zhang, S., Wang, Q., and Chen, F. 2024. A multimodal sentiment analysis approach based on a joint chained interactive attention mechanism. Electronics, 13(1), 1922.
  • Al-Tameemi, I. K. S., Feizi-Derakhshi, M.-R., Pashazadeh, S., and Asadpour, M. 2024. A comprehensive review of visual–textual sentiment analysis from social media networks. Journal of Computational Social Science, 7(3), 2767–2838.
  • Sánchez Villegas, D., Preoțiuc-Pietro, D., and Aletras, N. 2024. Improving multimodal classification of social media posts by leveraging image-text auxiliary tasks. arXiv:2309.07794v2.
  • Liu, B. 2012. Sentiment analysis and opinion mining. Morgan & Claypool Publishers.
  • Dang, N. C., Moreno-García, M. N., and De la Prieta, F. 2020. Sentiment analysis based on deep learning: A comparative study. Data, 5(2), 35.
  • Aliman, G. B., Nivera, T. F. S., Olazo, J. C. A., Ramos, D. J. P., Sanchez, C. D. B., Amado, T. M., Arago, N. M., Jorda Jr., R. L., Virrey, G. C., and Valenzuela, I. C. 2022. Sentiment analysis using logistic regression.
  • Qixuan, Y. 2024. Three-class text sentiment analysis based on LSTM. Preprint submitted to Computer, Zhongnan University of Economics and Law.
  • You, Q., Jin, H., and Luo, J. 2017. Visual sentiment analysis by attending on local image regions. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • You, Q., Luo, J., Jin, H., and Yang, J. 2016. Building a large-scale dataset for image emotion recognition: The fine print and the benchmark. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • You, Q. and Luo, J. 2015. Robust image sentiment analysis using progressively trained and domain transferred deep networks. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Yang, J., She, D., and Sun, M. 2017. Joint image emotion classification and distribution learning via deep convolutional neural network. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17).
  • She, D., Yang, J., Cheng, M.-M., Lai, Y., Rosin, P., and Liang, W. 2020. WSCNet: Weakly supervised coupled networks for visual sentiment classification and detection. IEEE Transactions on Multimedia, 22(5), 1358–1371.
  • Chen, T., Borth, D., Darrell, T., and Chang, S.-F. 2014. DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME).
  • Jiang, T., Wang, J., Liu, Z., and Ling, Y. 2020. Fusion-extraction network for multimodal sentiment analysis. In H. W. Lauw et al. (Eds.), Proceedings of PAKDD 2020 (Vol. 12085, pp. 785–797). Springer. https://doi.org/10.1007/978-3-030-47436-2_59
  • Dao, P. Q., Roantree, M., Nguyen-Tat, T. B., and Ngo, V. M. 2024. Exploring Multimodal Sentiment Analysis Models: A Comprehensive Survey. Preprints. https://doi.org/10.20944/preprints202408.0127.v1
  • Luo, X. Y., Liu, J., Lin, P., and Fan, Y. 2021. Multimodal sentiment analysis based on deep learning: Recent progress. In Proceedings of The International Conference on Electronic Business (ICEB’21), Vol. 21, 293–303. Nanjing, China, December 3–7, 2021.
  • Hakimov, S., Cheema, G. S., and Ewerth, R. 2025. Processing multimodal information: Challenges and solutions for multimodal sentiment analysis and hate speech detection. In I. Marenzi et al. (Eds.), Event Analytics across Languages and Communities, 71–94. Springer. https://doi.org/10.1007/978-3-031-64451-1_4
  • Li, H., Lu, Y., and Zhu, H. 2024. Multi-modal sentiment analysis based on image and text fusion based on cross-attention mechanism. Electronics, 13(11), 2069. https://doi.org/10.3390/electronics13112069
  • Su, J., Liang, J., Zhu, J., and Li, Y. 2024. HCAM-CL: A novel method integrating a hierarchical cross-attention mechanism with CNN-LSTM for hierarchical image classification. Symmetry, 16(9), 1231. https://doi.org/10.3390/sym16091231
  • Arevalo, J., Montes-y-Gómez, M., Solorio, T., and González, F. A. 2017. Gated Multimodal Units for Information Fusion. arXiv:1702.01992v1 [stat.ML]. https://arxiv.org/abs/1702.01992
  • Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv:1908.03557v1 [cs.CV]. https://arxiv.org/abs/1908.03557
  • Shan, F., Liu, M., Zhang, M., and Wang, Z. 2024. Fake news detection based on cross-modal message aggregation and gated fusion network. Computers, Materials & Continua. https://doi.org/10.32604/cmc.2024.053937
  • Fields, C., and Kennington, C. 2023. Exploring transformers as compact, data-efficient language models. Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), 521–531.
  • Wei, L., Wang, Z., Xu, J., Shi, Y., Wang, Q., Shi, L., Tao, Y., and Gao, Y. 2023. A lightweight sentiment analysis framework for a micro-intelligent terminal. Sensors, 23(2), 741. https://doi.org/10.3390/s23020741
  • Pareek, P., Sharma, N., Ghosh, A., and Nagarohith, K. 2022. Sentiment analysis for Amazon product reviews using logistic regression model. Journal of Development Economics and Management Research Studies, 9(11), 29–42.
  • Suryawanshi, S., Chakravarthi, B. R., Arcan, M., and Buitelaar, P. 2020. Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, 32–41. Language Resources and Evaluation Conference (LREC 2020), Marseille, May 11–16, 2020. European Language Resources Association (ELRA).
  • Barnes, K., Juhász, P., Nagy, M., and Molontay, R. 2024. Topicality boosts popularity: A comparative analysis of NYT articles and Reddit memes. Social Network Analysis and Mining, 14, 119. https://doi.org/10.1007/s13278-024-01272-3
  • Schmidt, L., Talwar, K., Santurkar, S., and Tsipras, D. 2018. Adversarially robust generalization requires more data. Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
  • Hoffer, E., Hubara, I., and Soudry, D. 2017. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
  • Aleqabie, H. J., Sfoq, M. S., Albeer, R. A., and Abd, E. H. 2024. Review of text mining techniques: Trends and applications in various domains. International Journal of Computer Science and Management, 5(1), 1–9. https://doi.org/10.52866/ijcsm.2024.05.01.009
  • Kalra, V., and Aggarwal, R. 2018. Importance of text data preprocessing & implementation in RapidMiner. In Proceedings of the First International Conference on Information Technology and Knowledge Management, Vol. 14, 71–75. https://doi.org/10.15439/2018KM46
  • Vidyashree, K. P., and Rajendra, A. B. 2023. An improvised sentiment analysis model on Twitter data using stochastic gradient descent (SGD) optimization algorithm in stochastic gate neural network (SGNN). SN Computer Science, 4, 190. https://doi.org/10.1007/s42979-022-01607-x
  • Liu, C., Sheng, Y., Wei, Z., and Yang, Y. Q. 2018. Research of text classification based on improved TF-IDF algorithm. In Proceedings of the International Conference of Intelligent Robotic and Control Engineering. College of Information Science & Engineering, Ocean University of China.
  • Valente, J., António, J., Mora, C., and Jardim, S. 2023. Developments in image processing using deep learning and reinforcement learning. Journal of Imaging, 9(10), 207. https://doi.org/10.3390/jimaging9100207
  • Tachibana, Y., Obata, T., Kershaw, J., Sakaki, H., Urushihata, T., Omatsu, T., Kishimoto, R., and Higashi, T. 2019. The utility of applying various image preprocessing strategies to reduce the ambiguity in deep learning-based clinical image diagnosis. Magnetic Resonance in Medical Sciences, 19, 92–98. https://doi.org/10.2463/mrms.mp.2019-0021
  • Murcia-Gómez, D., Rojas-Valenzuela, I., and Valenzuela, O. 2022. Impact of image preprocessing methods and deep learning models for classifying histopathological breast cancer images. Applied Sciences, 12(22), 11375. https://doi.org/10.3390/app122211375
  • Guo, R., Wei, J., Sun, L., Yu, B., Chang, G., Liu, D., Zhang, S., Yao, Z., Xu, M., and Bu, L. 2024. A survey on advancements in image-text multimodal models: From general techniques to biomedical implementations. arXiv preprint arXiv:2309.15857.
  • Jiang, M., and Ji, S. 2022. Cross-modality gated attention fusion for multimodal sentiment analysis. arXiv preprint arXiv:2208.11893.
  • Bai, Y., Yang, E., Han, B., Yang, Y., Li, J., Mao, Y., Niu, G., and Liu, T. 2021. Understanding and Improving Early Stopping for Learning with Noisy Labels. arXiv preprint arXiv:2106.15853.
  • Ren, J. 2024. Multimodal sentiment analysis based on BERT and ResNet. School of Information and Engineering, Zhongnan University of Economics and Law. arXiv preprint arXiv:2412.03625v1.
  • Majumder, S., Aich, A., and Das, S. 2021. Sentiment analysis of people during the lockdown period of COVID-19 using SVM and logistic regression analysis.
  • Henderi and Siddique, Q. 2024. Comparative analysis of sentiment classification techniques on Flipkart product reviews: A study using logistic regression, SVC, random forest, and gradient boosting. Journal of Data Mining and Decision Making, 1(1), 4. https://doi.org/10.47738/jdmdc.v1i1.4
  • Chiny, M., Chihab, M., Chihab, Y., and Bencharef, O. 2021. LSTM, VADER, and TF-IDF based hybrid sentiment analysis model. International Journal of Advanced Computer Science and Applications, 12(7).
  • Ur Rehman, A., Malik, A. K., Raza, B., and Ali, W. 2019. A hybrid CNN-LSTM model for improving accuracy of movie reviews sentiment analysis. Multimedia Tools and Applications, 78, 26597–26613. https://doi.org/10.1007/s11042-019-07788-7
  • Meena, G., Mohbe, K. K., and Kumar, S. 2023. Sentiment analysis on images using convolutional neural networks-based Inception-V3 transfer learning approach. International Journal of Information Management Data Insights, 3, 100174.
  • Bart, M. P., Savino, N. J., Regmi, P., Cohen, L., Safavi, H., Shaw, H. C., Lohani, S., Searles, T. A., Kirby, B. T., Lee, H., and Glasser, R. T. 2022. Deep learning for enhanced free-space optical communications. arXiv. https://arxiv.org/abs/2208.07712
  • Kour, H., and Gupta, M. K. 2022. An hybrid deep learning approach for depression prediction from user tweets using feature-rich CNN and bi-directional LSTM. Multimedia Tools and Applications, 81, 23649–23685. https://doi.org/10.1007/s11042-022-12648-y
  • Vaydande, R. 2022. Retinal Fundus Image Classification using LSTM - Convolution Neural Network. MSc Research Project, Data Analytics. National College of Ireland, School of Computing. Supervisor: Vladimir Milosavljevic.
  • Li, T., Hua, M., and Wu, X. 2020. A Hybrid CNN-LSTM Model for Forecasting Particulate Matter (PM2.5). IEEE Access, Special Section on Feature Representation and Learning Methods with Applications in Large-Scale Biological Sequence Analysis. https://doi.org/10.1109/ACCESS.2020.2971348
  • Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  • Bandyopadhyay, D., Hasanuzzaman, M., and Ekbal, A. 2024. Seeing through VisualBERT: A causal adventure on memetic landscapes. arXiv preprint arXiv:2410.13488.
  • Yang, H., Zhao, Y., Wu, Y., Wang, S., Zheng, T., Zhang, H., Ma, Z., Che, W., and Qin, B. 2024. Large language models meet text-centric multimodal sentiment analysis: A survey. arXiv:2406.08068v2 [cs.CL]. https://arxiv.org/abs/2406.08068
Index Terms
Computer Science
Information Sciences
Keywords

Convolutional Neural Networks (CNNs), Deep Learning, Hybrid Models, Long Short-Term Memory (LSTM), Multimodal Sentiment Analysis
