International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 15
Published: June 2025
Authors: Md. Asraful Islam Khan, Syful Islam
Md. Asraful Islam Khan, Syful Islam. Multimodal Gesture Recognition using CNN-GCN-LSTM with RGB, Depth, and Skeleton Data. International Journal of Computer Applications. 187, 15 (June 2025), 19-26. DOI=10.5120/ijca2025925191
@article{10.5120/ijca2025925191,
  author    = {Md. Asraful Islam Khan and Syful Islam},
  title     = {Multimodal Gesture Recognition using CNN-GCN-LSTM with RGB, Depth, and Skeleton Data},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {15},
  pages     = {19-26},
  doi       = {10.5120/ijca2025925191},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article %D 2025 %A Md. Asraful Islam Khan %A Syful Islam %T Multimodal Gesture Recognition using CNN-GCN-LSTM with RGB, Depth, and Skeleton Data %J International Journal of Computer Applications %V 187 %N 15 %P 19-26 %R 10.5120/ijca2025925191 %I Foundation of Computer Science (FCS), NY, USA
Recognizing hand gestures is essential to human-computer interaction because it enables natural, intuitive control in virtual reality, robotics, and assistive technologies. In this work, we propose a novel multimodal fusion framework that integrates RGB images, depth information, and skeleton-based GCN features to improve gesture recognition under realistic, noisy data conditions. The architecture leverages MobileNetV3Small-based CNN backbones for visual feature extraction, GCNs for modeling skeletal relationships, and LSTM-attention modules for capturing temporal dynamics. Unlike previous works that rely on large curated datasets, our approach is evaluated on a challenging low-sample, high-noise dataset derived from real-world video recordings. Systematic ablation studies show that incorporating depth and skeleton features incrementally improves performance, validating the strength of our fusion strategy. Despite operating in a small, noisy data regime, the model achieves meaningful accuracy, and our analysis provides insight into modality-specific failure cases. The proposed system paves the way for robust gesture recognition solutions that can be deployed in real-world environments with minimal data preprocessing.
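The abstract describes a three-branch CNN-GCN-LSTM fusion pipeline. The sketch below illustrates one plausible reading of that design in PyTorch; the layer widths, joint count, skeleton adjacency, and additive attention form are illustrative assumptions, not details taken from the paper. MobileNetV3Small backbones encode per-frame RGB and depth images, a simple graph convolution encodes skeleton joints, and the concatenated per-frame features are aggregated over time by an LSTM with temporal attention.

```python
# Minimal sketch of a CNN-GCN-LSTM multimodal fusion model (assumed shapes/sizes).
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: degree-normalized adjacency aggregation + linear map."""
    def __init__(self, in_dim, out_dim, num_joints):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        # Placeholder fully-connected adjacency with self-loops; a real model
        # would use the hand-skeleton connectivity of the dataset.
        adj = torch.ones(num_joints, num_joints)
        self.register_buffer("adj_norm", adj / adj.sum(dim=1, keepdim=True))

    def forward(self, x):                       # x: (B*T, J, in_dim)
        return torch.relu(self.linear(self.adj_norm @ x))


class GestureFusionNet(nn.Module):
    def __init__(self, num_classes=10, num_joints=21, hidden=256):
        super().__init__()
        # Separate MobileNetV3Small backbones for RGB and depth; depth frames are
        # replicated to 3 channels so the stock first conv can be reused.
        self.rgb_cnn = mobilenet_v3_small(weights=None).features
        self.depth_cnn = mobilenet_v3_small(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        cnn_dim = 576                                    # MobileNetV3Small feature width
        self.gcn = SimpleGCNLayer(3, 64, num_joints)     # each joint carries (x, y, z)
        fused_dim = cnn_dim * 2 + 64 * num_joints
        self.lstm = nn.LSTM(fused_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)                 # additive temporal attention
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, rgb, depth, skel):
        # rgb: (B, T, 3, H, W)  depth: (B, T, 1, H, W)  skel: (B, T, J, 3)
        B, T = rgb.shape[:2]
        rgb_f = self.pool(self.rgb_cnn(rgb.flatten(0, 1))).flatten(1)
        dep_f = self.pool(self.depth_cnn(depth.flatten(0, 1).repeat(1, 3, 1, 1))).flatten(1)
        skel_f = self.gcn(skel.flatten(0, 1)).flatten(1)
        fused = torch.cat([rgb_f, dep_f, skel_f], dim=1).view(B, T, -1)
        out, _ = self.lstm(fused)                        # (B, T, hidden)
        weights = torch.softmax(self.attn(out), dim=1)   # attention over time steps
        context = (weights * out).sum(dim=1)             # weighted temporal pooling
        return self.classifier(context)


if __name__ == "__main__":
    model = GestureFusionNet()
    rgb = torch.randn(2, 8, 3, 112, 112)
    depth = torch.randn(2, 8, 1, 112, 112)
    skel = torch.randn(2, 8, 21, 3)
    print(model(rgb, depth, skel).shape)                 # torch.Size([2, 10])
```

This layout also makes the ablations mentioned in the abstract straightforward to emulate: dropping the depth or skeleton branch (and shrinking `fused_dim` accordingly) yields the RGB-only and RGB+depth variants against which the full fusion is compared.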