International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 15
Published: June 2025
Authors: Md. Asraful Islam Khan, Syful Islam
Md. Asraful Islam Khan, Syful Islam. Multimodal Gesture Recognition using CNN-GCN-LSTM with RGB, Depth, and Skeleton Data. International Journal of Computer Applications. 187, 15 (June 2025), 19-26. DOI=10.5120/ijca2025925191
@article{10.5120/ijca2025925191,
  author    = {Md. Asraful Islam Khan and Syful Islam},
  title     = {Multimodal Gesture Recognition using CNN-GCN-LSTM with RGB, Depth, and Skeleton Data},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {15},
  pages     = {19-26},
  doi       = {10.5120/ijca2025925191},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article %D 2025 %A Md. Asraful Islam Khan %A Syful Islam %T Multimodal Gesture Recognition using CNN-GCN-LSTM with RGB, Depth, and Skeleton Data %J International Journal of Computer Applications %V 187 %N 15 %P 19-26 %R 10.5120/ijca2025925191 %I Foundation of Computer Science (FCS), NY, USA
Recognizing hand gestures is essential to human-computer interaction because it enables natural, intuitive control in virtual reality, robotics, and assistive technologies. In this work, we propose a novel multimodal fusion framework that integrates RGB images, depth information, and skeleton-based GCN features to improve gesture recognition under realistic, noisy data conditions. The architecture leverages MobileNetV3Small-based CNN backbones for visual feature extraction, GCNs for modeling skeletal relationships, and LSTM-attention modules for capturing temporal dynamics. Unlike previous works that rely on large curated datasets, our approach is evaluated on a challenging low-sample, high-noise dataset derived from real-world video recordings. Systematic ablation studies show that incorporating depth and skeleton features incrementally improves performance, validating the strength of our fusion strategy. Despite operating in a small, noisy data regime, the model achieves meaningful accuracy, and our analysis provides insight into modality-specific failure cases. The proposed system paves the way for robust gesture recognition solutions that can be deployed in real-world environments with minimal data preprocessing.
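The abstract describes a three-branch CNN-GCN-LSTM fusion pipeline. The sketch below illustrates one plausible reading of that design in PyTorch; the layer widths, joint count, skeleton adjacency, and additive attention form are illustrative assumptions, not details taken from the paper. MobileNetV3Small backbones encode per-frame RGB and depth images, a simple graph convolution encodes skeleton joints, and the concatenated per-frame features are aggregated over time by an LSTM with temporal attention.

```python
# Minimal sketch of a CNN-GCN-LSTM multimodal fusion model (assumed shapes/sizes).
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: degree-normalized adjacency aggregation + linear map."""
    def __init__(self, in_dim, out_dim, num_joints):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        # Placeholder fully-connected adjacency with self-loops; a real model
        # would use the hand-skeleton connectivity of the dataset.
        adj = torch.ones(num_joints, num_joints)
        self.register_buffer("adj_norm", adj / adj.sum(dim=1, keepdim=True))

    def forward(self, x):                       # x: (B*T, J, in_dim)
        return torch.relu(self.linear(self.adj_norm @ x))


class GestureFusionNet(nn.Module):
    def __init__(self, num_classes=10, num_joints=21, hidden=256):
        super().__init__()
        # Separate MobileNetV3Small backbones for RGB and depth; depth frames are
        # replicated to 3 channels so the stock first conv can be reused.
        self.rgb_cnn = mobilenet_v3_small(weights=None).features
        self.depth_cnn = mobilenet_v3_small(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d(1)
        cnn_dim = 576                                    # MobileNetV3Small feature width
        self.gcn = SimpleGCNLayer(3, 64, num_joints)     # each joint carries (x, y, z)
        fused_dim = cnn_dim * 2 + 64 * num_joints
        self.lstm = nn.LSTM(fused_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)                 # additive temporal attention
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, rgb, depth, skel):
        # rgb: (B, T, 3, H, W)  depth: (B, T, 1, H, W)  skel: (B, T, J, 3)
        B, T = rgb.shape[:2]
        rgb_f = self.pool(self.rgb_cnn(rgb.flatten(0, 1))).flatten(1)
        dep_f = self.pool(self.depth_cnn(depth.flatten(0, 1).repeat(1, 3, 1, 1))).flatten(1)
        skel_f = self.gcn(skel.flatten(0, 1)).flatten(1)
        fused = torch.cat([rgb_f, dep_f, skel_f], dim=1).view(B, T, -1)
        out, _ = self.lstm(fused)                        # (B, T, hidden)
        weights = torch.softmax(self.attn(out), dim=1)   # attention over time steps
        context = (weights * out).sum(dim=1)             # weighted temporal pooling
        return self.classifier(context)


if __name__ == "__main__":
    model = GestureFusionNet()
    rgb = torch.randn(2, 8, 3, 112, 112)
    depth = torch.randn(2, 8, 1, 112, 112)
    skel = torch.randn(2, 8, 21, 3)
    print(model(rgb, depth, skel).shape)                 # torch.Size([2, 10])
```

This layout also makes the ablations mentioned in the abstract straightforward to emulate: dropping the depth or skeleton branch (and shrinking `fused_dim` accordingly) yields the RGB-only and RGB+depth variants against which the full fusion is compared.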