Research Article

AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM

by Rekha S. Kotwal, Geetanjali Jindal
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 2
Published: May 2025
DOI: 10.5120/ijca2025924807

Rekha S. Kotwal and Geetanjali Jindal. AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM. International Journal of Computer Applications, 187, 2 (May 2025), 72-81. DOI=10.5120/ijca2025924807

@article{10.5120/ijca2025924807,
  author    = {Rekha S. Kotwal and Geetanjali Jindal},
  title     = {AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {2},
  pages     = {72-81},
  doi       = {10.5120/ijca2025924807},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2025
%A Rekha S. Kotwal
%A Geetanjali Jindal
%T AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM
%J International Journal of Computer Applications
%V 187
%N 2
%P 72-81
%R 10.5120/ijca2025924807
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The objective of this project is to develop a deep learning (DL)-based speech emotion detection system that can identify and categorize emotional states such as happiness and sadness. To capture spatial and temporal patterns in the audio input, the system uses mel-spectrogram features, which are processed by a hybrid model combining convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). The efficacy of pre-trained models in this field is further demonstrated by fine-tuning the transformer-based Wav2Vec2 model for emotion classification. The proposed methods identify speech emotions accurately, making them useful for customer service, healthcare, and human-computer interaction.
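The hybrid pipeline described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the layer sizes (64 mel bins, 128 convolutional channels, 64 LSTM units) and the two-class output are assumptions chosen for the sketch, and the 1D convolutions operate along the time axis of a precomputed mel-spectrogram.

```python
# Illustrative sketch of a 1D CNN-LSTM over mel-spectrogram features
# (assumed shapes; not the published model).
import torch
import torch.nn as nn

class CNNLSTMEmotion(nn.Module):
    def __init__(self, n_mels=64, n_classes=2):
        super().__init__()
        # 1D convolutions treat the mel bins as input channels and slide
        # along the time axis, capturing local spectral patterns.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # The LSTM then models longer-range temporal structure.
        self.lstm = nn.LSTM(128, 64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):           # x: (batch, n_mels, time)
        h = self.cnn(x)             # (batch, 128, time // 2)
        h = h.transpose(1, 2)       # (batch, time // 2, 128)
        _, (hn, _) = self.lstm(h)   # final hidden state summarizes the clip
        return self.fc(hn[-1])      # (batch, n_classes) emotion logits

model = CNNLSTMEmotion()
logits = model(torch.randn(8, 64, 100))  # 8 clips, 64 mel bins, 100 frames
print(logits.shape)                      # torch.Size([8, 2])
```

In practice the mel-spectrogram input would come from a feature extractor such as `librosa.feature.melspectrogram`, and the logits would feed a cross-entropy loss during training.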

References
  • M. Xu, F. Zhang, and W. Zhang, “Head fusion: Improving the accuracy and robustness of speech emotion recognition on the IEMOCAP and RAVDESS dataset,” IEEE Access, vol. 9, pp. 74539–74549, 2021.
  • M. Farooq, F. Hussain, N. K. Baloch, F. R. Raja, H. Yu, and Y. B. Zikria, “Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network,” Sensors, vol. 20, art. no. 6008, 2020. https://doi.org/10.3390/s20216008
  • K. Aghajani and I. E. P. Afrakoti, “Speech emotion recognition using scalogram based deep structure,” Int. J. Eng., vol. 33, no. 2, pp. 285–292, 2020.
  • Mustaqeem and S. Kwon, “A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition,” Sensors, vol. 20, art. no. 183, 2020. https://doi.org/10.3390/s20010183
  • D. Issa, M. F. Demirci, and A. Yazici, “Speech emotion recognition with deep convolutional neural networks,” Biomed. Signal Process. Control, vol. 59, art. no. 101894, 2020. https://doi.org/10.1016/j.bspc.2020.101894
  • H. Sak, A. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” 2014, arXiv:1402.1128.
  • K. K. Kishore and P. K. Satish, “Emotion recognition in speech using MFCC and wavelet features,” in Proc. IEEE 3rd Int. Adv. Comput. Conf., 2013, pp. 842–847.
  • A. A. Alnuaim, M. Zakariah, P. K. Shukla, A. Alhadlaq, W. A. Hatamleh, H. Tarazi, R. Sureshbabu, and R. Ratna, “Human-Computer Interaction for Recognizing Speech Emotions Using Multilayer Perceptron Classifier,” Neural Comput. Appl., 2023.
  • S. Upadhyay, V. Kumar, and R. Singh, “Cross-corpus Speech Emotion Recognition using Self-supervised Learning Models,” IEEE Trans. Affect. Comput., vol. 14, no. 2, pp. 489–500.
  • W. Chen, J. Wu, Z. Zhang, and Y. Wang, “Deep learning-based speech emotion recognition with multi-scale feature fusion,” Neural Networks, vol. 136, pp. 20–30.
  • B. Liu, J. Tao, Z. Lian, and Z. Wen, “Exploiting Label Dependency for Speech Emotion Recognition Using Graph Neural Networks,” IEEE Trans. Affect. Comput., vol. 13, no. 4, pp. 1849–1862.
  • Y. Yang et al., “Attention-based Convolutional Recurrent Neural Networks for Speech Emotion Recognition,” IEEE Trans. Affect. Comput., vol. 13, no. 2, pp. 1016–1027.
  • W.-C. Tsai et al., “Multimodal Speech Emotion Recognition with Transformer-based Audio-Text Fusion,” in Proc. Interspeech 2022, pp. 2340–2344.
  • L. Feng et al., “Contrastive Learning for Speech Emotion Recognition,” in Proc. IEEE ICASSP, pp. 11201–11205.
Index Terms
Computer Science
Information Sciences
Keywords

Mel-spectrogram, deep learning, speech emotion detection, CNN, LSTM, Wav2Vec2, emotion classification, human-computer interaction.
