Research Article

AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM

by Rekha S. Kotwal, Geetanjali Jindal
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 2
Published: May 2025
DOI: 10.5120/ijca2025924807

Rekha S. Kotwal and Geetanjali Jindal. AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM. International Journal of Computer Applications, 187, 2 (May 2025), 72-81. DOI=10.5120/ijca2025924807

@article{10.5120/ijca2025924807,
  author    = {Rekha S. Kotwal and Geetanjali Jindal},
  title     = {AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM},
  journal   = {International Journal of Computer Applications},
  year      = {2025},
  volume    = {187},
  number    = {2},
  pages     = {72-81},
  doi       = {10.5120/ijca2025924807},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2025
%A Rekha S. Kotwal
%A Geetanjali Jindal
%T AI Powered Speech Recognition System using Wavelet Multi-Resolution Analysis with One-Dimensional CNN-LSTM
%J International Journal of Computer Applications
%V 187
%N 2
%P 72-81
%R 10.5120/ijca2025924807
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The objective of this project is to develop a deep learning (DL)-based speech emotion detection system that can identify and categorize emotional states such as happiness and sadness. To capture spatial and temporal patterns in the audio input, the system uses mel-spectrogram features, which are processed by a hybrid model combining convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). The efficacy of pre-trained models in this field is further demonstrated by fine-tuning the transformer-based Wav2Vec2 model for emotion classification. The proposed methods identify speech emotions accurately, making them useful for customer service, healthcare, and human-computer interaction.
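The hybrid pipeline described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the layer sizes (64 mel bins, 128 convolutional channels, 64 LSTM units) and the two-class output are assumptions chosen for the sketch, and the 1D convolutions operate along the time axis of a precomputed mel-spectrogram.

```python
# Illustrative sketch of a 1D CNN-LSTM over mel-spectrogram features
# (assumed shapes; not the published model).
import torch
import torch.nn as nn

class CNNLSTMEmotion(nn.Module):
    def __init__(self, n_mels=64, n_classes=2):
        super().__init__()
        # 1D convolutions treat the mel bins as input channels and slide
        # along the time axis, capturing local spectral patterns.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # The LSTM then models longer-range temporal structure.
        self.lstm = nn.LSTM(128, 64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):           # x: (batch, n_mels, time)
        h = self.cnn(x)             # (batch, 128, time // 2)
        h = h.transpose(1, 2)       # (batch, time // 2, 128)
        _, (hn, _) = self.lstm(h)   # final hidden state summarizes the clip
        return self.fc(hn[-1])      # (batch, n_classes) emotion logits

model = CNNLSTMEmotion()
logits = model(torch.randn(8, 64, 100))  # 8 clips, 64 mel bins, 100 frames
print(logits.shape)                      # torch.Size([8, 2])
```

In practice the mel-spectrogram input would come from a feature extractor such as `librosa.feature.melspectrogram`, and the logits would feed a cross-entropy loss during training.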

References
  • M. Xu, F. Zhang, and W. Zhang, “Head fusion: Improving the accuracy and robustness of speech emotion recognition on the IEMOCAP and RAVDESS dataset,” IEEE Access, vol. 9, pp. 74539–74549, 2021.
  • M. Farooq, F. Hussain, N. K. Baloch, F. R. Raja, H. Yu, and Y. B. Zikria, “Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network,” Sensors, vol. 20, art. no. 6008, 2020. https://doi.org/10.3390/s20216008
  • K. Aghajani and I. E. P. Afrakoti, “Speech emotion recognition using scalogram based deep structure,” Int. J. Eng., vol. 33, no. 2, pp. 285–292, 2020.
  • Mustaqeem and S. Kwon, “A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition,” Sensors, vol. 20, art. no. 183, 2020. https://doi.org/10.3390/s20010183
  • D. Issa, M. F. Demirci, and A. Yazici, “Speech emotion recognition with deep convolutional neural networks,” Biomed. Signal Process. Control, vol. 59, art. no. 101894, 2020. https://doi.org/10.1016/j.bspc.2020.101894
  • H. Sak, A. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” 2014, arXiv:1402.1128.
  • K. K. Kishore and P. K. Satish, “Emotion recognition in speech using MFCC and wavelet features,” in Proc. IEEE 3rd Int. Adv. Comput. Conf., 2013, pp. 842–847.
  • A. A. Alnuaim, M. Zakariah, P. K. Shukla, A. Alhadlaq, W. A. Hatamleh, H. Tarazi, R. Sureshbabu, and R. Ratna, “Human-Computer Interaction for Recognizing Speech Emotions Using Multilayer Perceptron Classifier,” Neural Comput. Appl., 2023.
  • S. Upadhyay, V. Kumar, and R. Singh, “Cross-corpus Speech Emotion Recognition using Self-supervised Learning Models,” IEEE Trans. Affect. Comput., vol. 14, no. 2, pp. 489–500.
  • W. Chen, J. Wu, Z. Zhang, and Y. Wang, “Deep learning-based speech emotion recognition with multi-scale feature fusion,” Neural Networks, vol. 136, pp. 20–30.
  • B. Liu, J. Tao, Z. Lian, and Z. Wen, “Exploiting Label Dependency for Speech Emotion Recognition Using Graph Neural Networks,” IEEE Trans. Affect. Comput., vol. 13, no. 4, pp. 1849–1862.
  • Y. Yang et al., “Attention-based Convolutional Recurrent Neural Networks for Speech Emotion Recognition,” IEEE Trans. Affect. Comput., vol. 13, no. 2, pp. 1016–1027.
  • W.-C. Tsai et al., “Multimodal Speech Emotion Recognition with Transformer-based Audio-Text Fusion,” in Proc. Interspeech 2022, pp. 2340–2344.
  • L. Feng et al., “Contrastive Learning for Speech Emotion Recognition,” in Proc. IEEE ICASSP, pp. 11201–11205.
Index Terms
Computer Science
Information Sciences
Keywords

Mel-spectrogram, deep learning, speech emotion detection, CNN, LSTM, Wav2Vec2, emotion classification, human-computer interaction.
