International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 177 - Issue 16
Published: Nov 2019
Authors: Francisco Carlos M. Souza, Alinne C. Correa Souza, Carolina Y. V. Watanabe, Patricia Pupin Mandrá, Alessandra Alaniz Macedo
Francisco Carlos M. Souza, Alinne C. Correa Souza, Carolina Y. V. Watanabe, Patricia Pupin Mandrá, Alessandra Alaniz Macedo. An Analysis of Visual Speech Features for Recognition of Non-articulatory Sounds using Machine Learning. International Journal of Computer Applications. 177, 16 (Nov 2019), 1-9. DOI=10.5120/ijca2019919393
@article{ 10.5120/ijca2019919393, author = { Francisco Carlos M. Souza, Alinne C. Correa Souza, Carolina Y. V. Watanabe, Patricia Pupin Mandrá, Alessandra Alaniz Macedo }, title = { An Analysis of Visual Speech Features for Recognition of Non-articulatory Sounds using Machine Learning }, journal = { International Journal of Computer Applications }, year = { 2019 }, volume = { 177 }, number = { 16 }, pages = { 1-9 }, doi = { 10.5120/ijca2019919393 }, publisher = { Foundation of Computer Science (FCS), NY, USA } }
%0 Journal Article %D 2019 %A Francisco Carlos M. Souza %A Alinne C. Correa Souza %A Carolina Y. V. Watanabe %A Patricia Pupin Mandrá %A Alessandra Alaniz Macedo %T An Analysis of Visual Speech Features for Recognition of Non-articulatory Sounds using Machine Learning %J International Journal of Computer Applications %V 177 %N 16 %P 1-9 %R 10.5120/ijca2019919393 %I Foundation of Computer Science (FCS), NY, USA
People with articulation and phonological disorders need exercises to produce speech sounds. Typically, such exercises start with the production of non-articulatory sounds in clinics or homes, environments that contain a wide variety of ambient sounds, i.e., noisy locations. Speech recognition systems treat these environmental sounds as background noise, which can lead to unsatisfactory speech recognition. This study assesses a system that aggregates visual features with audio features during the recognition of non-articulatory sounds in noisy environments. The Mel-Frequency Cepstrum Coefficients method and the Laplace transform were used to extract audio features, a Convolutional Neural Network to extract video features, a Support Vector Machine to recognize audio, and Long Short-Term Memory networks for video recognition. Experimental results report the accuracy, recall, and precision of the system on a set of 585 sounds. Overall, the results indicate that video information can complement audio recognition and assist non-articulatory sound recognition.
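The first step of the audio pipeline, Mel-Frequency Cepstrum Coefficient (MFCC) extraction, can be sketched in plain NumPy. This is a minimal illustrative sketch of the standard MFCC recipe (framing, power spectrum, Mel filterbank, log, DCT); the function name and default parameters here are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_coeffs=13):
    """Compute MFCCs for a mono signal: frame -> power spectrum
    -> Mel filterbank -> log -> type-II DCT."""
    frame_len, hop = n_fft, n_fft // 2
    # Frame the signal with 50% overlap (tail samples are dropped)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Per-frame power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the Mel scale
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies, then DCT to decorrelate; keep n_coeffs
    log_energy = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs),
                                  (2 * n + 1) / (2.0 * n_mels)))
    return log_energy @ dct.T
```

The resulting matrix (one row of coefficients per frame) is the kind of audio feature vector that could then be fed to a classifier such as an SVM, as in the paper's pipeline.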