Document Type: Research Paper

Authors

1 Computer Engineering Department, College of Engineering, University of Mosul, Mosul, Iraq

2 Computer Science Department, College of Computer Science and Mathematics, University of Mosul, Mosul, Iraq

Abstract

Automatic Speech Recognition (ASR) is a challenging task, with ambient noise and the high variability of speech presenting the most severe problems, particularly noise introduced by speech impairments, whether caused by disability or by mispronunciation in children. Extracting noise-robust features that compensate for noise-induced speech degradation has remained a difficult challenge in recent years. This research investigates the impact of different wavelet generations for extracting speech features, and then tests the dataset produced by each technique with two deep learning models: a deep long short-term memory (LSTM) network and a hybrid convolutional neural network with long short-term memory (CNN-LSTM). The results show that the deep LSTM reached 93% accuracy on MFCC features, while the hybrid CNN-LSTM model reached 91%; 93% is the highest recorded accuracy, indicating that MFCC is the best feature extraction technique for our developed dataset.
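The MFCC front end referred to in the abstract can be illustrated with a minimal NumPy sketch: frame the waveform, take the power spectrum, apply a triangular mel filterbank, and decorrelate the log energies with a DCT. This is an illustrative implementation, not the authors' pipeline; the window length, hop size, and filter counts below are assumed values.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    # 1. Frame the signal with a Hann window (assumed 32 ms window, 10 ms hop)
    win = np.hanning(n_fft)
    frames = np.array([signal[s:s + n_fft] * win
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # 2. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Triangular mel filterbank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    # 4. Log mel energies, then DCT-II to obtain cepstral coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T

# One second of a synthetic 440 Hz tone just to exercise the pipeline
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(sig)
print(feats.shape)  # one 13-coefficient vector per frame
```

The resulting per-frame coefficient vectors are the kind of feature sequence that would be fed to an LSTM or CNN-LSTM classifier.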

Keywords

Main Subjects
