Best Wavelet Filter for a Wavelet Neural Fricatives Recognition System

Direct recognition of phonemes in speaker-independent speech recognition systems still cannot guarantee sufficiently good recognition results, but grouping phonemes first and then recognizing the individual phoneme is a promising approach. Wavelets, on the other hand, are widely used in speech and speaker recognition systems, motivated by the ability of wavelet coefficients to capture important time and frequency features. In this work the effect of the wavelet filter type on the efficiency of a phoneme recognition system is investigated, specifically for fricatives. A probabilistic neural network was used as the pattern matching stage for its well-known and powerful ability to solve classification problems. It was found that the Daubechies wavelet family (generally from db15 to db23) is a good candidate for a fricative phoneme recognition system that uses wavelets as its feature extraction stage.


Introduction
Automatic speech recognition (ASR) is the process by which a machine identifies speech. The machine takes a human utterance as input and returns a string of words, phrases, or continuous speech in the form of text as output. As ASR technology matures, the range of possible applications increases. However, a domain- and speaker-independent system able to correctly decode all speech found in communication between people into strings of words is not realistic with the current state of technology [1].
Each speaker differs from others in individual vocal tract characteristics, so the acoustic realization of the same word or utterance pronounced by different speakers can differ greatly. Even the same speaker cannot pronounce the same word or phrase identically several times. Phonemic speech recognition must therefore cope with large variation within the same phoneme, which degrades phoneme recognition accuracy [2].
Currently, the majority of speech recognition systems are based on template or pattern recognition principles and methods. The main idea of these methods is that templates of the phonemic units to be recognized are prepared first, and during the recognition stage they are compared with tested feature vectors to find the closest match. Phoneme recognition aims to find the phoneme class to which a part of the speech signal belongs. The simplest template-based phoneme classification algorithm compares the features describing part of the speech signal with the template parameters of each phoneme and then assigns it to the closest phoneme class under some selected criterion. Such recognition requires a relatively long time, and the errors within a group of similar phonemes differ from those outside such a group. These drawbacks can be partly lessened by a hierarchical phoneme recognition structure, in which the recognition process is divided into two steps:
1. The analyzed speech signal is assigned to one of the main groups of phonemes (vowel, semivowel, consonant, etc.).
2. Recognition inside this group of phonemes is then carried out to make the final decision.
Phonetic theory interprets phonemes as a tree-type phonetic hierarchy in which the nodes of the tree represent phonemes grouped into classes (vowels, consonants, etc.); an example of such a tree (for American English) is presented in Figure 1 [3].
Figure 1: The phoneme classification [3].
In this work a phoneme recognition system is built with wavelets as the feature-extracting front-end stage and a neural network for matching. The main aim of this work is to explore the influence of the type of wavelet filter on the recognition of phonemes, and hopefully to find the filter type best suited for this application. Experiments were performed on the largest set of consonants in the English language, the fricatives.

2. Motivation and Previous Work
Although many speech processing tasks, like speech and speaker recognition, reached satisfactory performance levels on specific applications, many problems remain an open research area.
Koizumi T. and others [4] used a structural phoneme recognition system. Feature vectors were obtained by filtering the short-term speech spectrum with 16 filters evenly spaced on the Bark scale. The classifier was realized using multilayer neural networks or RNNs (Recurrent Neural Networks). During the experiments, phonemes were grouped into six classes: voiced and unvoiced plosive consonants, voiced and unvoiced fricative consonants, nasal consonants, and vowels.
Abdelatty A. and others [5, 6] implemented a structural consonant recognition system. In this system classification is based on logical rules obtained from the analysis of phonetic-acoustic properties such as spectrum, magnitude, place of articulation, voicing, and duration. Experiments were performed using the TIMIT speech corpus. Phonemes were grouped into plosive consonants, fricative consonants, and affricates; further into voiced and unvoiced; and even further into labials, palatals, alveolars, etc.
Juneja V. and Espy W. [7] performed experiments with recordings from the TIMIT corpus, comparing the performance of an HMM-based approach and hierarchical classification methods. The speech signal was classified into five classes: silence, vowels, sonorants, fricative consonants, and plosive consonants.
As can be seen, phoneme recognition based on phonetic-acoustic knowledge is an applicable and promising method that could allow a higher overall speech recognition accuracy to be achieved.
Historically, LPC, LPCC, and MFCC speech features dominated the speech and speaker recognition fields in successive periods. Other features, such as PLP, ACW, and wavelet-based features, did not gain widespread practical use, often due to their relatively more sophisticated computation. Nowadays many earlier computational limitations have been overcome, which opens possibilities for re-evaluating the traditional choices of speech features for a specific task [8].
The wavelet transform is a promising non-linear tool for signal analysis that has been used widely in phoneme recognition systems. The indications are that the wavelet transform and its variants are useful in speech recognition due to their good feature localization, and furthermore because more accurate (non-linear) speech production models can be assumed. The analysis of the power in different frequency bands offers potential for distinguishing phonemes [9].
The main algorithm (the fast wavelet transform) dates back to the work of Stephane Mallat in 1988. Since then, research on wavelets has become international. Wavelets and wavelet packets have been widely used in speaker, speech, and phoneme recognition, as seen in several past works such as [8], [10], [11], and [12].
In the conclusion of a comparative study [13] between wavelets and the traditional, well-known Mel Frequency Cepstral Coefficients (MFCC), it is mentioned that using wavelets may bring potential benefits in automatic speech recognition. In that study, the well-known LFCC, MFCC, and PLP features, whose performance is well studied, were employed as reference points.
As a conclusion from all of the above, filter-bank design remains an open question in feature extraction for the front end of any speech/phoneme recognition system.
On the other hand, the use of Artificial Neural Networks (ANN) in general, and the Probabilistic Neural Network specifically, as a decision, template matching, or classification stage is found in much past work involving speaker, speech, and phoneme recognition systems [1], [15], [16], [17], and [18].

3. Phoneme Classification
In this section the acoustic-phonetic classification is discussed in general, with special attention to fricatives. The TIMIT database [19] is ideal for evaluating phone recognizers. It consists of a total of 6300 sentences recorded from 630 speakers. Most of the sentences were selected to achieve phonetic balance and were labeled at MIT. Lee K. & Hon H. [20] studied this data and identified a total of 64 possible phonetic labels. From this set, 48 phones were selected: all "Q" (glottal stop) labels were removed, and 15 allophones were identified and folded into the corresponding phones (Table 1) [20].
The fricatives form the largest set of consonants in the English language, which has nine standard fricative consonants: the voiceless fricatives, namely the labiodental /f/ as in leaf, the linguo-dental /th/ as in teeth, the alveolar /s/ as in lease, and the palatal /sh/ as in leash, together with their voiced cognates /v/ as in leave, /dh/ as in seethe, /z/ as in Lee's, and /zh/ as in azure. The ninth fricative is /h/, which is also considered a semivowel. These consonants can be distinguished by English-speaking listeners in identical phonetic contexts, regardless of whether those contexts are meaningful utterances or nonsense syllables. Therefore, the features needed for such discrimination can only reside in the acoustic signal [21].

4. Wavelet and Wavelet Packets
In most ASR solutions, filter banks are used to parameterize speech into acoustic features. Spectral analysis is the most appropriate method for extracting information from speech signals, and the DWT has been successfully used for the spectral analysis of data in many signal processing applications, including speech [8]. According to multi-resolution theory, any wavelet ψ that generates an orthogonal basis of L^2(R) is characterized, by means of a filter bank construction, by a pair of discrete filters: a high-pass filter (HPF) and a low-pass filter (LPF), each followed by sub-sampling by two to reduce redundancy. These filters belong to a particular class called conjugate mirror filters, and cascading them produces a fast discrete wavelet transform.
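The filter bank construction can be sketched in code. The following is a minimal illustration (not taken from the paper) of one analysis step of the fast DWT, using the conjugate mirror filter pair of a Daubechies wavelet as exposed by the PyWavelets library; the db4 choice, the signal length, and the full-convolution boundary handling are assumptions made for the sketch.

```python
import numpy as np
import pywt

def dwt_step(signal, wavelet_name="db4"):
    """One analysis step of the fast DWT: filter the signal with the
    conjugate mirror pair (low-pass, high-pass), then sub-sample by two."""
    w = pywt.Wavelet(wavelet_name)
    lo = np.convolve(signal, w.dec_lo)[1::2]   # approximation band
    hi = np.convolve(signal, w.dec_hi)[1::2]   # detail band
    return lo, hi

x = np.random.randn(256)
approx, detail = dwt_step(x)
# Cascading: feed `approx` into dwt_step again for the next level.
```

Repeating the step on the approximation band yields the dyadic cascade described above.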
Wavelet packet functions generalize the filter bank tree that relates wavelets and conjugate mirror filters. In the wavelet packet decomposition, the lower as well as the higher frequency bands are decomposed, giving a balanced binary tree structure, as illustrated in Figure 2. With each node of the tree a wavelet packet space W_j^p is associated, where j is the depth and p is the number of nodes to the left of that node at the same depth. Figure 2 illustrates the 8 wavelet packets W_3^p at depth j = 3 [8].
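As a concrete illustration of the depth-3 tree described above, the sketch below (using the PyWavelets library; the db4 filter and the signal length are arbitrary assumptions) decomposes a signal into the 2^3 = 8 packets W_3^p:

```python
import numpy as np
import pywt

# Full wavelet packet tree of depth 3: both the low and the high
# frequency bands are split at every level, giving 2**3 = 8 leaves W_3^p.
x = np.random.randn(512)
wp = pywt.WaveletPacket(data=x, wavelet="db4", mode="symmetric", maxlevel=3)
leaves = wp.get_level(3, order="freq")   # the 8 packets, ordered by band
print(len(leaves))                       # prints 8
```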

5. The Probabilistic Neural Network
Artificial neural networks (ANNs) are adaptive models with a network-like structure consisting of a large number of processing units, called neurons.
In the present work a special type of neural network is used, the Probabilistic Neural Network (PNN) (see [22] for details). The use of the PNN is motivated by its well-known, powerful classification characteristics; it is used here to classify the input phoneme segment after its features have been extracted.
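A minimal sketch of the PNN decision rule (a Parzen-window estimate with Gaussian kernels, in the spirit of [22]); the smoothing parameter sigma and the toy data are assumptions made for illustration:

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma=0.5):
    """PNN decision: a Gaussian (Parzen) kernel is centered on every
    training vector (pattern layer), kernel outputs are averaged per
    class (summation layer), and the densest class wins (decision layer)."""
    classes = np.unique(train_y)
    d2 = np.sum((train_X - x) ** 2, axis=1)             # squared distances
    k = np.exp(-d2 / (2.0 * sigma ** 2))                # pattern layer
    scores = [k[train_y == c].mean() for c in classes]  # summation layer
    return classes[int(np.argmax(scores))]              # decision layer

# Toy usage: two well-separated classes in 2-D.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])
print(pnn_classify(np.array([0.1, 0.1]), X, y))         # prints 0
```

Unlike back-propagation networks, a PNN needs no iterative training: the training vectors themselves form the pattern layer.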

Al-Rafidain Engineering, Vol. 19, No. 6, December 2011
Figure 3 shows the architecture of the probabilistic neural network used in this work.

6. Speech Corpus
The speech corpus used to find the best type of wavelet filter for the proposed phoneme recognition system is the standard American English TIMIT corpus provided by the Linguistic Data Consortium [19]. TIMIT is an acoustic-phonetic database comprising 6300 sentences from 630 speakers of English. The audio format is single-channel PCM quantized at 16 bits; the mean sentence duration is 3.28 s with a standard deviation of 1.52 s. From all the available data in the TIMIT corpus, two arbitrary subsets of speakers are used in this work: a male subset of 70 speakers and a female subset of 70 speakers. There are 10 speech files for each speaker; two of the files have the same linguistic content for all speakers, whereas the remaining eight are phonetically diverse.
For the evaluation of the proposed system, 10 speakers were selected arbitrarily from the TIMIT corpus: six were used for training and the other four for testing. First, phonemes were extracted from each speech file and grouped according to their type. As mentioned earlier, this work is concerned with the fricatives (/f/, /th/, /s/, /sh/, /v/, /dh/, and /z/); according to [20], /zh/ is folded into /sh/, so it was not included in this work.
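The phoneme extraction step can be sketched as follows. TIMIT ships a .PHN transcription per utterance with one "start end label" line per phone, with boundaries given in samples; the helper below is a hypothetical illustration, not the authors' code, and keeps only the seven fricatives used in this work.

```python
FRICATIVES = {"f", "th", "s", "sh", "v", "dh", "z"}

def extract_fricatives(phn_path):
    """Parse a TIMIT .PHN transcription (one 'start end label' line per
    phone, boundaries in samples) and keep only the fricative segments."""
    segments = []
    with open(phn_path) as fh:
        for line in fh:
            start, end, label = line.split()
            if label in FRICATIVES:
                segments.append((int(start), int(end), label))
    return segments
```

The returned sample ranges can then be used to cut the fricative waveforms out of the corresponding .WAV files.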

7. System Architecture for Phoneme Recognition
The proposed system has two main stages, as does any recognition system: first a preprocessing and feature extraction stage, which here is the wavelet packet transform, followed by a classification stage, the Probabilistic Neural Network (PNN). The procedure used to extract the features of the fricative phonemes, train the neural network, and finally test the system is described next.
The feature extraction procedure is the same for the training phase and the testing phase. Each phoneme file is applied to a wavelet packet tree of depth seven (j = 7), which provides a total of 128 frequency sub-bands. Due to the compact support of the wavelet, no Hamming or other window is required, and there is a single output from the wavelet tree every 8 ms because of the downsampling by two at every stage of the wavelet packet. The frequency resolution of the wavelet tree is 125 Hz (16000 Hz, the sampling frequency of the input speech signal, divided by 2^7); see Figure 4.
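The depth-7 analysis described above can be sketched with the PyWavelets packet transform; db21 is used here as one of the Daubechies candidates, and the trimming of band lengths (to discard filter-edge surplus coefficients) is an implementation assumption of the sketch.

```python
import numpy as np
import pywt

def wp_features(signal, wavelet="db21", depth=7):
    """Depth-7 wavelet packet analysis giving 2**7 = 128 sub-bands.
    Downsampling by two at each of the 7 levels leaves one coefficient
    per band every 2**7 = 128 samples, i.e. every 8 ms at 16 kHz."""
    wp = pywt.WaveletPacket(signal, wavelet=wavelet, mode="symmetric",
                            maxlevel=depth)
    bands = [node.data for node in wp.get_level(depth, order="freq")]
    n = min(len(b) for b in bands)              # trim filter-edge surplus
    return np.vstack([b[:n] for b in bands])    # shape (128, N)

feat = wp_features(np.random.randn(16000))      # one second at 16 kHz
print(feat.shape[0])                            # prints 128
```

Each column of the result is one 128-dimensional feature vector covering an 8 ms span of the phoneme.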
where W_7^p f(i) is the i-th coefficient of the wavelet packet transform of a signal f at node W_7^p of the wavelet packet tree. As a result, a matrix of 128 rows by N columns is obtained for each phoneme, where N depends on the duration of the phoneme file (N = duration in seconds / 8 ms). Each column of this matrix is a feature vector representing the phoneme. This N-vector feature matrix contains redundancy, which is removed by clustering. The clustering can be performed with any clustering algorithm; here the most popular and simplest one, the generalized Lloyd algorithm (GLA), is used. It is also known as the Linde-Buzo-Gray (LBG) algorithm, after its inventors, or as the K-means clustering algorithm. K-means clustering reduces the size of the matrix to 128 × 32. This overall process is repeated seven times, once for each of the phonemes (/f/, /th/, /s/, /sh/, /v/, /dh/, and /z/). The clustering algorithm is used in the training phase only. At this point seven matrices (one per phoneme) are obtained; they are concatenated to form one matrix used to train the PNN, which has 128 input nodes and 7 outputs, the number of phoneme classes (fricatives) used. The training process is then complete.
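The codebook reduction can be sketched with a plain K-means (Lloyd/LBG-style) pass in NumPy; this is an illustrative implementation, not the authors' code, reducing a hypothetical (128, N) feature matrix to the (128, 32) codebook mentioned above.

```python
import numpy as np

def kmeans_codebook(features, k=32, iters=20, seed=0):
    """Plain K-means (Lloyd/LBG-style): cluster the N column vectors of
    a (128, N) feature matrix into k centroids, returning a (128, k)
    codebook that removes the redundancy among the feature vectors."""
    rng = np.random.default_rng(seed)
    X = features.T                                        # rows = vectors
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                         # nearest centroid
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids.T                                    # shape (128, k)
```

Concatenating the seven per-phoneme codebooks then gives the 128 × 224 training matrix for the PNN.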
In the testing phase the phoneme speech files (7 files, one per phoneme /f/, /th/, /s/, /sh/, /v/, /dh/, and /z/) are passed through the same stages mentioned above, except for the clustering stage, to extract the features. The resulting feature matrix is fed to the neural network, which produces an output for each vector (column) of the matrix. The recognition rate is then found by dividing the number of correct recognitions by the total number of input vectors. Figure 5 illustrates the architecture of the proposed system for both the training phase and the testing phase.
At this point the system is complete. However, it is built around one particular wavelet filter used in constructing the wavelet packet tree. Keeping in mind that the main purpose of this work is to find the best wavelet filter for phoneme recognition systems, the whole procedure above is repeated for every filter under examination.
Figure 5: Architecture of the system used for phoneme recognition (training phase and testing phase).
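The per-filter sweep can be organized as in the skeleton below. The paper's exact 85-filter list is not reproduced in the text, so the candidate names here (Daubechies, Symlet, and Coiflet families from PyWavelets) and the train_and_test placeholder are assumptions.

```python
import pywt

# Hypothetical candidate list; the paper's exact 85-filter set is not given.
candidates = ([f"db{i}" for i in range(1, 39)]
              + [f"sym{i}" for i in range(2, 21)]
              + [f"coif{i}" for i in range(1, 6)])

def train_and_test(wavelet_name):
    """Placeholder for the full pipeline: wavelet packet features,
    K-means codebook, PNN training, then the frame recognition rate."""
    raise NotImplementedError

rates = {}
for name in candidates:
    assert name in pywt.wavelist()      # every candidate must exist
    # rates[name] = train_and_test(name)
```

Sorting `rates` by value would then rank the filters by recognition rate.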

8. Experiments and Results
After training the PNN, the network was first tested with the same training data to check the system. The recognition rates were found to be between 99.11% and 70.54%, as shown in Table 2. This procedure was carried out for every type of wavelet filter; as a result, the training and testing phases were repeated 85 times. The wavelet filters examined include the Daubechies family, among others. Table 2 shows that the training is sufficient. Next, the system was tested with new data not used during training. This testing was carried out for all types of wavelet filters to find the one giving the best recognition rate. The results of the testing phase are shown in Table 3.
Table 3: The recognition rates for the testing phase (continued).
The results show that the five best wavelet filters are Daubechies 21, 23, 22, 18, and 15. Another point to note is that the recognition rate is rather low, so the results for the five best filters were examined further, as seen in Table 4. Table 4 reveals three major problems: the phoneme /dh/ is falsely recognized as /th/, /f/ is falsely recognized as /th/, and /z/ is falsely recognized as /s/.

9. Conclusions
The effect of the type of wavelet filter on phoneme recognition in a system based on wavelets and a neural network was examined. The results show that the Daubechies wavelet family, generally from db15 to db23, is a good candidate for phoneme recognition systems that use wavelets as the feature extraction stage. The proposed system suffered from specific false recognitions of one phoneme as another (/dh/ as /th/, /f/ as /th/, and /z/ as /s/), which degraded its total recognition rate. Keeping in mind that the main goal of this work is to find the best wavelet filter, the results are still very useful for building any wavelet-based phoneme recognition system. Moreover, these false recognitions occur between similarly pronounced fricatives that in many words are easily pronounced as one another, depending on the speaker. If this is taken into consideration and these false values are added to the true values, then, for example in the db21 case, the total recognition rate can reach as high as 75.29%, an acceptable value compared with recent phoneme recognition systems (for example, 77% to 80% as in [21]).

10. References