ARABIC SPEAKER RECOGNITION SYSTEM USING GAUSSIAN MIXTURE MODEL AND EM ALGORITHM

Arabic is a Semitic language that poses particular difficulties compared with English and other languages. In this paper, an Arabic speaker recognition system is developed that identifies an Arabic speaker immediately after an utterance. Voice samples were recorded, and pre-processing with voice activity detection separated the voiced parts from the unvoiced parts. Framing and a sliding rectangular window were used to segment the Arabic speech signals, followed by Mel Frequency Cepstral Coefficients (MFCC) for feature extraction. The feature vectors of each spoken sample are grouped using the VQLBG algorithm, and a Gaussian Mixture Model (GMM) is applied to classify and recognize an unknown speaker from his uttered words, which belong to a specific cluster that differs from the clusters of other Arabic speakers. This approach achieved a recognition rate of 95.5%.


I. INTRODUCTION
Arabic is a Semitic language that poses particular difficulties compared with English. Some of the difficulties that the Arabic language presents to a speech recognition system are described in detail in the literature [1], [2], [15], [41], [42]. Arabic has an inherent mismatch between its spoken and written forms. Standard Arabic has 34 basic phonemes, of which six are vowels and 28 are consonants [30]. Arabic has fewer vowels than English: it has three long and three short vowels, while American English has at least 12 vowels [41]. Arabic phonemes include two distinctive classes, called pharyngeal and emphatic phonemes, which are found only in Semitic languages such as Hebrew [42]. Great difficulties arise when several speakers with different dialects are to be recognized, because the lack of standardization and rules causes spoken Arabic to vary considerably from one region to another. The Arabic used in daily informal communication is not the same form of Arabic that is used in books, magazines, newspapers and television news broadcasts. The pronunciation of an isolated Arabic letter differs from the pronunciation of the same letter when it is connected within a word. The lack of spoken and written training data is one of the main issues encountered by Arabic ASR researchers. These problems can be minimized by restricting the number of speakers and words, by working in good acoustic conditions, and by avoiding the complexities of fluent speech and working with Modern Standard Arabic to overcome dialect differences [1], [15], [16], [28], [29].
Arabic speaker recognition has important applications in daily life, since there is a need for controlled access to certain information or places for security. For instance, users may have to speak a PIN (Personal Identification Number) to open a laboratory door, or speak their credit card number over the telephone to verify their identity. Possible applications of biometric systems include user-interface customization and access control, such as airport check-in, building access control, telephone banking and remote credit card purchases. Speech technology offers many possibilities for personal identification that are natural and non-intrusive [1], [2], [30].
Speech recognition has had a technological impact on society and is expected to flourish further in the area of human-machine interaction. By checking the voice characteristics of the input utterance with an automatic speaker recognition system similar to the one described in this paper, a system can add an extra level of security, among other applications.
A conversation between people carries a great deal of information beyond the communication of ideas. Speech also conveys information such as the gender, emotion, attitude, health and identity of a speaker. The desire for more secure identification systems has led to research in the field of biometric recognition. Biometric features fall into two main classes. Behavioral characteristics, such as voice and signature, are the result of body part movements. In the case of voice, they reflect the physical properties of the voice production organs; the articulatory process and the resulting speech are never exactly the same, even when the same person utters the same sentence. Physiological characteristics refer to the actual physical properties of a person, such as fingerprint, iris and hand geometry measurements [29], [30], [31].
Different approaches can be used in speech recognition such as HMM, ANN, SVM, GMM, Fuzzy Logic, hybrid systems and Combined Classifiers. The topic of this paper deals with speaker recognition that refers to the task of recognizing people by their voices [20], [22].
The remainder of this paper is organized as follows: Section II gives the system overview, Section III describes the system architecture, Section IV presents the experiments and results, Section V gives the conclusion, and Section VI lists the references.

II. SYSTEM OVERVIEW
Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to many services. The speech wave itself contains linguistic information that includes meaning the speaker wishes to impart, the speaker's vocal characteristics and the speaker's emotion. Speech recognition is the process of automatically extracting and determining linguistic information conveyed by a speech wave using computers or electronic circuits.

III. ASR SYSTEM ARCHITECTURE
This section describes how to build a simple yet complete and representative Arabic speaker recognition system. Such a system has potential in many applications. By checking the voice characteristics of the input utterance with an automatic speaker recognition system similar to the one built in this paper, a system can contribute additional knowledge to the Arabic speech field. The ASR system architecture presented here comprises several phases, which are described in detail as follows:

A. Data Acquisition (Recorded Arabic Voice)
Because no Arabic speech database was available, our own database was collected from speakers who speak Arabic fluently. The speakers were recorded one by one with the same recorder, in the same environment and with the same equipment, speaking the Arabic digit words from wahid to ashrah (one to ten). This yields about 5000 time series (10 words × 10 repetitions × 50 speakers) of 13 frequencies each. The goal is to create sufficient speech samples for each Arabic speaker (Speaker1, Speaker2, ..., Speaker50).

B. Arabic Signal Pre-processing
Voice signal samples cannot be fed into the recognizer directly, because the speech signal is non-stationary and the samples are highly redundant. It is therefore important to pre-process the speech signal to eliminate redundant information and extract useful information. Pre-processing can improve the performance of speech recognition and make recognition more robust. Five pre-processing techniques can be used to enhance feature extraction: endpoint detection, pre-emphasis, silence removal, windowing and autocorrelation.

Pre-emphasis
Speech radiated from the mouth loses information at high frequencies, so a pre-emphasis step is needed to compensate for the high-frequency loss. Each frame is emphasized by a high-frequency filter; pre-emphasis is a first-order high-pass filter. After this step, essentially only the vocal tract component of the speech remains, which makes the speech parameters simpler to analyze [10].
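As an illustration of the first-order high-pass filter described above, the following minimal sketch applies y[n] = x[n] - alpha * x[n-1]. The function name `pre_emphasis` and the coefficient `alpha = 0.97` are assumptions chosen for illustration (values near 0.95 to 0.97 are common in practice); the paper does not state which coefficient was used.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha is an assumed, illustrative coefficient, not a value from the paper.
    """
    x = np.asarray(signal, dtype=float)
    if x.size == 0:
        return x
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

# A constant (zero-frequency) signal is almost entirely suppressed,
# which is the expected behaviour of a high-pass filter.
y = pre_emphasis(np.ones(5))
```

Low-frequency content is attenuated while rapid sample-to-sample changes pass through, which boosts the high-frequency part of the spectrum relative to the low-frequency part.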

Dynamic time warping (DTW)
Dynamic time warping (DTW) finds a warp path that aligns two time series. The warping can then be used to find corresponding regions between the two time series or to determine their similarity. The goal is a dynamic time warping algorithm that is linear in both time and space complexity and that finds a nearly optimal warp path between two time series [33], [34]. This paper adopts the FastDTW algorithm, which finds an accurate approximation of the optimal warp path between two time series. The time series are first sampled down to a very low resolution, and a warp path is found at the lowest resolution. That path is then projected onto the next higher resolution and refined, and the process repeats up to the full resolution.
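To make the notion of a warp path concrete, the sketch below implements the classic full dynamic-programming DTW, which is quadratic in time and space. It is not the FastDTW approximation; it is the exact baseline that FastDTW approximates. The function name `dtw` and the absolute-difference local cost are assumptions for illustration.

```python
import numpy as np

def dtw(a, b):
    """Classic O(n*m) DTW. Returns (total cost, warp path) for 1-D sequences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance (assumed: absolute difference)
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end of both series to recover the warp path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

# The second series repeats the value 2, so one sample of the first series
# is aligned with two samples of the second.
distance, path = dtw([1, 2, 3], [1, 2, 2, 3])
```

Each pair `(i, j)` in `path` states that sample `i` of the first series corresponds to sample `j` of the second, which is the "corresponding regions" idea in the text.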

Noise Elimination
The biggest problem in speech recognition systems is environmental noise. A model trained in advance may be inaccurate at test time; the best results are obtained when testing is done in exactly the same room in which the training data were recorded.

Silence Detection and Removal
Voice Activity Detection (VAD) is the technique used to scan the speech signal from beginning to end and delete all values below a specified value, which are treated as noise. Endpoint detection, which finds the end of a word, generally uses one of two methods: one based on entropy-spectral properties and the other on a double threshold. In this paper we used the double threshold technique.

Double Threshold
The double threshold technique is used to detect the endpoints of the speech signal, because it can distinguish voiced from unvoiced speech. If threshold1 > ratio (where ratio is a preset zero crossing rate), the frame is a speech signal, that is, the speech head has been found. Conversely, if threshold2 < ratio, the speech signal is over, which means the speech tail has been found. The signal between head and tail is the useful signal, and the thresholds can be adjusted for a noisy environment, as shown in Figure 5. More generally, the endpoints of the speech are checked using the average energy, or the product of the average amplitude and the zero crossing rate, as in equation (6). The average energy is defined in terms of x(n), the speech signal; N, the frame length; m, the frame shift; and w(m), the window function. Windowing avoids the truncation effect of framing, so it is necessary when extracting each frame of the sound signal; windowing is described in detail in a later section [6]. The zero crossing rate is the other quantity used during detection. It indicates the number of times that a frame of the speech waveform crosses the horizontal axis. Zero crossing analysis is one of the simplest methods in time-domain speech analysis [9].
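The two quantities used for endpoint detection can be sketched as follows, assuming the standard short-time definitions: the frame energy is the sum of squared windowed samples, and the zero crossing count is half the sum of |sgn x(n) - sgn x(n-1)| over the frame. The function names, the rectangular window, and the energy-only head/tail rule with `high` and `low` thresholds are assumptions for illustration; the paper's exact threshold rule and equation (6) are not reproduced here.

```python
import numpy as np

def frame_energy_and_zcr(x, frame_len, hop):
    """Short-time energy and zero-crossing count per rectangular-window frame."""
    x = np.asarray(x, dtype=float)
    starts = range(0, len(x) - frame_len + 1, hop)
    energy = np.array([np.sum(x[s:s + frame_len] ** 2) for s in starts])
    zcr = np.array([
        0.5 * np.sum(np.abs(np.diff(np.sign(x[s:s + frame_len]))))
        for s in starts
    ])
    return energy, zcr

def detect_endpoints(energy, high, low):
    """Return (head, tail) frame indices, or None if no frame exceeds `high`.

    Assumed rule: frames above `high` are certainly speech; the segment is
    then extended outward while the energy stays above `low`.
    """
    strong = np.flatnonzero(energy > high)
    if strong.size == 0:
        return None
    head, tail = strong[0], strong[-1]
    while head > 0 and energy[head - 1] > low:
        head -= 1
    while tail < len(energy) - 1 and energy[tail + 1] > low:
        tail += 1
    return int(head), int(tail)

# Toy signal: silence, then an alternating +1/-1 burst, then silence.
signal = np.concatenate([np.zeros(40), np.tile([1.0, -1.0], 40), np.zeros(40)])
energy, zcr = frame_energy_and_zcr(signal, frame_len=20, hop=20)
endpoints = detect_endpoints(energy, high=10.0, low=1.0)
```

In a noisy recording both thresholds would need to be raised, which matches the remark that the thresholds are adjustable.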

C. Feature Extraction
Feature extraction is a core step in speaker recognition technology. It is the process of retaining useful statistics from a speech signal while eliminating unwanted components such as noise. Here, the original acoustic waveform is converted into a compact representation of the signal. A feature extraction method produces the series of feature vectors that represents the speech signal compactly. The feature vectors extracted from the original signal in the feature extraction module emphasize speaker-specific attributes and suppress statistical redundancy [9]. The system operates in three phases: a pre-processing phase, a training phase and a decision phase.
This paper uses Mel Frequency Cepstral Coefficients (MFCCs) for feature extraction; they are perhaps the best known and most popular features and are used in most recent research. The popularity of this method can be explained by its low computational cost compared with FFT, LPCC and LPC based techniques [1], [2], [4], [6], [10], [19].
MFCCs are based on the known variation of the human ear's critical bandwidths with frequency. Filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech and of the speaker. The steps for computing MFCCs are described in more detail as follows:

Frame Blocking
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N - M samples, and so on. This process continues until all the speech is accounted for within one or more frames. Typical values are N = 256 (which is equivalent to about 20 ms of windowing and facilitates the fast radix-2 FFT) and M = 30.
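Frame blocking with the stated values N = 256 and M = 30 can be sketched as below. The function name `frame_signal` is hypothetical, and this sketch simply drops any trailing samples that do not fill a complete frame, which is one of several possible conventions (zero-padding the last frame is another).

```python
import numpy as np

def frame_signal(x, N=256, M=30):
    """Split x into overlapping frames of N samples, shifted by M samples.

    Returns an array of shape (n_frames, N). Trailing samples that do not
    fill a whole frame are dropped (an assumed convention).
    """
    x = np.asarray(x, dtype=float)
    if len(x) < N:
        return np.empty((0, N))
    n_frames = 1 + (len(x) - N) // M
    # Row r holds the sample indices r*M, r*M + 1, ..., r*M + N - 1.
    idx = np.arange(N)[None, :] + M * np.arange(n_frames)[:, None]
    return x[idx]

# With 316 samples there is room for exactly three frames: 0-255, 30-285, 60-315.
frames = frame_signal(np.arange(316))
```

Consecutive rows share N - M = 226 samples, which is the overlap described in the text.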

Rectangular Windows (Windowing)
The choice of window determines the nature of the short-time average energy of the speech signal. During the experiments, the study found that the window length plays a very important role in the filter design. If the window is too long, the pass band of the filter is narrow. Conversely, if the window is too short, the pass band of the filter is wide and the signal is represented as more evenly distributed [19], [20]. Compared with the rectangular window, the Hamming window has a wider main lobe and a lower side-lobe level. The choice of window is critical for the analysis of the speech signal: a rectangular window easily loses the details of the waveform, whereas the Hamming window is more effective at reducing spectral leakage and has a smoother low-pass effect. Therefore, the rectangular window is better suited to processing signals in the time domain, and the Hamming window is used more in the frequency domain.

The concept here is to minimize spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If the window is defined as $w(n)$, $0 \le n \le N-1$, where N is the number of samples in each frame, then the result of windowing the frame $x(n)$ is the signal

\[ y(n) = x(n)\,w(n), \quad 0 \le n \le N-1. \]

Typically the Hamming window is used, which has the form

\[ w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1. \]

Each windowed frame is then converted from the time domain to the frequency domain with the fast Fourier transform (FFT). The positive frequencies $0 \le f < F_s/2$ correspond to the first N/2 FFT values, where $F_s$ denotes the sampling frequency. The result after this step is often referred to as the spectrum or periodogram.
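The windowing and spectrum steps can be sketched as follows, assuming the standard Hamming coefficients 0.54 and 0.46 and a real-input FFT. The function names and the 1/N scaling of the periodogram are assumptions for illustration; other normalizations are also in common use.

```python
import numpy as np

def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), 0 <= n <= N - 1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

def power_spectrum(frame):
    """Periodogram of one Hamming-windowed frame (non-negative frequencies only)."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * hamming(len(frame))   # taper both ends of the frame
    spectrum = np.fft.rfft(windowed)         # bins 0 .. N/2
    return np.abs(spectrum) ** 2 / len(frame)

# A sinusoid with exactly 32 cycles in a 256-sample frame peaks at FFT bin 32.
n = np.arange(256)
tone = np.sin(2 * np.pi * 32 * n / 256)
peak_bin = int(np.argmax(power_spectrum(tone)))
```

Bin k corresponds to the frequency k * F_s / N, so the peak location converts directly to Hz once the sampling frequency is known.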

Mel-frequency Wrapping
Psychophysical studies have shown that human perception of the frequency content of speech sounds does not follow a linear scale. Thus, for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [20], [24], [25], [26], [27]. One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the mel scale. Each filter has a triangular band-pass frequency response, and both the spacing and the bandwidth are determined by a constant mel frequency interval. The number of mel spectrum coefficients, K, is typically chosen as 20. The paper applies the filter bank in the frequency domain, which simply amounts to applying the triangle-shaped windows shown in Figure 12 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins overlap) in the frequency domain.
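A triangular mel filter bank with K = 20 filters can be sketched as below. The conversion mel(f) = 2595 log10(1 + f / 700) is a widely used approximation of the mel scale and is an assumption here, since the paper does not state which formula it used. The sampling frequency of 12500 Hz is also hypothetical, chosen only so that 256 samples span about 20 ms.

```python
import numpy as np

def hz_to_mel(f):
    """Common mel approximation (assumed): roughly linear below 1 kHz, log above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(K=20, n_fft=256, fs=12500.0):
    """K triangular filters over the n_fft // 2 + 1 non-negative FFT bins."""
    n_bins = n_fft // 2 + 1
    # K + 2 points equally spaced on the mel scale give K overlapping triangles.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), K + 2))
    bin_hz = np.linspace(0.0, fs / 2, n_bins)
    bank = np.zeros((K, n_bins))
    for k in range(K):
        lo, mid, hi = edges_hz[k], edges_hz[k + 1], edges_hz[k + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        # The triangle is the lower of the two ramps, clipped at zero.
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = mel_filterbank()
# Applying the bank to a power spectrum gives K mel spectrum coefficients.
mel_spectrum = bank @ np.ones(129)
```

Each row of `bank` acts as one overlapping histogram bin: multiplying it with the periodogram sums the spectral energy that falls under that triangle.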

Cepstrum
In the final step, the study converts the log mel spectrum back to the time domain. The result is called the mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, they can be converted to the time domain using the Discrete Cosine Transform (DCT).
Note that the study excluded the first component, $c_0$, from the DCT, since it represents the mean value of the input signal, which carries little speaker-specific information [19].
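The cepstrum step can be sketched with the commonly used DCT form c_n = sum over k = 1..K of log(S_k) cos(n (k - 1/2) pi / K), where S_k are the K mel spectrum coefficients. This form, the function name, and the choice of keeping 12 coefficients are assumptions for illustration; the paper does not give its exact equation or coefficient count at this point.

```python
import numpy as np

def mfcc_from_mel_spectrum(mel_power, n_coeffs=12):
    """Return c_1 .. c_n_coeffs from K mel power values; c_0 is dropped."""
    S = np.asarray(mel_power, dtype=float)
    K = len(S)
    log_S = np.log(S + 1e-12)               # small offset avoids log(0)
    n = np.arange(n_coeffs + 1)[:, None]    # n = 0 .. n_coeffs
    k = np.arange(1, K + 1)[None, :]        # k = 1 .. K
    basis = np.cos(n * (k - 0.5) * np.pi / K)
    c = basis @ log_S
    return c[1:]                            # exclude c_0 (overall log-energy level)

# A perfectly flat mel spectrum has no spectral shape, so all retained
# coefficients are (numerically) zero; only the dropped c_0 would be non-zero.
flat = mfcc_from_mel_spectrum(np.full(20, 5.0))
```

Scaling the whole spectrum by a constant changes only c_0, which illustrates why dropping it removes level information while keeping spectral shape.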

Training Phase
Vector quantization (VQ) must be able to represent the computed feature vectors compactly. Storing every single vector generated in training mode is impossible, since these vectors are defined over a high-dimensional space. It is often easier to start by quantizing each feature vector to one of a relatively small number of template vectors, a process called vector quantization. VQ takes a large set of feature vectors and produces a smaller set of vectors that represent the centroids of the features. Using these training data, the features are clustered to form a codebook for each acoustic word [12]. Finally, the trained features are saved so that they can be used and edited later without the aid of any further programming.

Feature Matching
The VQLBG technique maps vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and is represented by its center, called a codeword; the collection of all codewords is called a codebook [1], [2], [8], [21]. For feature matching of the speech signal, the Vector Quantization (VQ) technique was used, which included plotting the VQ codebook and implementing the well-known algorithm developed by Linde, Buzo and Gray, called the LBG algorithm [1].
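The LBG procedure can be sketched as repeated codeword splitting followed by nearest-neighbour assignment and centroid updates. The function name `lbg`, the splitting factor `eps`, and the stopping tolerance are assumptions for illustration; the paper does not report the values it used. The multiplicative split assumes no codeword is exactly the zero vector.

```python
import numpy as np

def lbg(vectors, n_codewords, eps=0.01, tol=1e-3, max_iter=100):
    """Train a codebook by binary splitting; n_codewords should be a power of 2."""
    X = np.asarray(vectors, dtype=float)
    codebook = X.mean(axis=0, keepdims=True)          # start with one centroid
    while len(codebook) < n_codewords:
        # Split every codeword into two slightly perturbed copies.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        for _ in range(max_iter):
            # Nearest-neighbour search: distance from each vector to each codeword.
            d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # Centroid update: move each codeword to the mean of its cluster.
            for j in range(len(codebook)):
                members = X[nearest == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)
            distortion = d.min(axis=1).mean()
            if prev - distortion <= tol * distortion:
                break                                  # distortion has stopped improving
            prev = distortion
    return codebook

# Two well-separated groups of toy 2-D feature vectors give two codewords.
toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],
                [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]])
codebook = lbg(toy, n_codewords=2)
```

In a speaker recognition setting, one such codebook would be trained per speaker from that speaker's MFCC vectors.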

Figure 14: Mel Frequency Cepstrum for Speech Ethnan
The Euclidean distance from each feature vector to the codewords of a given codebook is calculated, and the least distance found is denoted the VQ distortion. Distortions are computed in the same way for the remaining feature vectors and summed. The same procedure is repeated for the rest of the speakers. The smallest sum of VQ distortions identifies the desired speaker [7], [8], [9].
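The matching rule described above can be sketched directly: for each speaker's codebook, sum the distance from every test vector to its nearest codeword, then choose the speaker with the smallest total. The function names and the toy codebooks are hypothetical and serve only to illustrate the rule.

```python
import numpy as np

def vq_distortion(features, codebook):
    """Sum over feature vectors of the Euclidean distance to the nearest codeword."""
    X = np.asarray(features, dtype=float)
    C = np.asarray(codebook, dtype=float)
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    return d.min(axis=1).sum()

def identify_speaker(features, codebooks):
    """codebooks: dict mapping speaker name -> codebook array."""
    return min(codebooks, key=lambda name: vq_distortion(features, codebooks[name]))

# Hypothetical codebooks for two speakers and a short test utterance.
codebooks = {
    "speaker_a": np.array([[0.0, 0.0], [1.0, 1.0]]),
    "speaker_b": np.array([[5.0, 5.0], [6.0, 6.0]]),
}
test_features = np.array([[5.1, 4.9], [6.0, 6.2]])
best = identify_speaker(test_features, codebooks)
```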

D. Gaussian Mixture Model (GMM)
A Gaussian mixture model (GMM) forms clusters as a mixture of multivariate normal density components. For a given observation, the GMM assigns a posterior probability to each component density (or cluster). The posterior probabilities indicate that the observation has some probability of belonging to each cluster. A GMM can perform hard clustering by selecting the component that maximizes the posterior probability as the assigned cluster for the observation. A GMM is more appropriate than k-means clustering when clusters have different sizes and different correlation structures within them. The paper designed a GMM for Arabic speaker recognition with mixture components and diagonal covariance matrices.

Gaussian mixtures are combinations of Gaussian, or normal, distributions. When feature vectors are displayed in d-dimensional feature space after clustering, they look approximately Gaussian distributed. This means that each matching cluster can be viewed as a Gaussian probability distribution, and the features fitting the clusters can be best characterized by their probability values [35], [36], [37], [38]. The use of a Gaussian mixture density for Arabic speaker identification is motivated by two facts [14], [15]:
• Individual Gaussian classes are interpreted as representing a set of acoustic classes. These acoustic classes represent speaker vocal tract information.
• A Gaussian mixture density provides a smooth approximation to the distribution of feature vectors in multi-dimensional feature space.

A mixture of Gaussians can be written as a weighted sum of Gaussian densities. Recall that the Gaussian probability density function (pdf) for the d-dimensional random vector x, with mean vector $\mu$ and covariance matrix $\Sigma$, is given by [26]:

\[ \mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right). \]

The parameters of the mixture density are the number of Gaussians, their weighting factors, and the mean vector and covariance matrix of each Gaussian function.
To find these parameters and fit the probability density function optimally to a set of data, an iterative algorithm, the expectation-maximization (EM) algorithm, can be used [3], [7], [19], [26].
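The EM iteration for a mixture p(x) = sum over i of w_i N(x; mu_i, Sigma_i) with diagonal covariance matrices can be sketched as below. The function names, the initialization from evenly spaced training vectors, the fixed iteration count and the variance floor are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log density of each row of X under each diagonal Gaussian: shape (T, M)."""
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2 * np.pi * variances), axis=1))

def fit_gmm_em(X, n_components, n_iter=50, var_floor=1e-6):
    """Fit weights, means and diagonal variances of a GMM with EM."""
    X = np.asarray(X, dtype=float)
    T, d = X.shape
    # Assumed initialization: means taken from evenly spaced training vectors.
    means = X[np.linspace(0, T - 1, n_components).astype(int)].copy()
    variances = np.tile(X.var(axis=0) + var_floor, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each vector.
        log_p = np.log(weights) + log_gauss_diag(X, means, variances)
        log_p -= log_p.max(axis=1, keepdims=True)      # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the posteriors.
        Nk = resp.sum(axis=0) + 1e-12
        weights = Nk / T
        means = (resp.T @ X) / Nk[:, None]
        variances = (resp.T @ X ** 2) / Nk[:, None] - means ** 2
        variances = np.maximum(variances, var_floor)   # keep variances positive
    return weights, means, variances

# Synthetic 2-D data drawn from two unit-variance clusters.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(10.0, 1.0, (200, 2))])
weights, means, variances = fit_gmm_em(data, n_components=2)
```

With diagonal covariances, each component needs only d variances instead of a full d-by-d matrix, which keeps the number of parameters small for short training utterances.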

E. Recognition
The GMM assumes that the vector space is divided into specific components according to the clustering of the feature vectors, and models the feature vector distribution in each component as Gaussian. Because it is initially unknown which vector belongs to which component, a likelihood maximization algorithm is used to obtain the optimal classification. For testing, the posterior probability of the test utterance is calculated for each reference speaker, and the speaker whose Gaussian distribution maximizes it is taken as the identity of the unknown speaker [17], [35], [38]. The words uttered by any Arabic speaker belong to a specific cluster that differs from the clusters of other Arabic speakers. This is the basis on which an utterance is verified, recognized and assigned to a known speaker.
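The decision rule can be sketched as scoring the test feature vectors under every speaker's GMM and choosing the highest score. This sketch assumes equal prior probabilities for all speakers, so that maximizing the posterior probability reduces to maximizing the likelihood. The function names and the single-component toy models are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of X under a diagonal-covariance GMM."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_weighted = np.log(weights) + log_comp
    # Log-sum-exp over components, computed stably.
    peak = log_weighted.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return frame_ll.mean()

def recognize(X, speaker_models):
    """speaker_models: dict mapping name -> (weights, means, variances)."""
    scores = {name: gmm_log_likelihood(X, *params)
              for name, params in speaker_models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical one-component models for two speakers.
models = {
    "speaker_a": (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]])),
    "speaker_b": (np.array([1.0]), np.array([[5.0, 5.0]]), np.array([[1.0, 1.0]])),
}
utterance = np.array([[4.8, 5.1], [5.2, 4.9]])
best, scores = recognize(utterance, models)
```

Averaging the per-frame log-likelihood keeps scores comparable between utterances of different lengths.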

IV. EXPERIMENT AND RESULT
The study applied pattern recognition techniques to build speaker identification reference models from the trained features, which can then recognize any sequence of acoustic vectors uttered by an unknown speaker. The VQLBG-based pattern recognition technique was used to build speaker reference models from the speakers' vectors in the training phase and then to identify any sequence of acoustic vectors uttered by an unknown speaker [1], [2], [3], [14], [18]. The GMM models were used to compute the pairwise match between the codewords of each speaker and the trained vectors in the iterative classification process. Train and test programs, which require three functions (MFCC, VQLBG and GMM) to simulate the training, testing and recognition procedures of the Arabic speaker recognition system, were implemented. The results were compared and evaluated to obtain a better recognition rate [8], [12], [14], [19], [21]. All experiments were implemented in Matlab 2014 on an Intel(R) Core(TM) i5 CPU M 370 @ 2.40GHz.

V. CONCLUSION
The results obtained using MFCC with the VQLBG algorithm were evaluated carefully. The accuracy reported for VQLBG was 75.6%. The study then applied the GMM method for Arabic speaker identification and recognition, and the accuracy obtained was 95.5%. The GMM model is therefore the more attractive of the two when compared with the VQLBG algorithm. The study concluded that better results were obtained with the GMM model than with the VQLBG algorithm [14], [20], [35], [36], [37], [38].