International Journal of Innovative Research in Advanced Engineering (IJIRAE), ISSN: 2349-2163, Volume 1 Issue 8 (September 2014). © 2014, IJIRAE. All Rights Reserved.
Speaker Identification based on GFCC using GMM

Md. Moinuddin, M. Tech. Student, E&CE Dept., PDACE
Arunkumar N. Kanthi, Asst. Professor, E&CE Dept., PDACE

Abstract: The performance of a conventional speaker identification system degrades drastically in the presence of noise. The ability of the human ear to identify a speaker in a noisy environment motivates the use of an auditory-based feature called the gammatone frequency cepstral coefficient (GFCC). The GFCC is based on the gammatone filter bank, which models the basilar membrane as a series of overlapping band-pass filters. A speaker identification system using GFCC features and GMMs has been developed and analysed using the TIMIT and NTIMIT databases. The performance of the system is compared with a baseline system using the traditional MFCC features. The results show that the GFCC features give good recognition performance not only in clean speech but also in noisy environments.

Keywords: Auditory based feature, Gammatone Frequency Cepstral Coefficient (GFCC), MFCC, GMM, EM algorithm

I. INTRODUCTION

Speaker identification determines which of the enrolled speakers a given utterance has come from. The utterance can be constrained to a known phrase (text-dependent) or totally unconstrained (text-independent). The task consists of feature extraction, speaker modeling and decision making. The features typically extracted are Mel-frequency cepstral coefficients (MFCCs). For speaker modeling, Gaussian mixture models (GMMs) are widely used to describe the feature distributions of individual speakers. Recognition decisions are usually made based on the likelihoods of observing the feature frames given a speaker model.
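The likelihood-based decision described above can be illustrated with a minimal sketch. It assumes diagonal-covariance GMMs, which the text does not specify, and all function and parameter names (`gmm_log_likelihood`, `identify`, `speaker_models`) are hypothetical. Each model is a tuple of mixture weights, component means and component variances.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of all frames under one diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        component_logs = []
        for w, mu, var in zip(weights, means, variances):
            # log of w_k * N(x; mu_k, diag(var_k)), summed over dimensions
            log_p = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                log_p -= 0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
            component_logs.append(log_p)
        # log-sum-exp over components for numerical stability
        peak = max(component_logs)
        total += peak + math.log(sum(math.exp(c - peak) for c in component_logs))
    return total


def identify(frames, speaker_models):
    """Return the name of the speaker whose model scores the frames highest."""
    scores = {name: gmm_log_likelihood(frames, *model)
              for name, model in speaker_models.items()}
    return max(scores, key=scores.get)
```

Summing per-frame log-likelihoods treats the frames as independent, which is the usual simplification in GMM-based speaker identification.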
The poor performance of MFCCs in noisy or mismatched conditions can be attributed to the use of triangular filters for modelling the auditory critical bands. To model the cochlear filters more accurately, gammatone filters are used instead of triangular filters, and the extracted features are called gammatone frequency cepstral coefficients (GFCCs).

II. THE SYSTEM MODEL

The speaker identification system consists of two parts: a front-end and a back-end. The front-end is a feature extractor, while the back-end consists of a classifier and a reference database.

Figure 1: Architecture of speaker identification system (front-end: GFCC extractor fed by train and test utterances; back-end: GMM modeling, database, ML classifier, identification result)

The main task of the front-end is to extract features from a speech signal. The aim is to represent the characteristics of the speech signal sufficiently while reducing redundancy. Features are extracted frame by frame: one feature vector is calculated for every frame. After feature extraction, the sequence of feature vectors is passed to the back-end of the speaker identification system. Based on the feature vectors, the back-end selects the most likely speaker from all the candidates in the reference database. During training, a statistical model is estimated for each speaker and stored in the database. When an unknown utterance is presented, its feature vectors are obtained. The classifier computes the log-likelihood of these feature vectors under each stored model and selects the speaker whose model gives the maximum value.

III. GFCC EXTRACTION

The GFCC features are based on the gammatone filter bank (GTFB).
The feature vectors are calculated from the spectra of a series of windowed speech frames. Figure 2 shows the block diagram of GFCC extraction.

Figure 2: Block diagram of GFCC extraction (speech utterance → pre-emphasis → framing & windowing → DFT & $|\cdot|^2$ → GTFB → logarithmic compression → DCT → GFCC features)

Pre-emphasis stage: The high-frequency components of a speech signal have low amplitude compared to the low-frequency components because of the radiation effect of the lips. To flatten the speech spectrum, i.e. to obtain similar amplitude for all frequency components, the speech signal is passed through a pre-emphasis filter. This is a first-order FIR digital filter that effectively removes the spectral contribution of the lips. Speech sounds much sharper after pre-emphasis. The transfer function of the pre-emphasis filter is given by

$H(z) = 1 - a z^{-1}$ … (1)

where $a$ is a constant with a typical value of 0.97.

Fig 3: Pre-emphasis operation

Framing & Windowing: A speech signal is non-stationary, i.e. its statistical characteristics vary with time. Since the glottal system cannot change instantaneously, speech can be considered quasi-stationary over short segments of time (20-30 ms). The speech signal is therefore split into frames of 20 ms. When the signal is framed, the edges of each frame must be handled carefully, because abrupt truncation introduces discontinuities at the frame boundaries. A windowing function is therefore used to taper the edges. The window should have a narrow main lobe and attenuated side lobes. Therefore the Hamming window is the preferred choice.
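The pre-emphasis and framing steps described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the 10 ms hop between frame starts is an assumed parameter.

```python
def pre_emphasis(signal, a=0.97):
    """Apply y[n] = x[n] - a * x[n-1], the time-domain form of H(z) = 1 - a z^-1."""
    if not signal:
        return []
    out = [signal[0]]  # the first sample has no predecessor
    for n in range(1, len(signal)):
        out.append(signal[n] - a * signal[n - 1])
    return out


def split_frames(signal, sample_rate, frame_ms=20, hop_ms=10):
    """Split a signal into overlapping frames of frame_ms milliseconds.

    hop_ms is the assumed spacing between frame starts. Trailing samples
    that do not fill a complete frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop_len
    return frames
```

At a 16 kHz sampling rate, a 20 ms frame holds 320 samples and a 10 ms hop advances by 160 samples, so consecutive frames share half their samples.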
The Hamming window is given by

$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$ … (2)

where $N$ is the number of samples in a frame.

Fig 4: Windowing operation

As a consequence of windowing, the samples are not assigned the same weight in the subsequent computations. For this reason it is sensible to use overlapping frames (10 ms overlap).

DFT: The windowed frame is transformed using a discrete Fourier transform, and only the magnitude is kept, since the phase is assumed to carry little speaker-specific information.

Fig 5: DFT operation

Gammatone filter bank stage: The gammatone filter bank consists of a series of band-pass filters that model the frequency selectivity of the basilar membrane. The impulse response of each filter is given by

$g_m(t) = a\, t^{n-1} e^{-2\pi b_m t} \cos(2\pi f_m t + \phi), \quad 1 \le m \le M$ … (3)

where $a$ is a constant (usually equal to 1), $n$ is the filter order (here $n = 4$), $\phi$ is the phase shift, $f_m$ is the centre frequency, and $b_m$ is the attenuation factor of the filter. The attenuation factor is related to the bandwidth of the filter and determines the decay rate of the impulse response.

Fig 6: Frequency response of 64 channel gammatone filter bank

The centre frequency of the m-th gammatone filter can be determined by the equation

… (4)

where $f_L$ and $f_H$ are the lower and upper frequencies of the filter bank. The bandwidth of each filter is described by an Equivalent Rectangular Bandwidth (ERB). The ERB is a psychoacoustic measure of the width of the auditory filter at each point along the cochlea.
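The windowing, DFT and gammatone impulse response steps of this section can be sketched as follows. The sketch assumes the standard Hamming window and the standard gammatone form $a\,t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi)$. The function names are hypothetical, and the direct O(N²) DFT is used only for clarity where a real system would use an FFT.

```python
import cmath
import math


def hamming_window(length):
    """Return Hamming window coefficients 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def dft_magnitude(frame):
    """Magnitude of the DFT for bins 0..N/2 (direct evaluation, not an FFT)."""
    size = len(frame)
    magnitudes = []
    for k in range(size // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / size)
                  for n, x in enumerate(frame))
        magnitudes.append(abs(acc))
    return magnitudes


def gammatone_impulse_response(t, centre_hz, b_hz, order=4, a=1.0, phase=0.0):
    """Evaluate the assumed gammatone impulse response at time t >= 0 (seconds)."""
    envelope = a * t ** (order - 1) * math.exp(-2 * math.pi * b_hz * t)
    return envelope * math.cos(2 * math.pi * centre_hz * t + phase)
```

A frame would be multiplied sample by sample with `hamming_window(len(frame))` before being passed to `dft_magnitude`.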
The equation for the ERB is

$\mathrm{ERB}(f_m) = 24.7 \left(4.37 \frac{f_m}{1000} + 1\right)$ … (5)

where $f_m$ is the centre frequency in Hz. The bandwidth of each filter is described in terms of the ERB as

$b_m = 1.019\, \mathrm{ERB}(f_m)$ … (6)

The FFT magnitude coefficients are binned by correlating them with each gammatone filter, i.e. each FFT magnitude coefficient is multiplied by the gain of the corresponding filter at that frequency and the results are accumulated. Thus, each bin holds the spectral magnitude in that filter bank channel.

… (7)

Fig 7: Filter bank processing

Logarithmic compression & Discrete Cosine Transform (DCT) stage: The logarithm is applied to each filter output to approximate the loudness perceived by humans for a given signal intensity. It also turns the product of the excitation (source) spectrum produced by the vocal cords and the filter spectrum of the vocal tract into a sum, so that the two can be separated. Since the log-power spectrum is real, the DCT is applied to the log filter outputs, which produces highly uncorrelated features. The envelope of the vocal tract changes slowly and thus appears at low quefrencies (lower-order cepstrum), while the periodic excitation appears at high quefrencies (higher-order cepstrum).

… (8)
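The ERB bandwidth, filter bank binning, logarithmic compression and DCT steps can be sketched as follows. The sketch makes several assumptions: the ERB uses the common form 24.7(4.37f/1000 + 1) with f in Hz, the filter gains at each FFT bin are supplied by the caller, and the DCT is an unnormalised DCT-II, which may differ in scaling and indexing from the form used in the paper. All function names are hypothetical.

```python
import math


def erb(centre_hz):
    """Equivalent rectangular bandwidth in Hz at a given centre frequency in Hz."""
    return 24.7 * (4.37 * centre_hz / 1000.0 + 1.0)


def bandwidth(centre_hz):
    """Attenuation factor b_m = 1.019 * ERB(f_m)."""
    return 1.019 * erb(centre_hz)


def bin_spectrum(magnitudes, filter_gains):
    """Accumulate FFT magnitudes weighted by each filter's gain.

    filter_gains holds one list of per-bin gains for every channel.
    """
    return [sum(m * g for m, g in zip(magnitudes, gains))
            for gains in filter_gains]


def log_dct(channel_values, num_coeffs, floor=1e-10):
    """Log-compress the channel outputs, then apply an unnormalised DCT-II."""
    logs = [math.log(max(v, floor)) for v in channel_values]  # floor avoids log(0)
    count = len(logs)
    return [sum(x * math.cos(math.pi * k * (m + 0.5) / count)
                for m, x in enumerate(logs))
            for k in range(num_coeffs)]
```

Keeping only the first few DCT coefficients retains the slowly varying spectral envelope and discards the high-quefrency excitation detail.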