Computational Linguistics and Chinese Language Processing
vol. 1, no. 1, August 1996, pp. 1-36
Computational Linguistics Society of R.O.C.

A Survey on Automatic Speech Recognition with an Illustrative Example on Continuous Speech Recognition of Mandarin

Chin-Hui Lee*, Biing-Hwang Juang*

* Multimedia Communications Research Lab, Bell Laboratories, Lucent Technologies, 600 Mountain Ave., Murray Hill, NJ 07974, U.S.A. E-mail: {chl, bhj}@research.bell-labs.com

Abstract

For the past two decades, research in speech recognition has been intensively carried out worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small vocabulary keyword recognition over dial-up telephone lines, to medium size vocabulary voice interactive command and control systems on personal computers, to large vocabulary speech dictation, spontaneous speech understanding, and limited-domain speech translation. In this paper we review some of the key advances in several areas of automatic speech recognition. We also illustrate, by examples, how these key advances can be used for continuous speech recognition of Mandarin. Finally, we elaborate on the requirements in designing successful real-world applications and address technical challenges that need to be harnessed in order to reach the ultimate goal of providing an easy-to-use, natural, and flexible voice interface between people and machines.

Keywords: hidden Markov modeling, dynamic programming, speech recognition, acoustic modeling, Mandarin speech recognition, spoken language systems

1. Introduction

In the past few years a significant portion of the research in speech processing has gone into studying practical methods for automatic speech recognition (ASR). Much of this effort has been stimulated by the Advanced Research Projects Agency (ARPA), formerly known as D(efense)ARPA, which has funded research on three large vocabulary recognition (LVR) projects, namely the Naval Resource Management (RM) task, the Air Travel Information System (ATIS) and the North American Business (NAB, previously known as the Wall Street Journal or WSJ) task. In addition, there is worldwide activity in multi-lingual, large vocabulary speech recognition because of the potential applications to voice-interactive database access and management (e.g. ATIS & RM), voice dictation (e.g. the discrete word recognizer [Jelinek 1985] and continuous speech recognition such as the NAB/WSJ task) and limited-domain spoken language translation. The Philips SPICOS system and its extensions, the CSELT system (currently in trial) for Eurorail information services, the Cambridge University systems, and the LIMSI effort are examples of the current activity in speech recognition research in Europe. In Japan, large vocabulary recognition systems are being developed based on the concepts of interpreting telephony and telephone directory assistance. In Taiwan and China, syllable-based recognizers have been designed to handle large vocabulary Mandarin dictation, which is of practical importance because keyboard entry of Chinese text requires a considerable amount of effort and training (e.g. [Lee et al. 1993]). In Canada, the most notable research project is the INRS 86,000-word isolated word recognition system.
In the United States, in addition to the research being carried out at AT&T and IBM, most of the effort is sponsored by ARPA, encompassing efforts by BBN (the BYBLOS system), CMU (the SPHINX systems), Dragon, Lincoln Laboratory, MIT (the Summit system and its extensions), SRI (the DECIPHER system), and many others in the ARPA Human Language Technology Program. A brief history of automatic speech recognition research can be found in the textbook on speech recognition by Rabiner and Juang (1993).

Although we have learned a great deal about how to build practical and useful speech recognition systems, a number of fundamental questions about the technology remain to which we have no definitive answers. It is clear that the speech signal is one of the most complex signals that we need to deal with. It is produced by the human vocal system and therefore is not easily characterized by a simple two-dimensional model of sound propagation. While there exist a number of sophisticated mathematical models which attempt to simulate the speech production system, their modeling capability is still limited. Some of these models can be found in the seminal text by Flanagan (1964). In addition to the inherent physiological complexity of the human vocal tract, the physical production system differs from one person to another. The observed speech signal is different each time, even for multiple utterances of the same sequence of words produced by the same person. Part of the reason that automatic speech recognition by machine is difficult is this inherent signal variability. In addition to the vast inherent differences across different speakers and different dialects, the speech signal is influenced by the transducer used to capture the signal, the channel used to transmit the signal, and the speaking environment, which can add noise to the speech signal or change the way the signal is produced (e.g. the Lombard effect shown in [Junqua et al. 1993]) in very noisy environments.

There have been many attempts to find so-called distinctive features of speech (e.g. [Fant 1973]) which are invariant to a number of factors. Certain distinctive (phonetic) features, such as nasality and voicing, can be used to represent the place and manner of articulation of speech sounds, so that speech can be uniquely identified by detecting the acoustic-phonetic properties of the signal. By organizing such knowledge in a systematic manner, speech recognition can (in theory) be performed by first identifying and labeling the sequence of feature vectors, then identifying the corresponding sounds in the speech signal, and finally decoding the corresponding sequence of words using lexical access to a dictionary of words. This has been demonstrated in spectrogram reading by a human expert who can visually segment and identify some speech sounds based on knowledge of the acoustic-phonetics of English. Although the collection of distinctive features, in theory, offers a set of invariant features for speech recognition, it is not generally used in most speech recognition systems. This is due to the fact that the set of distinctive features is usually difficult to identify in spontaneous continuous speech and the recognition results are generally unreliable.
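To make the acoustic-phonetic pipeline just described concrete, here is a minimal sketch in Python. It is not code from the paper: the feature bundles, phone labels, and the two-word dictionary are all hypothetical, and every stage is assumed to be error-free, which is exactly the assumption that fails in spontaneous continuous speech.

```python
# Toy sketch of acoustic-phonetic recognition: distinctive-feature bundles
# are labeled as phones, and the phone sequence is decoded into a word by
# lexical access. All tables below are hypothetical illustrations.

# Hypothetical mapping from (voicing, nasality, place) bundles to phone labels.
FEATURES_TO_PHONE = {
    ("voiced", "nasal", "bilabial"): "m",
    ("voiced", "oral", "low-back"): "aa",
    ("unvoiced", "oral", "alveolar"): "t",
    ("voiced", "nasal", "alveolar"): "n",
}

# Hypothetical pronunciation dictionary used for lexical access.
LEXICON = {
    ("m", "aa", "t"): "mat",
    ("m", "aa", "n"): "man",
}

def decode_utterance(feature_bundles):
    """Label each segment with a phone, then look the sequence up in the lexicon."""
    phones = tuple(FEATURES_TO_PHONE[bundle] for bundle in feature_bundles)
    return LEXICON.get(phones, "<out-of-vocabulary>")

segments = [
    ("voiced", "nasal", "bilabial"),    # detected as /m/
    ("voiced", "oral", "low-back"),     # detected as /aa/
    ("unvoiced", "oral", "alveolar"),   # detected as /t/
]
print(decode_utterance(segments))  # -> mat
```

A single misdetected feature bundle derails this hard lookup entirely, which is part of why the field moved toward the soft probabilistic scoring of the statistical approach described next.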
A more successful approach to automatic speech recognition is to treat the speech signal as a stochastic pattern and to adopt a statistical pattern recognition approach. For this approach we assume a source-channel speech generation model (e.g. [Bahl et al. 1983]), shown in Figure 1, in which the source produces a sequence of words, W. Because of uncertainty and inaccuracy in converting from words to speech, we model the conversion from W to an observed speech waveform, S, as a noisy channel. Speech recognition is then formulated as a maximum a posteriori (MAP) decoding problem, as shown in Figure 1. Instead of working with the speech signal S directly, one way to simplify the problem is to assume that S is first parametrically represented as a sequence of acoustic vectors A. We then use Bayes' rule to reformulate the decoding problem as follows:

    argmax_{W ∈ G} P(W|A) = argmax_{W ∈ G} P(A|W) × P(W),    (1)

where G is the set of all possible sequences of words, P(A|W) is the conditional probability of the acoustic vector sequence A given a particular sequence of words W, and P(W) is the a priori probability of generating the sequence of words W. The first term, P(A|W), is often referred to as an acoustic model, and the second term, P(W), is known as a language model. The noisy channel in Figure 1 is a model jointly characterizing the speech production system, the speaker variability, the speaking environment, and the transmission medium. Since it is not feasible to have complete knowledge about such a noisy channel, the statistical approach often assumes particular parametric forms for P_θ(A|W) and P_ω(W), i.e. according to specific models. All the parameters of the statistical models (i.e. θ and ω) needed in evaluating the acoustic probability, P_θ(A|W), and the language probability, P_ω(W), are usually estimated from a large collection (the so-called training set) of speech and text training data. This process is often referred to as model training or learning. We will discuss this important issue later in the paper.
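As a minimal illustration of the MAP decoding rule in Equation (1), the sketch below scores two candidate word sequences for one fixed acoustic input A. The probability values are invented for illustration; in a real recognizer P(A|W) would come from acoustic models such as HMMs, P(W) from a language model, and the argmax would be computed by dynamic programming search over a lexical network rather than by enumeration.

```python
import math

# Invented model scores for one fixed acoustic vector sequence A.
# acoustic[W] plays the role of P(A|W); prior[W] plays the role of P(W).
acoustic = {
    ("wreck", "a", "nice", "beach"): 3e-4,  # acoustically a slightly better fit
    ("recognize", "speech"): 1e-4,
}
prior = {
    ("wreck", "a", "nice", "beach"): 1e-5,  # but an unlikely word sequence
    ("recognize", "speech"): 1e-2,
}

def map_decode(candidates):
    """Return argmax_W P(A|W) * P(W), computed in log space for stability."""
    return max(candidates,
               key=lambda w: math.log(acoustic[w]) + math.log(prior[w]))

best = map_decode(list(acoustic))
print(" ".join(best))  # -> recognize speech
```

Here the language model prior overrides the marginally better acoustic score of the competing hypothesis, which is precisely the intended effect of combining the two terms in Equation (1).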
There have been some recent attempts to separate the speech production part from the source-channel model by incorporating knowledge about the human speech production mechanism. Knowledge about the transducers used for capturing speech and the channel used for transmitting speech can also be explicitly modeled. However, the effectiveness of such approaches has yet to be shown.

[Figure 1. Source-channel model of speech generation and speech recognition: a word sequence W passes through a noisy channel to become the speech signal S; channel decoding maps S back to a word sequence.]

In the following sections we first briefly review the statistical pattern recognition approach to speech recognition. We then describe the two most important techniques that have helped to advance the state of the art of automatic speech recognition, namely hidden Markov modeling (HMM) of the speech signal and dynamic programming (DP) methods for best path decoding of structural lexical networks. We next discuss several ASR systems and some real-world applications. Finally, we address ASR system design considerations and present a number of ASR research challenges we need to overcome in order to deploy natural human-machine interactive speech input/output systems.

2. Pattern Recognition Approach

A block diagram of an integrated approach to continuous speech recognition is shown in Figure 2. The feature analysis module provides the acoustic feature vectors used to characterize the spectral properties of the time varying speech signal. The word-level