N-gram Models for Language Detection

of 14
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Related Documents
information about N grams
  N-gram models for language detection Carlos  Ramisch M2R Informatique - Double diplˆome ENSIMAG – UJF/UFRIMAUE Ing´eni´erie des Langues et de la Parole December 21, 2008 Abstract In this document we report a set of experiments using  n-gram lan-guage models   for automatic  language detection   of text. We will startwith a brief explanation of the concepts and of the mathematics behindn-gram language models and discuss some applications and domains inwhich they are widely used. We will also present an overview of relatedwork in language detection. Then, we will describe the resources used inthe experiments, namely a subset of the  Europarl   corpus and the  SRILM toolkit. We will then perform a toy experiment in order to explain indetail our methodology. Afterwards, we will evaluate the performance of different language models and parameters through a precision measurebased on the perplexity of a text according to a model. We concludethat n-gram models are indeed a simple and efficient tool for automaticlanguage detection. Contents 1 Introduction 22 Related Work 33 Resources 4 3.1 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43.2 Language Model Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 4 Experimental Setup and Results 6 4.1 Toy Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74.2 Extensive Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 84.2.1 Experiment 1: Model Type . . . . . . . . . . . . . . . . . . . . 94.2.2 Experiment 2: Size of Training Set . . . . . . . . . . . . . . . . 104.2.3 Experiment 3: Order of the Model . . . . . . . . . . . . . . . . 104.2.4 Experiment 4: Size of Test Sentences . . . . . . . . . . . . . . . 11 5 Conclusions 12 1  1 Introduction In this report, we are interested in creating a system for  language identification  .We define it as special type of text categorisation where, given a sample of text,the system needs to define in which natural language this text was written.For instance, If the following sentences are input into the system, the expectedoutput would be German, English and French: Das Protokoll der gestrigen Sitzung wurde verteilt.The Minutes of yesterday’s sitting have been distributed.Le proc`es-verbal d’hier a ´et´e distribu´e. The intuition behind the solution of this problem is that we are able tobuild a model for each language that we would like to identify. Then, we cancompare an input text with each one of the models and only keep the languagefor which the distance between the text and the model is minimal. For example,if a given sentence is very close to the model of French and far enough of themodels for other languages, the system will say that this sentence was written inFrench. In this work, we use  n-gram language models   and our distance measureis  perplexity  . Formally, we can state our underlying hypothesis as follows: Main Hypothesis  Given a set of languages  LS   and a text  T  , we can detectthe language of   T   by taking the language  ls i  ∈  LS   for which the perplexity of  T   according to an n-gram model of   ls i  is minimal.An  n-gram   is any sequence of (contiguous) words in a text. N-gram languagemodels are a class of models that assign a probability for each n-gram of alanguage, for a fixed value of   n , based on the Markov assumption that theoccurrence of a word only depends on a short history (the  n − 1 previous words).This probability may be estimated from a corpus (the  training corpus  ), usingthe Maximum Likelihood Estimator (MLE), e.g. in section 4.1, the trigram monsieur le pr´esident   appears 7,738 times in a corpus containing  N   = 428 , 106trigrams, and therefore it has a probability of  P  MLE  ( w i ...w n ) =  C  (monsieur le pr´esident) N   = 7738428106 = 1 . 8%However, an n-gram that does not appear in the training corpus will havea probability of 0, under the assumption that the training corpus contains allpossible n-grams for a given language. This assumption is not true, due to thesparseness of language and to the limited availability of corpora. Hence, wewould like to assign a low (but non-zero) probability for unseen n-grams. Chenand Goodman [1996] present a set of statistical  smoothing   techniques for n-gramlanguage models that can solve this problem. The technique used in this reportis Good-Turing discounting with back-off smoothing [Katz, 1987], according tothe inductive definition: P  BO ( wi...w n ) =  (1  − d w 1 ...w n − 1 ) P  MLE  ( w i ...w n ), for seen n-grams α w 1 ...w n − 1 P  BO ( w 2 ...w n ), for unseen n-grams2  The factor  d  is calculated using Good-Turing discounting and the normali-sation factor  α  makes the probability mass left over in the discounting processto be distributed among unseen n-grams. This is explained in detail by Katz[1987] and by Manning and Sch¨utze [1999].The  perplexity   of a sequence of words  w 1  through  w L  (the  test set  ) accordingto an n-gram model  BO  is defined as 2 H  ( w 1 ...w L ,BO ) , where  H   is the  cross-entropy   between the text and the model, according to the information theoryformula: H  ( w 1 ...w L ,BO ) = 1 L − n + 1 L − n  l =0 log 2 P  BO ( w 1 ...w l + n ) , where the test data contains L − n +1 n-grams and the probabilities are estimatedaccording to the back-off probabilities described above.The initial motivation for n-gram models comes from Speech Processing: n-gram language models are especially useful for speech recognition. Nowadays,n-gram models are used in a wide range of NLP applications. Generative ma-chine translation systems use it explicitly to verify the fluidity of the translationhypotheses, while descriptive translation models generally have one or more fea-tures corresponding to an n-gram language model. Reranking techniques for theparametrisation of the translation model in MERT-like learning algorithms of-ten rely on a language model in order to distinguish the best among a list of the n-best translations. Another task that often uses n-gram models is Part-Of-Speech (POS) tagging. State-of-the-art POS taggers are trained over annotatedcorpora and learn some kind of conditional n-gram model in which the proba-bility of a tag sequence  t 1 ...t m  for a given sentence  w 1 ...w m  depends on thecurrent word but also on the  n  previous tags (Markov models): P  ( t 1 ...t m | w 1 ...w m ) = m  i =1 P  ( w i | t i ) P  ( t i | t i − n +1 ...t i − 1 )The remainder of this report is structured as follows: section 2 discussessome related work on language identification, section 3 describes the resourcesused further in section 4, where we describe and discuss the results of eachexperiment. The overall conclusions and final comments are presented in section5. 2 Related Work Most of the literature in automatic language identification uses some variationof character-based n-gram model. Cavnar and Trenkle [1994] use a language n-gram profile  , built upon the most frequent character n-grams in a text, with nrunning from 1 to 5. Then, they use an  ad hoc   “out-of-place” ranking distancemeasure to classify a newly arrived text into one of the existing profiles. Theyreport 90% to 100% precision using a 300 n-grams profile, in detecting a text of 300 characters, and 95% to 100% for longer texts.3  Dunning [1994] states that real-life applications need a precise language clas-sification for texts up to 20 characters, showing the practical limitations of meth-ods based on characteristic sequences and common words. He also criticises theapproach of Cavnar and Trenkle [1994] since it depends on tokenisation anduses an empirical measure whose statistical properties are difficult to derive.Therefore, he proposes a method based on Markov chains and Bayesian models,similar to the one proposed here, that uses a log-probability sum as distancemeasure between a text and a language model.A comparison of trigram-based and short-word-based techniques for lan-guage identification is described by Grefenstette [1995]. He uses a non-smoothedMLE probability distribution, and then compares one technique in which thedistribution runs over trigrams and another in which the distribution runs over“common words” (5 letters or less, frequent words). The results show thattrigrams are better for short sentences and that when larger sentences are con-sidered, both methods perform equally well 1 .A short survey on language identification containing some pointers to otherrelated work can be found in Muthusamy and Spitz [1997]. Given that languageidentification is an extremely simple problem compared to other NLP fields, ithas the status of a “solved problem” that could be thought in undergraduatecourses [McNamee, 2005]. The solution presented in this report aims at achiev-ing state-of-the-art accuracy, that is to say correctly identify near to 100% of the sentences in a test set even for short sentences, using a relatively simpleapproach. 3 Resources In order to verify our hypothesis, we will perform a series of experiments usingn-gram language models. We will thus need two resources: a multilingual corpusfor training and testing the models, and a language model toolkit. The followingtwo sections present and describe these resources, that will be used throughoutthe remainder of this report. 3.1 Corpus Data-driven, statistical or empirical methods on Natural Language Processingare different names for the same very active research field. During the lastdecades, the NLP community proposed many techniques that try to extractrelevant information from corpora, mainly based on probabilistic tools. Thisreport studies a stochastic approach to language detection and hence uses amultilingual corpus upon which the n-gram language models are built.The  Europarl   corpus [Koehn, 1995] is a collection of parallel texts containingthe transcription of official speeches held at the Parliament of the European 1 This tool can be tested on 4
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks