BLEU Evaluation of Machine-Translated English-Croatian Legislation

Sanja Seljan¹, Tomislav Vičić², Marija Brkić³

¹ University of Zagreb, Faculty of Humanities and Social Sciences, Department of Information and Communication Sciences, Ivana Lučića 3, 10000 Zagreb, Croatia
² Freelance translator, 10000 Zagreb, Croatia
³ University of Rijeka, Department of Informatics, Omladinska 14, 51000 Rijeka, Croatia

Abstract

This paper presents work on the evaluation of a freely available online machine translation (MT) service, Google Translate, for the English-Croatian language pair in the domain of legislation. A total set of 200 sentences, for which three reference translations are provided, is divided into short and long sentences. Human evaluation is performed by native speakers, using the criteria of adequacy and fluency. For measuring the reliability of agreement among raters, Fleiss' kappa is used. Human evaluation is enriched by error analysis, in order to examine the influence of error types on fluency and adequacy, and to use it in further research. Translation errors are divided into several categories: non-translated words, word omissions, unnecessarily translated words, morphological errors, lexical errors, syntactic errors and incorrect punctuation. The automatic evaluation metric BLEU is calculated with regard to single and multiple reference translations. System-level Pearson's correlation between BLEU scores based on single and multiple reference translations is given, as well as the correlation between BLEU scores for short and long sentences, and the correlation between the criteria of fluency and adequacy and each error category.

Keywords: BLEU metric, English-Croatian legislation, human evaluation

1. Introduction

Evaluation of machine translation (MT) web services has gained considerable attention lately because of their increasingly widespread use for accessing information in a foreign language by students, researchers, patients, teachers, everyday users, etc. Comparisons between human and different automatic metrics, error analysis, and suggestions for improvement have become a logical follow-up. Although results of MT web services range from "laughably bad" to "a tremendous success" (Hampshire, 2010), most of them aim to achieve reasonably good quality (although the notion of "good quality" is a question per se). Assessment of machine-translated text is important for product designers, professional translators and post-editors, project managers and private users, as well as in education and research. The issue of "good" translation is often discussed, as is the lack of consensus on the various evaluation criteria (fluency, adequacy, meaning, severity, usefulness, etc.) and the subjectivity of the evaluation approach. Evaluation in MT research and product design can be done with the aim of measuring system performance (Giménez and Màrquez, 2010; Lavie and Agarwal, 2007) or with the aim of identifying weak points and/or adjusting parameter settings of different MT systems, or of a single system through different phases (Denkowski and Lavie, 2010a; Agarwal and Lavie, 2008). Moreover, the identification of weak points might contribute to quality improvement, especially for less-resourced languages and languages with rich morphology.

2. Related Work

Google Translate (GT), being a free web service, is included in almost every study on MT evaluation, especially because it offers translation from and into less widely spoken languages.
In the study presented by Khanna et al. (2011), a text from a pamphlet on the importance of health care for people with limited English proficiency is selected. The text is GT-translated from English into Spanish and then compared with a human professional translation. The study presented by Shen (2010) compares three web translation services: GT, a statistically-based translation engine; Bing (Microsoft) Translator, a hybrid statistical engine with language-specific rules; and Yahoo Babelfish, a traditional rule-based translation engine. While GT is preferred for longer sentences and for language combinations for which a huge amount of bilingual data is available, Microsoft Bing Translator and Yahoo Babelfish give better results on phrases of fewer than 140 characters and on some specific language pairs (e.g. Bing Translator on Spanish, German and Italian; Babelfish on East Asian languages). In Garcia-Santiago and Olvera-Lobo (2010), the quality of translating questions from German and French into Spanish by several MT services (GT, ProMT and WorldLingo) is analyzed. Dis Brandt (2011) presents an evaluation of three popular web services (GT, InterTran, Tungutorg) for translation from Icelandic into English. In the study presented by Kit and Wong (2008), several web translation services (Babel Fish, GT, ProMT, SDL free translator, Systran, WorldLingo), used by law library users for translating from 13 languages into English, are discussed. According to the research presented by Seljan, Brkić and Kučiš (2011), GT is the preferred online translation service for the Croatian language. It shows better results in the Croatian-English direction (in the domains of football, law and monitors) than in the English-Croatian direction (in the city description domain).

GT, a free MT web service, is provided by Google Inc. GT initially used a Systran-based translator. Many state-of-the-art MT systems use a rule-based approach, e.g. Systran, which requires long-term work by linguists and information scientists on grammars and vocabularies. GT employs a statistical approach and relies on huge quantities of monolingual text in the target language and of aligned bilingual texts. It applies machine learning techniques to build a translation model. GT translates between more than 60 languages. Translation from and into Croatian was introduced in May 2008.

3. Automatic Evaluation

Automatic evaluation metrics compare a machine-translated text to a reference translation. Their primary task is high correlation with human evaluation. Human evaluation is considered the "gold standard"; however, it is a time-consuming, very subjective and expensive task. Automatic evaluation metrics are generally fast and cheap, and have minimal human labour requirements; there is no need for bilingual speakers. However, currently used metrics do not differentiate well between very similar MT systems, and they give more reliable results on a whole test set than on individual sentences.

One of the most popular automatic evaluation metrics is BLEU (Bilingual Evaluation Understudy), proposed by IBM (Papineni et al., 2002), which has become a de facto standard for MT evaluation. BLEU matches translation n-grams with the n-grams of its reference translation and counts the number of matches at the sentence level; these sentence counts are then aggregated over the whole test set. The matches are independent of their position in a sentence. Adequacy is accounted for in word precision, while fluency is accounted for in n-gram precision. Recall is compensated for by a brevity penalty factor. The final BLEU score is the geometric average of the modified n-gram precisions and ranges from 0 to 1. According to Denkowski and Lavie (2010b) in the AMTA Evaluation Tutorial, BLEU scores above 0.30 generally reflect understandable translations, while scores above 0.50 reflect good and fluent translations.
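As an illustration only (not the evaluation script used in this study), the following sketch computes corpus-level BLEU against multiple references per sentence using NLTK; the Croatian tokens are invented placeholders, and smoothing is added because short hypotheses can otherwise produce zero higher-order n-gram counts.

```python
# Minimal sketch of corpus-level BLEU with three references per sentence,
# using NLTK; not the authors' actual evaluation pipeline.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Invented placeholder sentences (pre-tokenized, lowercased).
references = [[
    ["odluke", "se", "donose", "jednoglasno"],                # Ref1
    ["odluke", "moraju", "biti", "donesene", "jednoglasno"],  # Ref2
    ["odluke", "se", "moraju", "donijeti", "jednoglasno"],    # Ref3
]]
hypotheses = [
    ["odluke", "mora", "biti", "donesena", "jednoglasno"],    # MT output
]

# Geometric mean of modified 1- to 4-gram precisions, times brevity penalty.
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.4f}")  # matching any of the references raises the score
```

With multiple references, each n-gram count is clipped against the reference in which it occurs most often, which is why additional reference translations can only raise, never lower, the score.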
The BLEU metric, being statistically-based and language-independent, does not take into account morphological variants of a word, which is an important issue for inflective languages. The metric requires exact word matches, with all matches weighted equally. Due to the low correlation of BLEU scores with human adequacy and fluency judgments, Chiang et al. (2008) and Callison-Burch et al. (2006) recommend using BLEU for comparing similar systems or different versions of the same system, i.e. for what it was primarily designed. For the above stated reasons, an evaluation of translations from English into Croatian, a morphologically rich language, with multiple reference sets is conducted. Automatic metric scores are compared to human evaluation scores. Due to the need for qualitative evaluation, human evaluation is enriched by error analysis, which might be integrated into statistical approaches (Monti et al., 2011).

4. Experimental Study

4.1 Test Set Description

This research has been conducted on existing English-Croatian parallel corpora of legislative documents. These legislative documents are grouped according to the year of issue and contain "duplicates" (with minor amendments, corrections, etc.). In total, 200 unique source and reference translation pairs of different length and content have been chosen. However, some pre-processing has been deemed necessary on the Croatian side, regarding typos, misspellings and other common mistakes that somehow persist despite the reviews. Furthermore, additional pre-processing has been done on documents containing mostly tables and formulas, which are not usable for analysis. Out of the total of 200 source sentences, two groups have been distinguished: 100 short sentences (21 words or fewer) and 100 long sentences (between 22 and 61 words, inclusive). For each English sentence, three Croatian reference translations have been provided, the first being the "official" one (Ref1). MT translations have been obtained from GT. Statistics on the average number of words in the test set are given in Table 1.

# of sentences   Source   Ref1    Ref2    Ref3    GT
100 short        14.74    12.73   11.87   11.71   12.48
100 long         32.24    27.92   24.83   24.54   26.37
200              23.49    20.33   18.13   18.13   19.43

Table 1: Test set statistics.

The fact that Croatian is morphologically rich, unlike English, is reflected in the obvious difference in the number of words in translations compared to source sentences. On the other hand, each additional reference translation reduces the number of words by eliminating redundancies characteristic of legislative expressions, while fully preserving the meaning as well as the legislative tone.
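The length-based split and the per-column averages of Table 1 amount to a few lines of Python. A minimal sketch, in which the field names and whitespace tokenization are illustrative assumptions, not the study's actual data format:

```python
# Sketch of the short/long split (threshold: 21 source words) and the
# average word counts per column, as reported in Table 1.
def word_count(sentence: str) -> int:
    return len(sentence.split())

def split_by_length(pairs, threshold=21):
    """Short sentences have 21 source words or fewer; long ones have more."""
    short = [p for p in pairs if word_count(p["source"]) <= threshold]
    long_ = [p for p in pairs if word_count(p["source"]) > threshold]
    return short, long_

def average_lengths(pairs, keys=("source", "ref1", "ref2", "ref3", "gt")):
    """Average word count per column (source, three references, GT output)."""
    return {k: round(sum(word_count(p[k]) for p in pairs) / len(pairs), 2)
            for k in keys}
```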
4.2 Human Evaluation

4.2.1 Profile of Evaluators

Students make up 88.64% of the evaluators; of those, 86.36% are in the final year of their undergraduate studies and 13.5% are attending graduate studies. The remaining 11.36% of evaluators have finished their studies, mostly 0-7 years ago. The self-evaluation of English language knowledge according to the Common European Framework of Reference for Languages is as follows: 0% have self-evaluated themselves at level A1, 4.55% at A2, 15.91% at B1, 47.73% at B2, 22.73% at C1 and 9.09% at C2. The average self-evaluation grade in Croatian, their native language, is 4 (on a 1-5 scale). Regarding translation experience, 72.73% of evaluators translate for private purposes, 6.82% professionally, 9.05% have no professional experience yet but are preparing for it (hence their high level of language proficiency), and 11.36% are not involved in translation. Regarding experience with translation tools, 60% of evaluators have already used free web translation services (GT, Systran, Babel Fish), 6% have used translation memories (SDL, Atril, Wordfast), and 6% combine professional and free translation tools; 25.4% still translate in the classic way, by typing directly into a text editor. Of the evaluators who use translation technology, 60% would like to take specialization courses, and 32% have already taken courses on the use of translation tools. When translating unknown words or syntactic structures, 40.19% use a web service, 28.04% a hard-copy dictionary, 21.50% an electronic dictionary, 5.54% a translation memory, and 3.74% a terminology database or glossary.

4.2.2 Adequacy and Fluency

Human evaluation has been performed by native speakers of Croatian on a 1-5 scale using the criteria of fluency and adequacy. An online survey has been prepared for the separate evaluation of fluency and adequacy, for short as well as long sentences, in sets of 25, the total number of sentences being 200. The survey consists of 4 polls per group and per evaluation criterion. Fluency refers to grammaticality and "natural" sounding, while adequacy checks whether any part of the message has been lost or distorted. Fluency has been evaluated on the following scale: Incomprehensible (1), Barely comprehensible (2), So-so, in-between good and bad (3), Very good (4), Impeccable (5). For evaluating adequacy, the following grades have been offered: Insufficient/inadequate/wrong information (1), Barely enough information (2), Intermediate level of information preserved (3), Very good but not complete (4), Complete information preserved (5). As presented in Table 2, short sentences have obtained higher average grades than long ones according to both criteria (about 20% higher for fluency and about 15% higher for adequacy).

4.2.3 Fleiss' Kappa

Fleiss' kappa is a measure used for assessing inter-rater agreement:

\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}    (1)

where \bar{P} is the mean per-item agreement actually observed among raters and \bar{P}_e is the agreement expected by chance. The numerator thus gives the degree of agreement actually achieved above chance, and the denominator the degree of agreement attainable above chance. The score lies on a -1 to 1 scale, where 1 indicates perfect inter-rater agreement, 0 is exactly what would be expected by chance, and negative values indicate agreement below chance. The interpretation of values is given in Table 3, and the results are presented in Table 4. Fleiss' kappa shows almost perfect agreement for the criterion of fluency, for all sentences. The evaluation according to the criterion of adequacy shows a substantial level of inter-rater agreement.
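Equation (1) is straightforward to compute from a ratings matrix. A minimal sketch, with rows for sentences, columns for the five grades, and each cell counting how many evaluators chose that grade (the numbers below are a toy example, not the study's data):

```python
# Minimal sketch of Fleiss' kappa, following equation (1).
def fleiss_kappa(counts):
    N = len(counts)      # number of rated items (sentences)
    n = sum(counts[0])   # raters per item (assumed constant across items)
    k = len(counts[0])   # number of rating categories (here, grades 1-5)

    # Proportion of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]

    # Per-item agreement P_i, then mean observed agreement P-bar.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N

    # Agreement expected by chance, P-bar_e.
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 3 sentences, 4 raters, 5-point scale.
ratings = [
    [0, 0, 1, 3, 0],   # three raters gave grade 4, one gave grade 3
    [0, 0, 0, 4, 0],   # unanimous grade 4
    [0, 1, 2, 1, 0],
]
print(f"kappa = {fleiss_kappa(ratings):.3f}")
```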
# of sentences   Fluency   Adequacy   Average
100 short        3.40      3.56       3.48
100 long         2.86      3.13       3.00
Average          3.13      3.35       3.24

Table 2: Average human grades for the fluency and adequacy criteria.

κ            Interpretation
< 0          poor agreement
0.01 – 0.20  slight agreement
0.21 – 0.40  fair agreement
0.41 – 0.60  moderate agreement
0.61 – 0.80  substantial agreement
0.81 – 1.00  almost perfect agreement

Table 3: Interpretation of Fleiss' kappa values.

                 Fluency   Adequacy   Average
Short sentences  0.90      0.67       0.785
Long sentences   0.85      0.72       0.785

Table 4: Fleiss' kappa on human evaluation.

4.2.4 Error Analysis

Human evaluation is enriched by error analysis, in order to examine the influence of error types on fluency and adequacy, and to use it in further research. Two professional translators, who did not participate in the first part of the study, have been engaged in the error analysis; their evaluations have proven exactly the same for all 200 sentences. They have reported the number of errors in the GT output compared to the first professional reference set. The error categories and error examples are given in Table 5, and the total number of errors per category in Table 6. Errors from several different categories often appear in the same sentence. As presented in Table 6, morphological errors are by far the most frequent, with 2.26 errors per sentence on average. Short sentences have on average 1.24 morphological errors per sentence, while this number more than doubles in long sentences, to 3.28 errors per sentence. For the other categories, there is about one error or less per sentence. In descending order according to the number of errors, the categories are: morphological errors, lexical errors, syntactic errors, surplus of words, omissions and non-translated words, and, lastly, punctuation.

# of sentences   Omissions   Surplus   Morphological   Lexical   Syntactic   Punctuation
100 short        0.27        0.27      1.24            0.73      0.50        0.09
100 long         0.59        0.61      3.28            1.19      1.17        0.37
200              0.43        0.44      2.26            0.96      0.84        0.23

Table 6: Error categories and number of errors per category.

Not translated / omitted words: "Administration requiring the ships" translated as "Administracija zahtijeva brodova" instead of "Uprava koja od brodova zahtijeva" or "Administracija koja zahtijeva od brodova".
Surplus of words in translation: "There may be cases" translated as "Postoji svibanj biti slučajevi" instead of "U nekim slučajevima" or "Postoje slučajevi" (this example also contains morphological and lexical errors).
Morphological errors / suffixes: "Decisions … should be taken unanimously" translated as "Odluke … mora biti donesena jednoglasno" instead of "Odluke … moraju biti donesene jednoglasno".
Lexical errors (wrong translation): "There may be cases" translated as "Postoji svibanj biti slučajevi" ("svibanj" is the month of May) instead of "U nekim slučajevima" or "Postoje slučajevi".
Syntactic errors (word order): "Steps should therefore be taken" translated as "Koraci stoga treba poduzeti" instead of "Stoga treba poduzeti korake".
Punctuation errors: very rare; occasionally a comma is omitted or misplaced.

Table 5: Error categories and error examples.

[Figure 1: Correlation between fluency and adequacy criteria and error type. Bar chart of per-category correlations (non-translated words, surplus of words, morphological, lexical, syntactic and punctuation errors) against fluency and adequacy; all twelve values are negative, roughly between -0.10 and -0.55.]
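The per-category averages in Table 6 reduce to a simple tally over the annotated sentences. A minimal sketch, assuming a per-sentence dictionary of error counts (an invented data structure, not the annotators' actual format):

```python
# Sketch of the per-category error averages reported in Table 6.
CATEGORIES = ["omissions", "surplus", "morphological",
              "lexical", "syntactic", "punctuation"]

def average_errors(annotated_sentences):
    """annotated_sentences: list of dicts mapping category -> error count."""
    n = len(annotated_sentences)
    return {cat: sum(s.get(cat, 0) for s in annotated_sentences) / n
            for cat in CATEGORIES}

# Toy input: two annotated sentences.
sentences = [
    {"morphological": 3, "lexical": 1, "syntactic": 1},
    {"morphological": 1, "omissions": 1, "punctuation": 1},
]
print(average_errors(sentences))
```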
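The correlations behind Figure 1 are plain Pearson coefficients between per-sentence error counts and average human grades. A minimal sketch with made-up numbers, not the study's data:

```python
# Sketch of the per-category Pearson correlation plotted in Figure 1.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither series is constant

# Illustrative per-sentence morphological error counts vs. fluency grades.
morph_errors = [0, 1, 3, 2, 5, 0, 4]
fluency =      [5, 4, 2, 3, 1, 5, 2]
print(f"r = {pearson(morph_errors, fluency):.2f}")  # strongly negative here
```

Repeating this for each of the six error categories, against both the fluency and the adequacy grades, yields the twelve negative correlations summarized in Figure 1.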