Study and Comparison of Rule-Based and Statistical Catalan-Spanish Machine Translation Systems

Marta R. Costa-Jussà

Barcelona Media Innovation Center
Av. Diagonal 177, 08018 Barcelona, Spain
Mireia Farrús

Universitat Oberta de Catalunya
Av. Tibidabo, 47. 08035 Barcelona, Spain
Jose B. Mariño

Universitat Politècnica de Catalunya, TALP Research Center
Jordi Girona 1-3, 08034 Barcelona, Spain
Jose A. R. Fonollosa

Universitat Politècnica de Catalunya, TALP Research Center
Jordi Girona 1-3, 08034 Barcelona, Spain

keywords: Rule-based machine translation, statistical machine translation, Catalan, Spanish

Machine translation systems can be classified by their core methodology into rule-based and corpus-based approaches. Since both paradigms have been widely used in recent years, one of the aims of the research community is to understand how these systems differ in translation quality. To this end, this paper reports a study and comparison of several Catalan-Spanish machine translation systems: two rule-based and two corpus-based (specifically, statistical) systems, all of them freely available on the web. Translation quality is analysed in two different domains: journalistic and medical. The systems are evaluated using standard automatic measures, as well as by native human evaluators. In addition to these traditional evaluation procedures, this paper reports a novel linguistic evaluation, which provides information about the errors encountered at the orthographic, morphological, lexical, semantic and syntactic levels. Results show that while rule-based systems perform better at the orthographic and morphological levels, statistical systems tend to commit fewer semantic errors. Furthermore, the results show some degree of correlation among all the evaluations performed, and human evaluators tend to be especially critical of semantic and syntactic errors.

mathematics subject classification 2000: 68, 68T50

reference: Vol. 31, 2012, No. 2, pp. 245–270

Computing and Informatics

formerly Computers and Artificial Intelligence
