Visit by Serge Sharoff

In May and June 2019, the labex TransferS and Thierry Poibeau (Lattice) host Serge Sharoff, associate professor (maître de conférences) at the School of Languages, Cultures and Societies of the University of Leeds.

Natural language processing and multilingualism

Although there are more than 6,000 languages in the world, electronic resources that make it possible to develop high-performing analysers (syntactic or semantic) exist for only about fifty of them. Even then, data in sufficient quantity to “train” automatic processing systems is available for only a minority of those fifty languages. To get around these problems, researchers are now developing systems based on “multilingual” representations of information. Although this may seem counter-intuitive at first, it is possible to obtain a high-performing system for a language X by transferring knowledge obtained by and for a language Y (in practice, modern processing systems of this type use several languages at once). The question of multilingual knowledge representation systems, which is generally approached from a purely “engineering” point of view, deserves a multidisciplinary perspective. Serge Sharoff will discuss in particular the impact of research in linguistic typology and the impact of language contact on natural language processing. For example, although Komi, a Finno-Ugric language of northern Russia, does not have the same origin as Russian at all, contact between the two languages and the bilingualism of all Komi speakers have naturally made the language highly porous, and it has directly transposed structures from Russian into Komi.

Tuesday 14 May: Evaluation and usability of machine translation

Salle Cavaillès, from 10 a.m.

Translation quality evaluation: MT vs Human translation

Serge Sharoff (Univ. Leeds)

In modern life we are surrounded by translations from other languages, some of which are unreliable. This talk investigates the task of detecting low-quality human translations automatically. The task is important in many applications, such as translation training, screening candidates or monitoring translation submissions, yet few resources are available for training Machine Learning models for it. In my talk, I will show how to approximate a proper training corpus with a composite one created from low-quality MT outputs and good-quality human translations.
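The composite corpus idea can be illustrated with a minimal sketch. It assumes that low-quality MT outputs stand in for bad translations (label 0) and good human translations serve as positive examples (label 1); the function name, the labels and the shuffling seed are hypothetical choices for illustration, not details taken from the talk.

```python
import random

def build_composite_corpus(low_quality_mt, good_human, seed=0):
    """Merge two sources into one labelled training set.

    Assumption (not from the talk): label 0 marks low-quality MT output
    used as a proxy for poor translations, label 1 marks good human
    translations.
    """
    examples = [(text, 0) for text in low_quality_mt]
    examples += [(text, 1) for text in good_human]
    # Shuffle deterministically so the two sources are interleaved.
    random.Random(seed).shuffle(examples)
    return examples

corpus = build_composite_corpus(
    ["bad mt a", "bad mt b"],   # hypothetical low-quality MT outputs
    ["good human"],             # hypothetical good human translation
)
```

The resulting list of (text, label) pairs could then be passed to any binary classifier; which features and model work best is left open here.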

Post-editing machine translation: MT technologies in real-life use scenarios

Hanna Martikainen (CLILLAC-ARP, Univ. Paris Diderot)

It is generally acknowledged that machine-translated output is of sufficient quality today for commercial use with post-editing, and the technology is being integrated into translation workflows in various settings (Koponen 2016). With the recent advent of neural MT and the undeniable advances in fluency it has brought about, this trend is expected to grow even stronger. However, automatic and human evaluation metrics of MT often yield inconsistent results on quality (see for instance Castilho et al. 2017), and discrepancies between automatic metrics such as HTER scores and perceived post-editing effort as well as post-editing time have been observed (see for instance Koponen et al. 2019). In this talk, I will present some real-life scenarios of MT integration into translation workflows in professional as well as educational settings and discuss end-users’ perception of MT. I will seek to determine what kind of factors are known to influence the use and usefulness of MT in actual settings and explore the different parameters that affect it, with a specific focus on the emerging paradigm of neural MT.

Monday 20 May

Lattice, room 512, Montrouge, 11 a.m.

Text typology vs text topology: reliable detection of genres

Serge Sharoff (Univ. Leeds)

There are different kinds of texts on the Web, from FAQs to shopping pages to journalism to blogs. Different sorts of pages have quite different uses and characteristics. The talk will present a topological approach to text typology in which the texts are described in terms of their similarity to prototype genres. The suggested set of categories is designed to be applicable to any text on the Web and to be reliable in annotation practice. Reliably annotated texts also provide the basis for automatic genre classification.
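As a toy illustration of describing a text by its similarity to prototype genres, the sketch below scores a text against one example per genre using cosine similarity over word counts. The bag-of-words representation, the function names and the example prototypes are all assumptions made for illustration; the talk does not specify how similarity is measured.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def genre_profile(text, prototypes):
    """Describe a text by its similarity to each prototype genre,
    rather than assigning it to a single category."""
    words = bag_of_words(text)
    return {genre: cosine(words, bag_of_words(example))
            for genre, example in prototypes.items()}

# Hypothetical one-line prototypes, for illustration only.
prototypes = {
    "faq": "how do i reset my password",
    "shopping": "add to cart buy now price",
}
profile = genre_profile("how do i buy", prototypes)
```

Here the text gets a graded score for every genre, so a page that mixes FAQ and shopping features is positioned between the two prototypes instead of being forced into one class.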

[This seminar will be preceded by another seminar given by Ismael Ramos Ruiz, from the University of Caen]

Wednesday 5 June, morning: Multilingual approaches and cross-lingual transfer in NLP (natural language processing)

Salle Celan, from 9:30 a.m.

Language adaptation: exploiting similarity between the languages in NLP models

Serge Sharoff (Univ. Leeds)

Some languages have very few NLP resources, while many of them are closely related to better resourced languages. This talk explores how the similarity between the languages can be utilised by porting resources from better to lesser resourced languages. This can be achieved by combining cross-lingual embedding methods with a lexical similarity measure which is based on detection of cognates. I show that the resulting embedding space helps in applications such as morphological prediction and Named Entity Recognition, when a model is trained using data from better resourced languages and is applied to lesser resourced ones.
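One simple way to picture the combination of embedding similarity with a cognate-based lexical measure is sketched below. It assumes that cognate likelihood is approximated by normalised Levenshtein similarity and that the two scores are mixed linearly with a weight `alpha`; both choices, like the function names and the toy vectors, are illustrative assumptions rather than the method presented in the talk.

```python
import math

def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def lexical_similarity(a, b):
    """1.0 for identical strings, 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def combined_score(word_a, vec_a, word_b, vec_b, alpha=0.5):
    """Hypothetical linear mix of embedding and cognate similarity."""
    return (alpha * cosine(vec_a, vec_b)
            + (1 - alpha) * lexical_similarity(word_a, word_b))

# Toy example: two similar word forms with made-up 2-d embeddings.
score = combined_score("voda", [1.0, 0.0], "woda", [1.0, 0.0])
```

A higher `alpha` trusts the embedding space more, while a lower value leans on surface similarity, which may matter most when the lesser resourced language has too little text to train reliable embeddings.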

[Followed by two other talks by Laurent Besacier (Univ. Grenoble Alpes) and Marie Candito (Univ. Paris Diderot)]

Wednesday 5 June, afternoon (room to be announced)

Natural Language Processing in Russia today, an Overview

Informal workshop, discussion with Serge Sharoff


Serge Sharoff joined the University of Leeds, UK, in 2003 after obtaining his PhD in 1997 from the Moscow Lomonosov State University and holding postdoctoral appointments at the Russian Research Institute for Artificial Intelligence (1997-2000) and a Humboldt Research Fellowship at the University of Bielefeld (2001-2002). His research focuses on Natural Language Processing, including automated methods for collecting corpora from the web, their analysis in terms of domains and genres, and the extraction of lexicons and terminology from corpora. The application domains for this kind of research in the Digital Humanities include text annotation, information retrieval, machine translation and computer-assisted language learning. His research stresses the inherent multilinguality of NLP, which implies that tools and resources can be ported across languages.


Free admission, subject to available seating

Tuesday 14 May 2019, from 10 a.m.
ENS, 45 rue d’Ulm, 75005
Salle Cavaillès (1st floor, staircase A)
Monday 20 May 2019, 11 a.m.
Lattice laboratory,
1 rue Maurice Arnoux 92120 Montrouge
room 512
Wednesday 5 June 2019, from 9:30 a.m.
ENS, 45 rue d’Ulm, 75005
Salle Celan (ground floor, staircase A)
