PSL Project “LAKME”

The LAKME Project is funded by PSL Research University (“Investissements d’avenir” ANR-10-IDEX-0001-02 PSL*).

Project Summary

LAKME is a project dedicated to the automatic production of linguistically annotated corpora. Textual corpora are now widely available, including for ancient and under-resourced languages. From a linguistic point of view, however, these corpora are of little use unless they are enriched with linguistic information that allows the researcher to go beyond purely surface patterns. At the same time, machine learning techniques and natural language processing (NLP) have made considerable progress, so that it is now possible to analyse texts accurately, at least at the morphosyntactic and syntactic levels. Most research so far has been done on English (and other Indo-European languages), and much more remains to be done on other languages, especially morphologically rich ones. This project aims to develop new machine learning methods for text annotation. The targeted languages are Hebrew, French (especially Medieval French) and Uralic languages.

The project is based on a strong collaboration between three PSL institutions (ENS, EPHE, ENC) that want to develop several joint projects in the field of Digital Humanities, in research as well as in teaching (cf. the e-philology proposal). The project is unique in its multidisciplinary approach, bringing together researchers from computer science, linguistics and textual scholarship, with the goal of contributing original results in each of these domains.

Languages and corpora considered

The project will mainly address three languages or groups of languages:

  • Hebrew, especially early rabbinic Hebrew (ca. 3rd to perhaps 4th century)
  • French, with a diachronic perspective. We will especially address Medieval French (10th to early 15th century) and Classical French (16th-17th century)
  • To a lesser extent, some Uralic languages may also be addressed for pilot studies.

We want to address morphologically rich languages. We also want to contrast modern languages with their more ancient counterparts, so as to be able to develop diachronic studies (for example, how to explain the shift from free to fixed word order in the history of French).

From a technical point of view, it has been observed that languages attested at different periods of time pose crucial challenges to analysers and parsers, in part, but not only, because of non-standardized orthography. Applying a state-of-the-art analyser of Modern Hebrew to Rabbinic Hebrew has so far led to mixed results. Many words are unknown, and the techniques used for word guessing (i.e. dynamically categorizing unknown words based on their internal structure or external context) seem to be less effective on Rabbinic Hebrew.
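To make the idea of word guessing concrete, here is a minimal sketch in Python of one common family of heuristics: guessing the category of an unknown word from its ending. It is only an illustration under stated assumptions and does not describe the analyser used in the project. The function names (`train_suffix_guesser`, `guess_tag`), the toy lexicon, the tag labels and the fallback tag are all hypothetical.

```python
from collections import Counter, defaultdict


def train_suffix_guesser(tagged_words, max_suffix_len=3):
    """Count which tags occur with each word-final suffix in a tagged lexicon."""
    suffix_tags = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_suffix_len, len(word)) + 1):
            suffix_tags[word[-n:]][tag] += 1
    return suffix_tags


def guess_tag(word, suffix_tags, max_suffix_len=3, default="NOUN"):
    """Return the most frequent tag seen with the longest known suffix of `word`."""
    for n in range(min(max_suffix_len, len(word)), 0, -1):
        counts = suffix_tags.get(word[-n:])
        if counts:
            return counts.most_common(1)[0][0]
    # No suffix of the word was seen in training: fall back to a default tag.
    return default


# Toy lexicon, invented for illustration only (not project data).
LEXICON = [
    ("chantons", "VERB"), ("parlons", "VERB"), ("finissons", "VERB"),
    ("lentement", "ADV"), ("doucement", "ADV"),
    ("nation", "NOUN"), ("action", "NOUN"),
]

model = train_suffix_guesser(LEXICON)
print(guess_tag("rapidement", model))  # unknown word ending in "ent"
print(guess_tag("mangeons", model))    # unknown word ending in "ons"
```

A guesser of this kind depends on word endings being spelled consistently, which is one plausible reason why such techniques transfer poorly to texts with non-standardized orthography.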

Uralic languages will serve as a testbed for evaluating the robustness of the algorithms developed. These languages are included because of a specific interest at Lattice, in collaboration with partners in France (Inalco) and abroad (Moscow, Helsinki).

Partners

This project will involve three main partners from PSL, as well as other international partners who will bring invaluable expertise in natural language processing and linguistic studies.

The three PSL partners are:

  • LATTICE (Laboratoire Langues, textes, Traitements informatiques et Cognition, UMR 8094) of the École normale supérieure. LATTICE is a research unit specialized in linguistics and natural language processing. Within the project, LATTICE will be the main centre for the development of new machine learning techniques for text annotation and will participate in the annotation of Medieval French. The lab will also develop tools for the automatic analysis of Uralic languages.
  • The programme for digital humanities of the EPHE, with its director, Daniel Stökl Ben Ezra, a member of the Mondes sémitiques team at Orient et Méditerranée (UMR 8167), which specializes in ancient Semitic languages. This team will be responsible for preparing the Rabbinic Hebrew reference corpus and for analysing the results provided by the automatic tool.
  • ENC – Centre Jean Mabillon (EA 3624, dir. Olivier Poncet), specialized in historical sciences, particularly in the study of historical, philological and legal sources, and of written production from the Middle Ages to the present day: its creation, transmission and edition. A major focus of the Centre Jean Mabillon, along with the Éditions électroniques de l’École des chartes (ÉLEC), is the elaboration of digital editions. Editions in progress by researchers of this team, such as Frédéric Duval, will also be included in the corpora.
    ENC – Master “Technologies numériques appliquées à l’histoire” (academic lead: Jean-Baptiste Camps), which trains master’s students in digital editing.