Digital Humanities (DH) is an area of scholarly activity at the intersection of computing and the disciplines of the humanities. At LATTICE, we have been active in the field for several years. We are specifically interested in producing new natural language processing techniques for the different fields of the Humanities and the Social Sciences. The lab has developed two complementary lines of research:
- the use of Digital Humanities techniques for linguistics. We have developed techniques to automatically enrich corpora with different levels of annotation (part-of-speech, syntax, semantics). Outputs include the SEM parser, which provides part-of-speech annotations, chunks and named entities for contemporary French texts. This parser is currently being adapted to other languages, including Russian and morphologically rich Finno-Ugric languages. The lab has also developed a series of resources, especially annotated corpora. A major recent output is the Syntactic Reference Corpus of Medieval French (SRCMF), which covers the period from 842 to the end of the 13th century and contains about 251,000 words with syntactic annotations. It is the first corpus of this size for Medieval French to be syntactically annotated and manually checked. The corpus will be released online soon.
- the use of natural language processing (NLP) techniques for different Digital Humanities areas. Ongoing projects cover a wide range of topics, from the social sciences to the Humanities. For example, we have been collaborating with the UCL Centre for Digital Humanities on the Transcribe Bentham corpus. In this project, we aim to analyse the texts included in the corpus, extract their main topics, cluster related texts together and provide original and meaningful visualisations of the structure of the corpus. Another project deals with Climate Negotiation analysis. In this context, our system identifies the points supported and opposed by negotiating actors and extracts key concepts from those points. The results are displayed in a dedicated interface that allows the positions of different actors to be compared.
We have recently moved to the analysis of literary texts. Recent projects include a collaboration called "Distant Rhythm" with UNED in Madrid (the Open University of Spain). Our goal here is to automatically detect enjambments in four centuries of Spanish sonnets. We are also working on the automatic recognition of characters in French novels (entity recognition and linking, as well as the recognition of characters who are not named in the text but are referred to by a designation such as their function or their job). The goal is to examine whether there are patterns in the way characters appear, converse and interact in novels.
There are many open challenges we would like to tackle in the future, such as analysing how French authors' websites evolve over time, automatically detecting alexandrines in French prose, or analysing metaphors in novels, which is a hard task from a computational point of view. In brief, we are interested in any question that is challenging from a computational point of view, interesting from a scholarly point of view, and that exploits the richness of the corpora available nowadays.
Our research has been presented and published mainly at linguistics conferences for the first line of research, and at the Digital Humanities conferences (the main forum for research in the domain) for NLP applied to DH issues. We are also preparing extended versions of these publications for specialized journals. A selection of references is given below.
Recent and Current projects
- The ANR-DFG SRCMF project. The Syntactic Reference Corpus of Medieval French (SRCMF) was financed by the Agence nationale de la recherche (ANR) and the Deutsche Forschungsgemeinschaft (DFG) between 2010 and 2013 (principal investigators: Sophie Prévost, LATTICE, and Achim Stein, University of Stuttgart). The SRCMF is the first dependency treebank for Medieval French. It consists of syntactically annotated parts of two text corpora of Medieval French: the Base de Français Médiéval (BFM) and the Nouveau Corpus d’Amsterdam (NCA). Texts covering the Old French period from 842 to the end of the 13th century and containing about 251,000 words were annotated manually and published along with the tools and documentation presented on the project website.
- LAKME is a PSL-funded project exploring new NLP techniques (especially machine learning techniques) to annotate corpora of scholarly interest. The project focuses on morphologically rich languages, which are especially challenging for current NLP systems. Three languages (or groups of languages) are considered: Rabbinic Hebrew, Medieval French and some Uralic languages (especially Finnish, Komi and Udmurt). The project is a collaboration between LATTICE (PI, Thierry Poibeau), the Ecole Pratique des Hautes Etudes (Daniel Stoekl Ben Ezra) and the Ecole Nationale des Chartes (Jean-Baptiste Camps).
- The ANR DEMOCRAT project also contributes to research in Digital Humanities by proposing new methods for the automatic annotation of coreference chains in texts (mostly Medieval and contemporary French texts).
- UCL Centre for Digital Humanities. We have been collaborating with DH@UCL since 2014 through the Bentham Project. UCL has provided the Transcribe Bentham corpus, and LATTICE has developed text mining and content analysis tools to extract key information from the corpus. See our publications and the online demo.
- Digital Humanities Innovation Lab (LINHD) at the Universidad Nacional de Educación a Distancia (UNED, the Open University of Spain) in Madrid. With UNED, we have started a collaboration on a collection of Spanish poems spanning four centuries.
- The Institut für Linguistik/Romanistik of the University of Stuttgart. Together, we have developed NLP tools for the automatic analysis and annotation of Medieval French corpora (see the SRCMF project).
- The Department of Theoretical and Applied Linguistics at the University of Cambridge. We work together with DTAL on the development of new NLP techniques for DH projects.
- Collaborations are expected to start soon with other labs in Europe.
- Within PSL, collaborations are ongoing with EPHE and ENC (see the LAKME project for more information). LATTICE is also one of the leading labs involved in the E-Philologie series of courses, which explores different facets of DH at the Master's and doctoral levels (ENS, EPHE, ENC, EHESS).
- We are also working with AOROC, a research unit specialized in archeology at EPHE and ENS. The collaboration mainly consists of extracting key information from written documents in order to provide semantic indexing and search functionalities.
- We are a member of the labex TRANSFERS, which also includes a Digital Humanities group mainly working on databases and maps.
- We are collaborating with the Medialab at Sciences Po Paris on the Climate Negotiation analysis project.
Recent applications and demos
- Climate change analysis
- Mapping the Bentham corpus
- Entity Linking applied to the PoliInformatics corpus
- SEM, our part-of-speech tagger, chunker and named entity recognizer for French
Three Selected Publications
- Achim Stein, Sophie Prévost. Syntactic annotation of medieval texts: the Syntactic Reference Corpus of Medieval French (SRCMF). P. Bennett, M. Durrell, S. Scheible and R. Whitt. New Methods in Historical Corpus Linguistics, Narr Verlag, pp.275-282, 2013, Corpus Linguistics and International Perspectives on Language, 978-3-8233-6760-4.
- Estelle Tieberghien, Frédérique Mélanie-Becquet, Pablo Ruiz Fabo, Thierry Poibeau, Melissa Terras, and Tim Causer. Mapping the Bentham Corpus. Digital Humanities 2016, Jul 2016, Krakow, Poland.
- Pablo Ruiz, Clément Plancq, Thierry Poibeau. More than Word Cooccurrence: Exploring Support and Opposition in International Climate Negotiations with Semantic Parsing. LREC: The 10th Language Resources and Evaluation Conference, May 2016, Portoroz, Slovenia. pp. 1902-1908, 2016.