Rachel Bawden working to enhance machine translation models

From linguistics to computer science: putting sentences into their context

British-born Rachel Bawden converted to computational linguistics after obtaining her Bachelor of Arts (BA) in French and Linguistics from Oxford University. ‘I wanted to take a more technical path that would offer more practical applications,’ she tells us. ‘I therefore took the 1st year of the Language Science Masters with a specialisation in linguistic engineering at Sorbonne Nouvelle University, followed by two years at Paris Diderot University, which also offers a computational linguistics programme.’ On completing her Masters, Rachel began a PhD at the LIMSI laboratory, now known as the Interdisciplinary Laboratory of Digital Sciences (LISN).

In 2018, she defended her thesis, entitled Going beyond the sentence: Contextual Machine Translation of Dialogue, at Paris-Saclay University. It focused on improving how machine translation systems handle context:


Certain words (and therefore the sentences they are found in) cannot be understood without context. Take the French word “avocat”, for example, which has two distinct meanings [“avocado” and “lawyer”], or the English word “bank” [“river bank” or “financial institution”]. The challenge is how best to integrate contextual information present in the text or its metadata.

Rachel Bawden

Researcher at Inria, Paris in the ALMAnaCH team

Her outstanding work was awarded the 2019 Thesis Prize by the ATALA (Association for Natural Language Processing). Since then, she has continued along this research axis as part of a collaboration with the WILLOW project-team, in the form of joint supervision of PhD student Matthieu Futeral, who explores the integration of visual context in machine translation.

[Image: photo of two broken drinking glasses. Beside it, the English source sentence and the two possible French translations of the word “glasses”: “lunettes” (spectacles) or “verres” (drinking glasses).] Multimodal machine translation (MMT) generally refers to the use of additional non-textual data in text-based machine translation (MT). In (Futeral et al., 2023), texts are accompanied by images, the aim being to use visual data to improve the translation of ambiguous sentences. The ambiguity of the word ‘glasses’, for example, can be resolved with the help of an image.

A true vocation, offering freedom and stability

Following her PhD, Rachel Bawden joined the University of Edinburgh’s Institute for Language, Cognition & Computation (ILCC) as a postdoctoral researcher. She carried out research into the machine translation of ‘low-resource’ languages, for which there is little data available to train models. The researcher focused on two Indian languages in particular: Gujarati, spoken mainly in the west of India, and Tamil, spoken in the south.

In 2020, Rachel joined the ALMAnaCH project-team led by Benoît Sagot at Inria Paris, as a ‘chargée de recherches’ in Natural Language Processing (NLP) and Machine Translation (MT), a move which was no coincidence:


‘Five years previously, I carried out my Masters internship at Alpage (Inria/Paris-Diderot University Joint Research Unit), ALMAnaCH’s predecessor. Joining ALMAnaCH felt natural to me,’ she tells us. ‘I was also attracted to public research, which for me represents a form of freedom and stability, especially for a profession I do out of passion. Inria offered me a working environment in which I could carry out my research over the long term, without as much pressure for immediate results, because the quality and culmination of work is what counts.’

Making models more robust

Rachel Bawden contributes to numerous research projects in the ALMAnaCH project-team, including MaTOS (Machine Translation for Open Science). Supported by the French National Research Agency (ANR), the project aims to ‘develop new approaches to machine translation for whole scientific documents in French and English, in addition to automatic metrics to assess the quality of the translations produced.’ The researcher is also involved in another ANR project, TraLaLaM, dedicated to exploring the use of large language models (LLMs) for the machine translation of low-resource languages, and in particular dialects and regional languages. These aims are similar to those of COLaF, an Inria challenge (DEFI) dedicated to the collection and creation of text, speech and sign language corpora for French and the other languages of France in all their diversity.

[Image: photo of a punch card from IBM’s Georgetown experiment of 1954.] Almost 70 years after IBM’s Georgetown experiment (punch card above), MaTOS (Machine Translation for Open Science) revisits the machine translation of complete scientific documents in order to facilitate and open up access to scientific knowledge. MaTOS is supported by the French National Research Agency in the framework of the AAPG 2022 - CES 23 (artificial intelligence and data science).

Rachel is also an active member of the PRAIRIE Institute (PaRis AI Research InstitutE) as the holder of a ‘springboard’ chair. ‘My research within PRAIRIE focuses on the robustness of machine translation models, in order to produce high-quality models for texts displaying a high degree of linguistic variation. Such texts can be found notably on social networks, where users will typically write using acronyms, incomplete sentences or spelling mistakes,’ she explains. To support her research, the institute funds a doctoral student, Lydia Nishimwe, whom Rachel co-supervises with Benoît Sagot. ‘Despite dealing with a very different type of text, this research shares similarities with another research topic I have had the opportunity to explore in collaboration with colleagues: the automatic processing of 17th-century French, and specifically its normalisation into contemporary French,’ she points out.

[Image: a sentence in 17th-century French normalised into contemporary French, with the differences in spelling conventions and the linguistic changes highlighted.] A sentence in Early Modern French (from the 17th century) and its normalisation into contemporary French.

A vocation in artificial intelligence

As the diversity of ALMAnaCH projects demonstrates, natural language processing, a sub-field of artificial intelligence, is a booming sector.


‘The field is undergoing extremely rapid changes,’ says Rachel, ‘and with its progress and innovations come new issues. Legal questions have been raised recently, for instance, over the type of data used to develop and train models.’

What advice does the researcher have for young people interested in the field of natural language processing? ‘Dare to change your path, be courageous and don’t hesitate to change direction if you need to. Before starting my Masters, I was given a book for my birthday entitled Speech and Language Processing by Daniel Jurafsky and James Martin. Reading this book was like a sign confirming I had found my vocation.’

[Image: painting by Pieter Brueghel the Elder depicting the “Great Tower of Babel”.] “There are about 7,000 languages in the world. Some aspects of human language appear to be universal or are ‘statistical universals’, i.e. they apply to the majority of languages. For example, each language seems to have nouns and verbs, ways of asking questions or giving commands, and linguistic mechanisms to indicate agreement or disagreement. However, languages also exhibit translation discrepancies and understanding the causes of these can help us to build better machine translation models.” (Speech and Language Processing, Daniel Jurafsky and James Martin. Image: Pieter Brueghel the Elder, The Great Tower of Babel, circa 1563, Kunsthistorisches Museum, Vienna, Austria. © Public domain).

ALMAnaCH: at the heart of natural language processing

Language models, machine translation, text simplification, resource development, the processing of historical corpora thanks to OCR (optical character recognition) and HTR (handwritten text recognition) are just some of the fields of application explored by the members of ALMAnaCH (Automatic Language Modelling and Analysis & Computational Humanities). Created in 2017, the project-team specialises in natural language processing (NLP) and digital humanities (DH). The team’s research covers a wide range of topics including but not limited to neural language models, machine translation, dialogue modelling, language resource development (monolingual, parallel, annotated corpora, lexicons, etc.), interactive AI, evaluation strategies, information extraction, optical character recognition and handwritten text recognition. In November 2023, Benoît Sagot, head of ALMAnaCH, was awarded the Collège de France ‘Computer and Digital Sciences Chair’.

Find out more

Rachel Bawden’s website.
“RoCS-MT: Robustness Challenge Set for Machine Translation”. In Proceedings of the Eighth Conference on Machine Translation, pages 198–216, Singapore. Association for Computational Linguistics. Rachel Bawden and Benoît Sagot. 2023.
“Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation”. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5394–5413, Toronto, Canada. Association for Computational Linguistics. Matthieu Futeral, Cordelia Schmid, Ivan Laptev, Benoît Sagot and Rachel Bawden. 2023.
“Automatic Normalisation of Early Modern French”. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3354–3366, Marseille, France. European Language Resources Association. Rachel Bawden, Jonathan Poinhos, Eleni Kogkitsidou, Philippe Gambette, Benoît Sagot and Simon Gabay. 2022.