Salomé Do

I work on testing the validity and reliability of machine/deep learning methods for computational social sciences, especially for computer-aided content analysis.

Postdoctoral researcher

Coming from a mathematics and computer science background, Salomé later specialized in statistics, probability and natural language processing at the Ecole Nationale de la Statistique et de l'Administration Economique. During her PhD (supervised by Jean-Philippe Cointet, médialab, and Thierry Poibeau, LATTICE), she argued that combining pretrained large language models (LLMs) with a careful supervised learning approach could support traditional content analysis, making it possible to analyze virtually any number of documents in a very short span of time, given some human-annotated data that serves as a learning base for the model.
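A minimal sketch of this supervised approach, assuming the Hugging Face transformers and datasets libraries; the model name (camembert-base), the labels and the annotated examples below are placeholders, not her actual pipeline:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical human-annotated examples: the "learning base" coded by humans.
annotated = Dataset.from_dict({
    "text": ["First press article ...", "Second press article ..."],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("camembert-base")  # placeholder model

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

annotated = annotated.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "camembert-base", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=annotated,
)
trainer.train()  # fine-tune the pretrained model on the human-coded data

# Once fine-tuned, the model can label the rest of the corpus in one pass:
# trainer.predict(unlabeled_corpus)
```

The point of the setup is that only a small human-annotated learning base is required; the fine-tuned model then extends the coding to the whole corpus.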

Thanks to recent advances in NLP, and following this supervised approach, computer-aided content analysis could save human coders' time and yield human-level results on potentially very hard coding tasks. For instance, detecting which frames are used in a press article is very difficult because of their abstract, linguistically disembodied nature, which demands a high level of text understanding. In a longitudinal study design, tracking the evolution of frames over time using only human coders would be highly time- and money-consuming, justifying the need for computer-aided content analysis.

However, even the best-trained models make errors. In a supervised training setting, an evaluation set is always kept apart from the training set so that error metrics can be computed on samples the model has never been exposed to. In a generative setting, this is not always the case, and performance metrics are reported even less often.
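A schematic illustration of that held-out evaluation, using scikit-learn; load_annotations and fit_classifier are hypothetical placeholders:

```python
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

texts, labels = load_annotations()  # hypothetical loader for human-coded data

# The evaluation set is kept strictly apart from the training set.
train_texts, eval_texts, train_labels, eval_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

model = fit_classifier(train_texts, train_labels)  # hypothetical training step
predictions = model.predict(eval_texts)

# Error metrics are computed only on samples the model has never seen.
print(classification_report(eval_labels, predictions))
```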

Turning a blind eye to these errors, even when they are comparable to a human coder's occasional mistakes, threatens the scientific validity of these studies, especially once the errors are aggregated at the scale of a whole corpus. This issue triggered her statistician side: beyond the "AI hype", is it even valuable to exhaustively "analyze" a corpus using an (inevitably error-making) model when well-sampled expert annotations could provide acceptable estimates? What are the uncertainty measures associated with statistics computed from model predictions? Does an exhaustive analysis of the corpus with an NLP model reduce the estimate's associated uncertainty, or do the model's errors just generate more noise around the measure? Modern deep learning models such as LLMs are complex computational black boxes, which makes uncertainty measures difficult to provide.
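A toy simulation of this trade-off, with entirely invented numbers (the corpus size, frame prevalence and model error rates are assumptions): a well-sampled set of expert annotations yields an unbiased estimate with a quantifiable confidence interval, while exhaustive coding by an error-prone model removes sampling noise but leaves a bias that no corpus size can fix:

```python
import numpy as np

rng = np.random.default_rng(0)

N, true_p = 100_000, 0.30          # assumed corpus size and frame prevalence
corpus = rng.random(N) < true_p    # ground-truth label of each document

# Option A: expert annotation of a random sample of n documents.
n = 1_000
sample = rng.choice(corpus, size=n, replace=False)
p_hat = sample.mean()
se = np.sqrt(p_hat * (1 - p_hat) / n)              # binomial standard error
print(f"sampled estimate: {p_hat:.3f} +/- {1.96 * se:.3f} (95% CI)")

# Option B: exhaustive coding of the whole corpus by an imperfect model.
sensitivity, specificity = 0.90, 0.85              # assumed model error profile
preds = np.where(corpus,
                 rng.random(N) < sensitivity,        # true 1s kept w.p. 0.90
                 rng.random(N) < 1 - specificity)    # true 0s flipped w.p. 0.15
print(f"exhaustive model estimate: {preds.mean():.3f} (true value: {true_p})")

# Option B has almost no sampling noise but a systematic bias:
# E[estimate] = true_p * sensitivity + (1 - true_p) * (1 - specificity) = 0.375
```

Under these invented error rates, the exhaustive model-based estimate lands near 0.375 rather than 0.30, illustrating why error-free-looking corpus-scale statistics still need uncertainty measures.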

Her research goal during this postdoctoral contract is to contribute to the discussion on the validity, reliability and reproducibility of NLP models used in computational social sciences, and to help build a mathematical framework for uncertainty estimation where possible.

In addition, within the engineering team, Salomé's mission is to support the ongoing development of the médialab's AI computing resources.