Thompson & Thomson: Desktop deduplication software
Guillaume Plique will present a collection of deduplication algorithms and show how they are used in various social sciences projects.
In the social sciences, it is quite common to deal with manually transcribed traces coming from archives, audio recordings and similar sources. It is also common to cross-reference data from various systems that rely on different structures and norms. In addition, many textual databases contain orthographic variants for the same object, person or concept. It then becomes hard to compute legitimate aggregations without manually cleaning the data. Yet, in the Big Data era, data cleaning and harmonisation are becoming a cyclopean task.
It does not have to be that way: it is now largely possible to assist this work with computers and specialized algorithms, all while keeping humans central to the process. During this seminar, Guillaume Plique will give a guided tour of those mechanisms and algorithms and present how they are used in social sciences projects. Finally, there will be time to discuss ways of improving existing tools in order to make this kind of work easier in the future.
Guillaume Plique joined the médialab team in 2013. He assists social sciences researchers with the lab's various projects and helps them with methodology.
Having nurtured strong skills in web mining, machine learning and data visualisation, he also develops and maintains some of the lab's numerous open source tools and libraries.
The seminar is open to all, subject to availability. Pre-registration is mandatory.