Hyphe v0.0.0: first release of our webcrawler
Hyphe, our new webcrawler, allows non-technical users, among which social sciences researchers and librarians, to control precisely how their web corpus is built. Its main feature is to permit both qualitative (selection and qualification) and quantitative (data harvesting, indexing and storage) work on a huge amount of data.

After a couple of months' delay due to the many parallel projects running, the DIME/Web team at the médialab is finally proud to present the very first release of the latest tool we have been working on: Hyphe.

As the e-Diasporas project illustrates, working on controversies can be greatly facilitated by applying Actor-Network Theory and studying the connections between the different actors and categories involved in a controversy. Crawling the websites of these actors and other linked websites, in order to constitute the network of hyperlinks between them, provides a rich source of information to conduct exploratory analysis and confirm hypotheses about the actors' relationships.

Many tools exist to mine the web, two of which we already use at the médialab: the Navicrawler, created by Mathieu Jacomy, and the Issue Crawler from our partner the Digital Methods Initiative in Amsterdam. Existing web-archiving initiatives (in France, the BnF and INA) also show the interest of web data. On the one hand, the Navicrawler lets users build a web corpus step by step while browsing (filtering and qualifying the websites to include in the corpus) and then export the data to create a map based on link topology. On the other hand, the Issue Crawler is a great tool to harvest a vast quantity of data automatically, by "snowball" crawling from site to site and automatically deciding which sites to include on the basis of a quantitative analysis. These two approaches are complementary but could not easily be combined until now.

Hyphe was conceived to address both needs: it is based on the idea that researchers must control how the corpus is built by making the qualitative decisions, such as qualifying or grouping, themselves, while also being equipped with powerful tools capable of handling the huge amount of data available on the web. Hyphe therefore relies on three central principles:
- Explore only within a list of desired websites; other websites are merely discovered, until a human decision is taken to crawl them as well;
- Define as precisely as possible what a website, or rather what we call a "Web Entity", really is: anything from a simple webpage to a whole website, a combination of multiple domain names, or even the combination of a Twitter page, a Wikipedia page and a subpart of a domain (see the illustrative sketch below);
- Easily browse, classify and qualify the corpus from a web interface while constituting it.

Built with modern and robust technologies such as Lucene, MongoDB, Scrapy, Twisted, Thrift, Domino.js, Sigma.js and Bootstrap, Hyphe's current release is only a sneak peek at what it should soon be able to accomplish. The roadmap now includes features such as managing multiple corpora within each instance, bypassing crawling issues (redirections, cookies, javascript-only pages, ...), handling multi-website entities from the web interface, tagging the results, and so on.

After trying it out already on a couple of projects (EMAPS, C-Section WHO, SITPOL...) and spending much time adapting our original ideas and implementations to actual usage, the current alpha version of Hyphe is now ready for anyone to start trying it out and using it individually for research.

This first release was made to be easily installable on an Ubuntu machine. Future versions will be prepared for a broader range of configurations; our first experiences on MacOSX, Debian and CentOS were already encouraging.
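To make the "Web Entity" notion from the second principle above more concrete, here is a purely illustrative sketch picturing an entity as a set of URL prefixes. The class, names and URLs are invented for the example; this is not Hyphe's actual implementation:

```python
# Purely illustrative: a Web Entity pictured as a set of URL prefixes.
# All names and URLs below are invented for the example.

class WebEntity:
    def __init__(self, name, prefixes):
        self.name = name
        # Each prefix delimits one part of the entity's perimeter: a whole
        # domain, a subsection of a site, or a single page elsewhere.
        self.prefixes = prefixes

    def contains(self, url):
        """A crawled page belongs to the entity if it falls under a prefix."""
        return any(url.startswith(prefix) for prefix in self.prefixes)

# One actor scattered across a website, a Twitter page and a Wikipedia page:
actor = WebEntity("Example actor", [
    "http://www.example-actor.org/",
    "https://twitter.com/example_actor",
    "http://en.wikipedia.org/wiki/Example_actor",
])

print(actor.contains("http://www.example-actor.org/publications/"))  # True
print(actor.contains("http://unrelated.org/"))                       # False
```

The point is that the perimeter of each actor is declared explicitly by the researcher rather than guessed by the crawler.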
You can download the package (tar.gz archive) here, and the instructions to install and run it on Ubuntu are available on GitHub.
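Once an instance is running, the web interface is the primary way to work with it, but the core can in principle also be scripted through its JSON-RPC API. The sketch below is a rough illustration: the endpoint URL is a placeholder, and the method names are assumptions to be checked against the documentation on GitHub.

```python
# A minimal sketch of scripting a running Hyphe instance through its
# JSON-RPC core API. The endpoint is a placeholder and the method names
# ("declare_page", "crawl_webentity") are assumptions to be checked
# against the API documentation on GitHub.
import json
from urllib.request import urlopen

API_URL = "http://localhost:8080/api/"  # placeholder: adjust to your instance

def hyphe(method, *params):
    """Send one JSON-RPC request to the core and return the decoded reply."""
    payload = json.dumps({"method": method, "params": list(params), "id": 1})
    with urlopen(API_URL, payload.encode("utf-8")) as response:
        return json.loads(response.read().decode("utf-8"))

# Declare a page: the core attaches it to a Web Entity, creating one if needed.
reply = hyphe("declare_page", "http://www.example-actor.org/")

# The reply carries the Web Entity's id, which a crawl call can then target
# (the exact response shape depends on the version you run):
# hyphe("crawl_webentity", webentity_id)
```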
Released as free software, the whole source code is also available on GitHub under the free and open-source licenses LGPL / CeCILL-C, so that advanced users and developers can contribute to it and redistribute it.

A limited demo is available online here so that anyone can try out Hyphe's functionalities. Since it is still a one-corpus-only instance, anyone can reset the data and crawls or pollute them with undesired content, so remember it is only meant for tryouts: you should rather install your own instance for serious work. Please do not hesitate to let us know through GitHub issues about any problem you encounter while installing, running or even improving Hyphe.

Long live Hyphe, with many more releases to come!