
Gazouilloire, made by the médialab

A backend tool for running large-scale Twitter data harvesting over the long term

Tools – Software

Benjamin Ooghe-Tabanou

Harvesting data from Twitter through its API is fairly easy for developers, thanks to the many libraries built for this purpose. However, maintaining such harvesting over the long term in order to build a large longitudinal corpus can be more complex.

Gazouilloire helps set up such data collections and maximizes their exhaustiveness by using both the streaming and search APIs. It also offers a variety of extra features and configuration options, including:

  • harvest all tweets matching a set of keywords, users or URL fragments
  • restrict the harvesting of some keywords to specific time periods
  • filter on the language used in the tweets (as detected by Twitter's algorithms)
  • filter on the geographical zone from which tweets were sent
  • automatically resolve shared shortened URLs
  • retrieve entire conversation threads when a collected tweet replies to another
  • catch up on tweets from up to 8 days in the past, as permitted by Twitter's free API
  • systematically download all image and video content embedded in harvested tweets
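
To make the idea of combining two sources with time-limited keyword filters more concrete, here is a minimal Python sketch. It is not Gazouilloire's actual code or configuration format: the function names (`merge_by_id`, `keyword_is_active`, `matches`), the tweet dictionaries and the keyword table are all hypothetical, chosen only to illustrate how tweets arriving from a stream and from a search might be deduplicated by id and then filtered by keywords that are only tracked during given periods.

```python
from datetime import datetime


def merge_by_id(*sources):
    """Merge tweet iterables, keeping the first copy seen of each tweet id."""
    merged = {}
    for source in sources:
        for tweet in source:
            merged.setdefault(tweet["id"], tweet)
    return list(merged.values())


def keyword_is_active(period, when):
    """Return True if `when` falls inside the keyword's (start, end) period.

    A period of None means the keyword is tracked at all times.
    """
    if period is None:
        return True
    start, end = period
    return start <= when < end


def matches(tweet, keywords):
    """Check whether the tweet text contains a keyword active at its timestamp."""
    text = tweet["text"].lower()
    return any(
        kw.lower() in text and keyword_is_active(period, tweet["created_at"])
        for kw, period in keywords.items()
    )


# Hypothetical keyword table: "election" is only tracked during April 2022.
keywords = {
    "climate": None,
    "election": (datetime(2022, 4, 1), datetime(2022, 5, 1)),
}

# Hypothetical sample tweets, standing in for streaming and search results.
streamed = [
    {"id": 1, "text": "Climate march today", "created_at": datetime(2022, 3, 10)},
    {"id": 2, "text": "Election night!", "created_at": datetime(2022, 4, 24)},
]
searched = [
    {"id": 2, "text": "Election night!", "created_at": datetime(2022, 4, 24)},
    {"id": 3, "text": "Election recap", "created_at": datetime(2022, 6, 2)},
]

corpus = [t for t in merge_by_id(streamed, searched) if matches(t, keywords)]
kept_ids = [t["id"] for t in corpus]
print(kept_ids)  # → [1, 2]
```

In this sketch, tweet 2 appears in both sources but is kept only once, and tweet 3 is dropped because it mentions "election" outside the period during which that keyword is tracked.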