Web Ecology Project Twitter Scraper and Database Schema

Re-wrote Ruby Twitter scraper to be more efficient and designed and implemented a normalized database schema for storing tweets on Web Ecology Project’s server.

Revision History

Original Twitter scraper prototype was coded in perl by Ethan Zuckerman (mentioned in a blog post on Apr 13, 2009). The script used Twitter’s URL-based API (since changed) to scrape tweets from a simple search query on a particular term like a hashtag. The script was ported to Ruby by Web Ecology Project member Dave Fisher, who also set up the initial database.

I re-wrote Dave’s code to make the scraper more efficient in how it handled the initial scraping of tweets and in writing to the database. I also designed a database schema for the tweets which organized the various metadata connected to each tweet in specific tables and columns that could be indexed for faster and easier queries across Web Ecology Project’s growing dataset.

Use

My code was used to collect the tweets used in two studies I co-authored, Detecting Sadness in 140 Characters and Afghanistan and its Election on Twitter, and in another study authored by my Web Ecology Project colleagues, The Influentials.