Starting with Ruby and GDELT

GDELT, or Global Data on Events, Location and Tone, is a big, free database, updated daily, that claims to monitor "the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages" and identify "the people, locations, organizations, counts, themes, sources, and events driving our global society every second of every day." It's gotten some media attention in the last few years, as well as criticism. Many objections point out that GDELT presents itself, and is often used, as a comprehensive database of events, rather than media coverage of those events.

GDELT's servers ingest hundreds of thousands of news reports and use natural language processing to identify discrete "events." If GDELT's algorithms determine that another article references the same event, it's added to the same entry in the database; each entry records how many times the event has been mentioned in world media. A particularly newsworthy event might have dozens of media mentions, but would still occupy just a single entry in the GDELT database. Conversely, a single news story is likely to yield multiple events.

No doubt, GDELT's algorithms are sophisticated, but tapping this database means you're trusting a computer to distinguish between one event and the next, code it as a "protest," "meeting," "missile strike," etc., and geocode it meaningfully. Plus, as lots of people pointed out with reference to 538's notorious kidnapping analysis, it's pretty naïve to assume that news coverage accurately mirrors events on the ground.

All those caveats notwithstanding, GDELT does seem like a good source for tracking how events are covered in global media, rather than events themselves. Given its scale, GDELT could work well for media analysis. Everything below is an initial attempt at using GDELT data for this purpose -- plus a chance to practice some Ruby.

I downloaded data for July 29, 2014 from the GDELT site, a TSV file with nearly 150,000 entries. I added column headers at the top so I'd be able to query different columns by name rather than position. Then I used Ruby's CSV class to import the file. (GDELT files have a .CSV extension, despite being tab-separated.)

require "csv"
# Options are passed as keyword arguments; a positional hash raises ArgumentError on Ruby 3.
parsed_file = CSV.read("20140729.export.CSV", col_sep: "\t", headers: true, header_converters: :symbol)
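
To see what the header_converters: :symbol option does without loading the full file, here is a minimal sketch that parses a tiny made-up sample. The three column names are GDELT fields used later in this post; the two data rows, the SAMPLE_TSV constant, and the parse_gdelt helper are invented for illustration.

```ruby
require "csv"

# Made-up sample: a header row plus two tab-separated event rows.
SAMPLE_TSV = "EventRootCode\tNumMentions\tActionGeo_CountryCode\n" \
             "19\t12\tIS\n" \
             "04\t3\tSY\n"

# Hypothetical helper: parse tab-separated text and symbolize the headers.
def parse_gdelt(text)
  CSV.parse(text, col_sep: "\t", headers: true, header_converters: :symbol)
end

table = parse_gdelt(SAMPLE_TSV)
p table.headers            # header names, downcased and turned into symbols
p table[0][:nummentions]   # values stay strings unless converted
```

The header "ActionGeo_CountryCode" becomes :actiongeo_countrycode, which is why the code below looks up columns with lowercase symbols.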

GDELT codes events according to the CAMEO (Conflict and Mediation Event Observations) system, allowing users to, say, pull out all the protests that occurred within a particular date range. (GDELT events are also geocoded, allowing researchers to produce maps like this.) As an experiment, I designated all the codes between 15 ("exhibit force posture") and 20 ("use unconventional mass violence") as "violent" events, and tallied the number of mentions GDELT recorded for each event taking place in relation to Israel.

country = "IS" #"IS" = "Israel"
mentions = 0
violent_events = []

# True when the row has a "violent" root code (15-20) and is tied to the given country.
def violent_coding?(row, country)
  %w[15 16 17 18 19 20].include?(row[:eventrootcode]) &&
    row[:actiongeo_countrycode] == country
end

parsed_file.each do |row|
  next unless violent_coding?(row, country)

  mentions += row[:nummentions].to_i
  # Record each coordinate/URL combination only once.
  entry = "#{row[:actor1geo_lat]}, #{row[:actor1geo_long]}, #{row[:sourceurl]}"
  violent_events << entry unless violent_events.include?(entry)
end
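
It can help to keep the "violent" root codes next to their labels, so the choice of categories is easy to inspect and change. The sketch below is one way to do that. The labels for 15 and 20 are the ones quoted above; the labels for 16 through 19 are my recollection of the CAMEO codebook and should be checked against it. VIOLENT_ROOT_LABELS and violent_label are hypothetical names.

```ruby
# Root codes treated as "violent" in this post.
# Labels for 16-19 are assumptions; verify them against the CAMEO codebook.
VIOLENT_ROOT_LABELS = {
  "15" => "exhibit force posture",
  "16" => "reduce relations",
  "17" => "coerce",
  "18" => "assault",
  "19" => "fight",
  "20" => "use unconventional mass violence"
}.freeze

# Hypothetical helper: return the label for a root code, or nil if the
# code is outside the 15-20 range.
def violent_label(root_code)
  VIOLENT_ROOT_LABELS[root_code.to_s]
end

p violent_label("19")
p violent_label("04")
```

Seeing the labels side by side also makes it clear that not every code in the range describes physical violence, so the "violent" bucket is a rough one.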

This yielded a number for total mentions (49,190) and, as a further experiment, a list of unique latitude/longitude values and article URLs that could be mapped very easily. Changing nummentions to numarticles produces a count of "source documents containing one or more mentions" of each event: 48,354. Again, this is the number of articles mentioning "violent" events taking place with relation to Israel on July 29, 2014, according to GDELT.
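
Rather than editing the loop to switch between nummentions and numarticles, the column could be passed in as a parameter. This is a rough sketch on made-up rows, not the code that produced the numbers above; violent_total, VIOLENT_ROOTS, and COUNT_SAMPLE are hypothetical names.

```ruby
require "csv"

VIOLENT_ROOTS = %w[15 16 17 18 19 20].freeze

# Hypothetical helper: sum one count column (e.g. :nummentions or
# :numarticles) over the "violent" rows tied to a country.
def violent_total(rows, country, field)
  rows.sum do |row|
    violent = VIOLENT_ROOTS.include?(row[:eventrootcode]) &&
              row[:actiongeo_countrycode] == country
    violent ? row[field].to_i : 0
  end
end

# Made-up sample rows, parsed the same way as the real file.
COUNT_SAMPLE = CSV.parse(
  "EventRootCode\tNumMentions\tNumArticles\tActionGeo_CountryCode\n" \
  "19\t10\t8\tIS\n" \
  "18\t5\t5\tIS\n" \
  "04\t7\t7\tIS\n" \
  "19\t3\t2\tSY\n",
  col_sep: "\t", headers: true, header_converters: :symbol
)

puts violent_total(COUNT_SAMPLE, "IS", :nummentions)
puts violent_total(COUNT_SAMPLE, "IS", :numarticles)
```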

I ran the nummentions "violence" count on a few other conflicts and came up with the following numbers.

Nothing surprising here, except possibly that GDELT recorded more coverage of "violent" events in Afghanistan than in Syria.
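
A comparison across several countries could be done in a single pass over the file. The sketch below uses made-up rows, so its output says nothing about real coverage. "IS" is the code used above; "SY" and "AF" are what I understand to be GDELT's country codes for Syria and Afghanistan and should be verified. mentions_by_country, VIOLENT_CODES, and COUNTRY_SAMPLE are hypothetical names.

```ruby
require "csv"

VIOLENT_CODES = %w[15 16 17 18 19 20].freeze

# Hypothetical helper: total nummentions for "violent" rows, per country code.
def mentions_by_country(rows, countries)
  totals = countries.each_with_object({}) { |code, hash| hash[code] = 0 }
  rows.each do |row|
    code = row[:actiongeo_countrycode]
    next unless totals.key?(code) && VIOLENT_CODES.include?(row[:eventrootcode])

    totals[code] += row[:nummentions].to_i
  end
  totals
end

# Made-up sample rows; country codes other than "IS" are assumptions.
COUNTRY_SAMPLE = CSV.parse(
  "EventRootCode\tNumMentions\tActionGeo_CountryCode\n" \
  "19\t10\tIS\n" \
  "18\t4\tSY\n" \
  "19\t6\tAF\n" \
  "05\t9\tAF\n" \
  "19\t2\tIZ\n",
  col_sep: "\t", headers: true, header_converters: :symbol
)

p mentions_by_country(COUNTRY_SAMPLE, %w[IS SY AF])
```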

If you do actually map these GDELT entries, you'll find that these aren't events taking place "in" so much as "relating to" each country. For example, GDELT recorded this article, geocoded to Somalia, as having an "ActionGeo_CountryCode" of "IS," for Israel. But this isn't important if you're simply using GDELT to gauge global media coverage.

Next, I'd like to graph these numbers for a longer period of time against casualty estimates for each conflict -- assuming I can find some reliable numbers.
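
Covering a longer period would mean looping over one file per day. As a starting point, here is a small sketch that builds the list of file names for a date range. It assumes every daily file follows the same naming pattern as the 20140729.export.CSV file used above; daily_export_names is a hypothetical helper, and downloading the files is left out.

```ruby
require "date"

# Hypothetical helper: file names for each day in the range, assuming the
# YYYYMMDD.export.CSV pattern of the file used in this post.
def daily_export_names(first_day, last_day)
  (first_day..last_day).map { |day| "#{day.strftime('%Y%m%d')}.export.CSV" }
end

puts daily_export_names(Date.new(2014, 7, 28), Date.new(2014, 7, 30))
```

Each name could then be fed to CSV.read with the same options as before, keeping one total per day for graphing.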