Bootcamp: Data Driven Journalism

So much of the training and retraining of journalists seems to be focused on getting them to be multimedia reporters, backpack journalists or one of the other buzzwords we use for collecting audio and visual content and presenting it online.

Multimedia is one of three things that make online journalism different from offline journalism, but the other two things — interactivity and user-control — depend largely on journalists understanding data driven journalism. This isn’t about numbers, but about structured data. Here’s a bootcamp that’s intended to introduce journalists to the tools and concepts of structured data and data driven journalism.

As with all classes, this one has some required reading before you start:

I start with Phil Meyer’s “Precision Journalism” for two reasons. First, it illustrates that a lot of what we are talking about is not new. A lot of it is just a way of making computer assisted reporting more visual and more transparent. Second, he does a nice job of explaining the mindset you need to have going into any data driven story. And mindset is infinitely more important than technical skill.

Phil Meyer argues that journalists and scientists share many similar values.

  1. Skepticism
  2. Openness – Transparency – Replicability
  3. Operationalization (what can be observed and described)
  4. Tentativeness of “truth” – new facts can be discovered that might change our understanding
  5. Parsimony – given the choice between two explanations, favor the simpler.

Data driven journalism is not about creating a data dump on your audience — simply digitizing and organizing a lot of atomic-level facts. Good data driven journalism helps readers understand the data. This means that journalists need to always have a hypothesis in their mind — and it must be one that is relevant to the audience. The data is used to test that hypothesis.

This kind of thinking leads to a nice view of objectivity that works well in the pursuit of accountability journalism amid the cacophony of the Web: Journalists need to be objective going into their experiment, but conclusive coming out of it. And by remaining transparent during the reporting process, journalists can retain their humility by holding up their reporting to public scrutiny.

What can you do with data driven journalism?

Holovaty argues that a lot of journalism is better done with structured data (or “information with attributes that are consistent across a domain,” as he defines it) than with narrative stories. Let’s take a look at some of the examples that Holovaty and Sands discuss in their pieces. What is the structured data in each? What are the categories of information and what are some of the values found in those categories?

Data can be anything. Addresses, times, crime reports, sports scores, drink specials, votes cast. In short, data IS who, what, when, and where. Sometimes even how and why.

Now, let’s look at some examples of narrative stories in The Daily Tar Heel.

While each of those stories includes compelling anecdotes, each also lacks the data needed to support the assertions made in its headline. By practicing data driven journalism, reporters can make their stories more transparent and more relevant to their audience. That ultimately makes the journalists themselves more credible and more valuable. And those are two things that can differentiate your site from every other on the Web.

Meyer outlines six steps that journalists need to take when dealing with data.

  1. Collect it – This can be done through observation or acquisition of public records.
  2. Store it – Data is often a snapshot in time. But it can be used over and over again.
  3. Retrieve it – Listing, searching, etc.
  4. Analyze it – Are there trends over time? Are there predictors of outcomes?
  5. Reduce it – You may have to use statistics, but don’t tell anyone that.
  6. Communicate it – More and more often, this means data visualization. Charts, graphs, maps, tag clouds, etc.

I’d like to add a seventh step for online data publishing:

  7. Empower readers to do their own steps 3-5 – make the data searchable and sortable so readers can test their own hypotheses. Then, give them a space where they can report back and discuss their findings.
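For readers who like to see ideas in code, here is a minimal Python sketch of the steps above run on a tiny dataset. Everything in it is an assumption made up for illustration: the crime reports, the field names and the `retrieve` helper are hypothetical, and Meyer’s book does not prescribe any particular tool.

```python
import json
from collections import Counter
from statistics import mean

# Step 1, collect: hypothetical crime reports (made-up values).
reports = [
    {"year": 2006, "type": "burglary", "street": "Main St"},
    {"year": 2007, "type": "burglary", "street": "Oak Ave"},
    {"year": 2007, "type": "assault", "street": "Main St"},
    {"year": 2008, "type": "burglary", "street": "Main St"},
    {"year": 2008, "type": "burglary", "street": "Elm St"},
    {"year": 2008, "type": "assault", "street": "Oak Ave"},
]

# Step 2, store: keep the snapshot as JSON text so it can be reused later.
stored = json.dumps(reports)


# Step 3, retrieve: list the records that match every given field value.
def retrieve(stored_text, **criteria):
    rows = json.loads(stored_text)
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


burglaries = retrieve(stored, type="burglary")

# Step 4, analyze: is there a trend over time?
per_year = Counter(r["year"] for r in burglaries)

# Step 5, reduce: boil the counts down to one summary number.
average_per_year = mean(per_year.values())

# Step 6, communicate: a crude text bar chart, one line per year.
for year in sorted(per_year):
    print(year, "#" * per_year[year])

# Step 7, empower the reader: the same helper answers a reader's own question.
main_street = retrieve(stored, street="Main St")
```

The point of the sketch is the order of operations, not the tools. A spreadsheet or a database can play the same role at every step.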

Spreadsheets

Every journalist should know how to use a spreadsheet. The most common spreadsheet is Excel, but you can also use something like Google Docs for free.

The Reporter’s Cookbook provides an excellent four-part Excel training program, complete with sample data files. Microsoft provides some free online tutorials for Excel 2007. Lynda.com also has visual tutorials for its subscribers.

Excel Vocabulary You Should Know Before Moving On

  • Worksheet
  • Rows (records)
  • Columns (fields)
  • Cell
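If you ever move from a spreadsheet to a script, the same vocabulary carries over. Here is a small Python sketch, with made-up county, candidate and vote values, of how each term might map onto a simple data structure. The mapping is one reasonable convention, not the only one.

```python
# A worksheet: one table of data, here a list of rows.
# All values are invented for illustration.
worksheet = [
    {"county": "Orange", "candidate": "Candidate A", "votes": 120},
    {"county": "Orange", "candidate": "Candidate B", "votes": 95},
    {"county": "Durham", "candidate": "Candidate A", "votes": 210},
]

# A row (record): everything known about one item.
row = worksheet[1]

# A column (field): the same attribute taken from every record.
column = [record["votes"] for record in worksheet]

# A cell: one field of one record.
cell = worksheet[2]["county"]
```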

Basic Excel Tasks You Should Master Before Moving On

  • Enter data (by hand, acquire, scrape)
  • Delete data/Clear data
  • Change column width
  • Formatting cells
  • Insert row
  • Insert column
  • Sorting
  • Filtering
  • Using functions
  • Using Formulas

Now that you’re familiar with Excel, let’s try our hands at doing some basic journalism with a real dataset. Step 1, Meyer says, is to collect data. So let’s go get the 2008 election returns from the N.C. State Board of Elections.

You’ll want to download summary.csv, which will actually arrive in a compressed zip format. (More on decompressing files here.) CSV stands for “comma separated values.”
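Excel will open the file for you once it is unzipped, but it can help to see what is actually inside. The Python sketch below builds a small stand-in zip archive in memory and then reads summary.csv out of it. The column names and vote counts are assumptions made up for the example. The real file from the Board of Elections may be laid out differently.

```python
import csv
import io
import zipfile

# Stand-in for the downloaded archive. The columns and values are
# hypothetical, not the state's actual layout.
sample_csv = (
    "county,contest,candidate,party,votes\n"
    "Orange,GOVERNOR,Candidate A,DEM,120\n"
    "Orange,GOVERNOR,Candidate B,REP,95\n"
)
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("summary.csv", sample_csv)
archive.seek(0)

# Decompress and parse: each line becomes a record, and the commas
# separate the fields named in the header line.
with zipfile.ZipFile(archive) as zf:
    with zf.open("summary.csv") as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        rows = list(csv.DictReader(text))

for record in rows:
    print(record["candidate"], record["votes"])
```

Note that every value comes out of a CSV file as text, so the vote counts need to be converted to numbers before you do any arithmetic on them.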

Now that we have this open, let’s see if we can use some basic spreadsheet functions to answer questions that people might have about this data. For example:

  • Did more people vote in the U.S. Senate race or the gubernatorial race? (Use the SUM function)
  • Double check the state’s data. (Use Multiply and Divide to check the raw vote totals and vote percentages)
  • What was the average presidential vote per county? (Divide total votes by number of counties)
  • Which House candidate received the most votes? Who received the least? (video on sorting in Excel)
  • Show just the Republican candidates. (Using filter.)
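The same questions can be asked in a few lines of code. This is only a sketch in Python on a made-up, two-county dataset. The field names, candidates and vote counts are all hypothetical, and the real returns will give different answers.

```python
from collections import defaultdict

# Hypothetical records; the real file has many more rows and columns.
FIELDS = ("county", "contest", "candidate", "party", "votes")
raw = [
    ("Orange", "US SENATE", "Candidate A", "DEM", 130),
    ("Orange", "US SENATE", "Candidate B", "REP", 90),
    ("Durham", "US SENATE", "Candidate A", "DEM", 200),
    ("Durham", "US SENATE", "Candidate B", "REP", 80),
    ("Orange", "GOVERNOR", "Candidate C", "DEM", 120),
    ("Orange", "GOVERNOR", "Candidate D", "REP", 95),
    ("Durham", "GOVERNOR", "Candidate C", "DEM", 190),
    ("Durham", "GOVERNOR", "Candidate D", "REP", 85),
]
rows = [dict(zip(FIELDS, r)) for r in raw]


def total_votes(rows, contest):
    """Add up the votes cast in one contest (the SUM function)."""
    return sum(r["votes"] for r in rows if r["contest"] == contest)


senate_total = total_votes(rows, "US SENATE")
governor_total = total_votes(rows, "GOVERNOR")

# Double check a percentage: one candidate's votes divided by the total.
a_votes = sum(
    r["votes"]
    for r in rows
    if r["contest"] == "US SENATE" and r["candidate"] == "Candidate A"
)
a_share = round(100 * a_votes / senate_total, 1)

# Average vote per county: total votes divided by the number of counties.
counties = {r["county"] for r in rows}
average_per_county = senate_total / len(counties)

# Sort: rank the candidates from most votes to fewest.
by_candidate = defaultdict(int)
for r in rows:
    by_candidate[r["candidate"]] += r["votes"]
ranked = sorted(by_candidate.items(), key=lambda item: item[1], reverse=True)

# Filter: show just the Republican rows.
republicans = [r for r in rows if r["party"] == "REP"]
```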

You should also be sure you understand how to repeat a formula and the difference between absolute and relative cell references in formulas.
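A quick way to picture the difference: when you copy a formula down a column, a relative reference such as B2 moves with it, while an absolute reference such as $B$5 stays put. The Python sketch below imitates that behavior for simple A1-style references. The `fill_down` helper is hypothetical and deliberately simplified. It only handles copying down rows, and it is not how Excel itself is implemented.

```python
import re

# Matches references such as B2, $B2, B$5 and $B$5.
CELL_REF = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")


def fill_down(formula, rows_down):
    """Return the formula as it would read after copying it down some rows."""

    def shift(match):
        col_lock, col, row_lock, row = match.groups()
        if row_lock:
            new_row = row  # "$" before the row number: absolute, unchanged
        else:
            new_row = str(int(row) + rows_down)  # relative: moves with the copy
        return f"{col_lock}{col}{row_lock}{new_row}"

    return CELL_REF.sub(shift, formula)


# Each candidate's share of the total, where the total sits in cell B5.
print(fill_down("=B2/$B$5", 1))
print(fill_down("=B2/$B$5", 2))
```

Without the dollar signs, the copied formula would divide by B6, then B7, and the percentages would be wrong.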

Putting Data on the Web

Now that you are beginning to see some of the cool things you can do when you break journalism down into structured data, it’s time to start thinking about getting that data up on the Web for others to use and see.

Probably the simplest way is to use Google Docs. With the spreadsheets you can easily publish the data as a Web page, or you can give people access to sort and filter the information and even edit it if you’d like.

One of the coolest things you can do with Google Docs is create an online form (tutorial video) that allows your audience to add information to the spreadsheet. These forms can be embedded in any Web page even if you don’t know anything about HTML or programming. Obviously, you wouldn’t want readers changing the number of votes in your election results spreadsheet, but maybe you could allow high school coaches password-protected access to update their game scores. Or restaurant owners could update their weekly specials or seasonal hours.

To see Google Docs (including a form) in action, visit the bottom of Letsbuyanewspaper.com.

Tabelizer is another simple tool for creating HTML tables out of spreadsheet information. It’s pretty self-explanatory.
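To see what “an HTML table out of spreadsheet information” amounts to, here is a minimal Python sketch of the general idea. It is not the code behind any particular tool. The `rows_to_html_table` function and the sample values are assumptions made up for illustration.

```python
from html import escape


def rows_to_html_table(header, rows):
    """Turn a header and a list of rows into a plain HTML table."""
    lines = ["<table>"]
    head_cells = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    lines.append(f"  <tr>{head_cells}</tr>")
    for row in rows:
        # escape() keeps characters such as & and < from breaking the page.
        cells = "".join(f"<td>{escape(str(c))}</td>" for c in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


table_html = rows_to_html_table(
    ["Candidate", "Votes"],
    [["Candidate A", 330], ["B & Co.", 170]],
)
print(table_html)
```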

Data Visualization

Another cool tool for beginning to visualize data is Swivel. As the site itself says, it’s still a little rough around the edges, but check out some examples of what you can do with it.

Speaking of data visualization, everyone loves an interactive map. The Knight Digital Media Center has a good set of tutorials on how to get your spreadsheet information displayed on a Google map. (Andrew Dunn also has a nice seven-step tutorial for simple use of Google maps.)

Dippity is a fairly simple tool for creating very nice-looking timelines.

If you’re feeling really frisky and want to play with a little (very little) JavaScript, check out the Simile project to do timelines, maps and other data presentations.

Finally, if you’ve decided at this point that you’re still a word person, you can always satisfy your data visualization needs with a little tag cloud. Shoot, The New York Times does it…
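A tag cloud is really just a word count with font sizes attached. Here is a rough Python sketch of that idea. The stopword list, the size range and the `tag_weights` function are all assumptions chosen for the example, and real tag cloud tools will make different choices.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real ones are much longer.
STOPWORDS = {"the", "a", "and", "of", "to", "in", "is"}


def tag_weights(text, top_n=5, min_size=12, max_size=36):
    """Map the most common words to font sizes between min_size and max_size."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = max(highest - lowest, 1)  # avoid dividing by zero when all counts tie
    return {
        word: min_size + (count - lowest) * (max_size - min_size) // span
        for word, count in counts
    }


sample = "Data is the who and the what. Data is the when. Data, data, votes and votes."
weights = tag_weights(sample)
for word, size in weights.items():
    print(f"{word}: {size}px")
```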

Databases

But Professor, you say, you’re just talking about spreadsheets. I want to do databases. Sure thing. The first thing you need is to understand the primary difference between a database and a spreadsheet.

Probably the easiest way to think about the difference is to think about a database as a collection of spreadsheets. These spreadsheets are all related to each other, so they’re called relational databases. J-Lab has a nice explanation of relational databases.

But to do very cool things with databases, you need to know some flavor of SQL and a scripting language. Or at least learn Access.
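To make the “collection of related spreadsheets” idea concrete, here is a minimal sketch using SQLite through Python’s built-in sqlite3 module. I picked SQLite and Python only because they need no setup. The table names, column names and vote counts are made up, and any flavor of SQL with any scripting language would show the same thing.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two "spreadsheets" (tables). The candidate_id column in results
# points at the id column in candidates; that link is the relation.
conn.executescript("""
CREATE TABLE candidates (id INTEGER PRIMARY KEY, name TEXT, party TEXT);
CREATE TABLE results (
    candidate_id INTEGER REFERENCES candidates(id),
    county TEXT,
    votes INTEGER
);
INSERT INTO candidates VALUES (1, 'Candidate A', 'DEM'), (2, 'Candidate B', 'REP');
INSERT INTO results VALUES
    (1, 'Orange', 130), (2, 'Orange', 90),
    (1, 'Durham', 200), (2, 'Durham', 80);
""")

# A JOIN follows the relation to combine the two tables in one query.
query = """
SELECT c.name, c.party, SUM(r.votes) AS total
FROM results AS r
JOIN candidates AS c ON c.id = r.candidate_id
GROUP BY c.id, c.name, c.party
ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
conn.close()

for name, party, total in totals:
    print(name, party, total)
```

Each candidate’s name and party are stored once, no matter how many counties report results. That is the practical payoff of splitting the data into related tables.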

There is one non-programming option that has become popular among journalists who want to deploy online databases. Caspio is an online service that allows you to upload your databases or spreadsheets to their servers and create some very simple Web forms. The lowest cost for the service, though, is $480 a year. Non-profits and students who are willing to pre-pay a year can get the price down to $336 a year. Not cheap, but cheaper than hiring a programmer.

Let me end with a final word about why I’m a strong believer in the value of data driven journalism.

First, it adheres to a notion of objectivity that goes beyond “balance.” Its transparency holds journalists accountable and the accuracy it demands should improve our credibility.

Second, online data driven journalism makes our work more relevant to the audience. It helps get the right information to the right people at the right time with a high level of efficiency.

Third, it is the foundation of something I’m calling sustainable journalism: the idea that you can unearth information once and reuse it many times. Journalists who store and present their reporting as structured data can build deep resources of commodity information. On top of those resources, they can spend more time digging up new information and doing the kind of analysis and trend-spotting that will differentiate professionals from amateurs. Probably the two most prominent examples of this are The New York Times Topics and The Washington Post’s Votes Database, which those news organizations link to from almost every story.

Looking for More?

  • Statistics Every Writer Should Know (I know, I know. I’m not a numbers guy either…)
  • The Reporter’s Cookbook (Good list of introductory lessons to spreadsheets, databases and other tools of computer assisted reporting)
  • Syllabus of SMPA 130 at The George Washington University (Makes me wonder why I’m re-inventing the wheel here when The New York Times’s Derek Willis so kindly posts his excellent class right online.)
  • Computer Based Training classes (UNC faculty and staff only)
    • Introduction to Programming
    • Basics of XML Programming
    • SAMS Teach Yourself PHP, MySQL and Apache All in One