We interrupt this blog

As a Phillies fan, this ruins my week.

Meanwhile, we get a guy whose name, literally translated, means Happy Pete.

In defense of “Raw Data”

Raw data get a bad rap. They’re told they mean nothing, unless someone goes in and adds “context.” If I were data, I’d hate context. Always stealing my thunder. One day data are told they’re the future of journalism, and the next day journalists are complaining about how they’re used.

It’s a tough life, being data. But I think the rollercoaster ride they’ve been through is unwarranted. Why? Because posting data is the 21st century equivalent of what we’ve been doing all along. As in, ask your grandfather how he ran his newspaper, and you’ll start to see some parallels.

If you, like me, don’t have a grandfather who knew jack about the news, let me offer up this paragraph from “The Elements of Journalism” (emphasis my own):

“The individual reporter may not be able to move much beyond a surface level of accuracy in the first story. But the first story builds to a second, in which the sources of news have responded to mistakes and missing elements in the first, and the second to a third, and so on. Context is added in each successive layer.”

I’m a 20-something who’s known computers my whole life, so I can’t speak from experience on this one. But I’ll bet good money that the progression laid out in the above passage played out in but a percentage of stories — that some simply stopped at the first story. There was no added “context”; there were no complaints from sources.

I don’t think that’s bad. Carl Bernstein talks about something called “the best obtainable version of the truth.” And in many cases, that’s what a dataset can offer.

Take public salaries (Please! Really folks, I’ll be here all week). Dozens of newspapers have posted the salaries of their local County, school district, Board of Equalization, PTA, Myrtle’s knitting group, and whatever else they can get their hands on. CAR practitioners have bemoaned the move, saying it’s just not journalism.

Here’s why I take issue. When I worked in the IRE Resource Center, I had the thankless task of processing and reading hundreds of “investigative” stories from all around the globe. Most of them came in around the time of the annual IRE contest, and every year there would be a slew of stories with headlines like, “How much do they make?” It’s the easiest CAR story in the book, right? Find your local municipality’s salary database, ask some questions to fill out a story, and score a talker.

In about 70 percent of the stories I read, that was the methodology. In its entirety. As these were contest entries, I can confidently say that there were no follow-ups to those stories; they were meant to stand on their own.

Were those journalism?

Many will put those stories on a journalism continuum. “Yes, it was journalism, just not very good journalism.” That argument assumes that, just because the information was laid out in story form, journalism took place.

You can’t have it both ways. If a crappy salary story is journalism, then a raw salary database is journalism.

Me? I’d argue it’s somewhere in the middle. Journalistic, perhaps; done with the intent of providing context to the community, rather than context to the numbers.

It’s not a catch-all argument. Few are. But I think we need to admit that, sometimes, data on its own can achieve the same outcome as a story, especially when “findings” in analysis are lackluster. Dare I say it, Wire fiends? In some cases, I even think it allows us to achieve more with less.

EveryBlock, straight from the horse’s mouth

I look at EveryBlock, and say, “This is cool.” But I don’t live in Chicago, San Francisco or New York, so that’s really about as far as I get.

I ask a friend of mine in Chicago — a non-geek, non-journalist, mind you — what he thinks.

“that site is totally awesome…look what happened a few houses down from me on wednesday! http://chicago.everyblock.com/news-articles/by-date/2008/1/23/558835/

The link leads to one of the greatest police blotter entries I’ve ever seen.

So there you have it. EveryBlock is good. People like it and can use it without hurting themselves on all the sharp data.

I still think there’s a “What is journalism” argument lying just under the surface. The good part is, well, people have been arguing about that for a hundred years, and the definition hasn’t really changed. More on that topic later.

Stones, birds and the ghettoization of data

First Bird: I’ve started a blog at work, one complete with a nerdy picture and a couple posts about spreadsheets. I don’t recommend looking at it now, because I haven’t yet found a way to act professionally and run a blog. But, if you wait until 2009, you can take a look here.

Segue: Tomorrow, I’m going to point readers to Matt Waite’s recent lampoon of the Data Desk concept. I’m curious to see what they think.

Second bird: I think he makes some good points. He has a habit of doing that. Rote publication of data shouldn’t be the only way we do what we do. But in some cases, it works. Really well.

Take, for instance, the position of the News-Leader a few weeks back. The City had just been lambasted in a state audit, accused of mismanaging the books in a handful of different ways. Use of city-issued purchase cards, the audit said, appeared to be out of hand. I had the database. So I posted it.

Readers ate it up. They searched through the data, finding interesting purchases. They e-mailed us: “Why did that snake XXXX spend $XXX at XXXX? What could he possibly have needed that for?” Our reporting was better for it. We’re still writing stories, and, when it makes sense, I’m still posting relevant datasets.

They go hand-in-hand with stories. I get miffed if a data-driven story doesn’t point to the related data. Similarly, I get miffed if I have to post data without the context a story offers.

I’m excited about the opportunity to add more value to data in the future. With the P-Card info, for instance, it would have been nice to be able to rank the highest spenders, then allow people to drill into each person’s purchases. It would have been nice to allow for dynamic grouping by supplier, cardholder, or department. It would have been really nice to connect the cardholder names with our employee salary database, and allow readers to go to town. But we’re not there. Yet.
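For the curious, here is a rough sketch of what that wish list might look like in Python. Everything in it is an assumption made for illustration: the rows are invented placeholders, not real P-Card records, and the field names (cardholder, department, supplier, amount) and helper functions are hypothetical, not how our database is actually laid out.

```python
from collections import defaultdict

# Invented placeholder rows; field names are assumptions, not the real schema.
purchases = [
    {"cardholder": "A. Example", "department": "Parks",
     "supplier": "Hardware Store", "amount": 120.50},
    {"cardholder": "B. Sample", "department": "Finance",
     "supplier": "Office Supply", "amount": 45.00},
    {"cardholder": "A. Example", "department": "Parks",
     "supplier": "Office Supply", "amount": 30.25},
]

# Hypothetical salary lookup keyed by the same cardholder names.
salaries = {"A. Example": 41000, "B. Sample": 52000}


def totals_by(rows, field):
    """Sum purchase amounts grouped by any column, largest total first."""
    sums = defaultdict(float)
    for row in rows:
        sums[row[field]] += row["amount"]
    return sorted(sums.items(), key=lambda item: item[1], reverse=True)


def drill_down(rows, cardholder):
    """Return every purchase made by one cardholder."""
    return [row for row in rows if row["cardholder"] == cardholder]


def with_salary(ranked, salary_lookup):
    """Attach a salary to each (name, total) pair; None if no match."""
    return [(name, total, salary_lookup.get(name)) for name, total in ranked]


# Rank spenders, then regroup the same rows by supplier or department.
top_spenders = totals_by(purchases, "cardholder")
by_supplier = totals_by(purchases, "supplier")
by_department = totals_by(purchases, "department")
spenders_with_pay = with_salary(top_spenders, salaries)
```

The one grouping function covers supplier, cardholder and department, which is the "dynamic grouping" part. Matching on names alone is the weak spot: real cardholder and payroll names rarely line up this neatly, so a real join would need cleanup or a shared employee ID.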

At the same time, there is worth to posting the data in basically raw format. Though J-School may have taught us otherwise, our readers are not mouth-breathing Cro-Magnons. Sometimes a list of votes from a Council meeting is more important than the narrative of how heated the debate was on, say, dumpsters. Similarly, opening up a year’s worth of credit card debits is more useful than simply plucking out the big spenders.

As to the issue of site design: Amen. Putting all the databases together makes as much sense as sticking all the blogs together. You don’t read blogs because they’re blogs. (Exhibit A). You read them because the topic is interesting and/or the blogger is incredibly good looking (Exhibit B).

Ahem.

Finally: Site traffic. Our site has seen traffic increase exponentially. We’re now pulling anywhere from 3 to 4 percent of all unique hits to the site, and we have the highest time spent of any area. I think it has to do with making sure readers know it’s relevant. When we ran the P-Card story, we pointed them to the data. We ran a story about the most oft-convicted, then pointed readers to our conviction DB. We wrote about gas pump tests, then allowed readers to see how it affected them.

Again, I’m not saying Matt’s wrong. This is fine for now, but things will change. Still, this is a starting point. Our success has led to more of an investment, both in time and resources. I hope that, at other papers, data successes translate into real investment, too.