Tag Archives: data journalism

While gathering data, I learned Google doesn’t know everything

'Google Logo in Building43' photo (c) 2010, Robert Scoble - license: http://creativecommons.org/licenses/by/2.0/

Simply collecting the data for this story about the Berlin Police Department is more complicated than I first suspected. 

It’s my first data journalism story and I wanted something challenging — a project where I would learn — but something doable. Studying my hometown police department’s daily blotter for the month of January seemed reasonable and interesting.

In last week’s post, I told you Google’s search engine turned up one valid entry after another. At first, it was easy: I went from one .pdf to another, downloading the files to my computer. But after downloading the 13th .pdf, I found out Google had not brought up all the results.

The last page of the search results had three documents that were irrelevant. I needed something more than a search engine to get all this data.

At first, I thought the police department was the cause, so I waited a few days before running the search again. But the same search on Jan. 23 returned the same results.

That’s when I decided to manipulate the URL of one of the documents that was there in hopes of finding documents not retrieved by Google.

I started with the URL to the daily police blotter for Jan. 7:   http://www.berlinpd.org/images/pdfs/DAILY%20BLOTTER%201-07-2014.pdf

Since the date is in the address, I simply changed it to the date of a document I didn’t have. The document loaded; I downloaded it and kept changing the date until I got an error message.

Apparently, the police department had not uploaded that document to the server by the time I looked for it.
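
For readers comfortable with a little code, the date-swapping trick can be sketched in Python. This is only a sketch, not what I actually did: it assumes every file follows the same naming pattern as the Jan. 7 URL (month without a leading zero, day with one), and the names `blotter_urls` and `fetch` are made up for illustration.

```python
from datetime import date, timedelta
import urllib.error
import urllib.request

# Pattern copied from the Jan. 7 URL; assuming only the date part changes.
BASE = "http://www.berlinpd.org/images/pdfs/DAILY%20BLOTTER%20{m}-{d:02d}-{y}.pdf"


def blotter_urls(start, end):
    """Yield (day, url) for each day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day, BASE.format(m=day.month, d=day.day, y=day.year)
        day += timedelta(days=1)


def fetch(url, path):
    """Save the file at url to path; return False if the server reports an error."""
    try:
        with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
            out.write(resp.read())
        return True
    except urllib.error.HTTPError:
        # e.g. the blotter for that day has not been uploaded
        return False


# Build the candidate URLs for January 2014 (no network access happens here).
urls = dict(blotter_urls(date(2014, 1, 1), date(2014, 1, 31)))
print(urls[date(2014, 1, 7)])
```

Calling `fetch` on each URL in turn would mirror the manual process: keep the files that download, and note the dates that return an error.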

Here’s the lesson: A search engine is a good starting point when looking for information, but it has limitations.

The next step was converting the data into something I could use. 

You can’t easily work with data in .pdf format because .pdfs are designed for reading and publishing. You need it in a spreadsheet format such as .xls, something malleable so that it can be played with, measured and counted.

Reading about data journalism, I learned there are ways to convert .pdfs into something usable, but by the sounds of it, a person needed to know a bit of code.

I instead opted for the easy way out and Googled “convert pdf to excel” and found a few websites that do it for free.

After combining 22 .pdf documents into one with merge.smallpdf.com and converting the 37-page document into an Excel workbook with pdftoexcelonline.com, I had an Excel file I could use.

The only problem? The .pdf converter made each page of the .pdf into a separate sheet in the Excel workbook. After trying to find a quick solution online today, I simply copied and pasted each sheet into a “Master List” in the workbook.
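
The copy-and-paste step could also be scripted. Here is a minimal Python sketch, assuming each sheet has already been loaded as a list of rows (reading them out of the workbook would need a spreadsheet library, which is not shown) and that every sheet repeats the same header row. The function name `merge_sheets` and the sample rows are invented for illustration; they are not real blotter data.

```python
def merge_sheets(sheets):
    """Combine per-page sheets into one list of rows, keeping a single header."""
    master = []
    header = None
    for rows in sheets:
        for row in rows:
            if not any(str(cell).strip() for cell in row):
                continue  # skip completely blank rows
            if header is None:
                header = row  # the first non-blank row is treated as the header
                master.append(row)
            elif row == header:
                continue  # drop headers repeated on later pages
            else:
                master.append(row)
    return master


# Made-up sample: two "pages", each starting with the same header row.
sheets = [
    [["Date", "Time", "Incident", "Street"],
     ["1/7/2014", "08:15", "Alarm", "Example St"]],
    [["Date", "Time", "Incident", "Street"],
     ["", "", "", ""],
     ["1/8/2014", "14:40", "Traffic stop", "Sample Rd"]],
]
master = merge_sheets(sheets)
print(len(master), "rows in the master list")
```

Treating the first non-blank row as the header is a guess about how the converter lays out each page; it would need checking against the real workbook.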

It probably needs copyediting, but I have three weeks’ worth of data that I can start exploring.

How I’m teaching myself data journalism

'I Love Spreadsheets' photo (c) 2012, Craig Chew-Moulding - license: http://creativecommons.org/licenses/by-sa/2.0/

Lol. Not yet.

Last week, I found myself staying up late and getting a little too excited over Excel spreadsheets.

The amount of nerdiness disgusted me at first, but I’ve come to terms with it. Data journalism jobs are in demand, and they fit into an evolving world of 21st Century media, of Wikileaks, PGP encryption, social media and SEO rankings.

Database journalism, from what I understand, is the process of analyzing data to find stories that serve the public interest. To do the job effectively, journalists need to learn a whole new toolbox of skills: Microsoft Excel, code, a bit of statistics and, *gasp* math.

But after the learning curve comes the ability to present better information to the public. Sometimes, journalism feels like parroting the he-said, she-said of politics and business.

While Mark Twain would argue “there are three kinds of lies: Lies, damned lies, and statistics,” statistics and numbers bring a logical weight to news stories, a grounding.

Last week, I googled “data journalism.” The first hit was this free e-book, created by The European Journalism Centre and the Open Knowledge Foundation.

After reading how data journalism is important for 21st Century journalism, how the marriage of the press and data has already changed the world, I skipped to the pith of the book — a step-by-step guide to doing data journalism.

And this is where I decided to get involved. It’s one thing to read how to do something; the skill is then mostly forgotten, unpracticed. It’s another to actually go out and do it.

So I lined up a possible project: analyzing data about my hometown of Berlin, Conn.

The first step was to get some data. 

I searched by file type (a .xls document is ideal) and I narrowed my search down until I was searching a specific website. Finally, I found something promising when I typed “2014 site:berlinpd.org filetype:pdf” into Google.
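
The same kind of narrowed search can be put together in code. This is a small Python sketch that builds a Google search URL from the `site:` and `filetype:` operators I typed by hand; the helper name `search_url` is invented for illustration.

```python
from urllib.parse import urlencode


def search_url(terms, site=None, filetype=None):
    """Build a Google search URL, optionally limited to one site and file type."""
    parts = [terms]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    # urlencode escapes spaces and colons so the query is safe inside a URL
    return "https://www.google.com/search?" + urlencode({"q": " ".join(parts)})


url = search_url("2014", site="berlinpd.org", filetype="pdf")
print(url)
```

Pasting the printed URL into a browser should run the same query as typing it into the search box.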



I found a promising vein of information on the Berlin Police Department’s website. They publish their daily activity blotter to the Internet in a .pdf document.

I figure I could collect data for a time and then quantify it: figuring out the most dangerous streets, what the police do on an average day, and when the department is busiest.
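
Once the data is in rows, that kind of counting is simple to sketch in Python. The column names ("Street", "Time") and the sample rows below are made up for illustration; I am assuming the real blotter has something similar, which I have not confirmed.

```python
from collections import Counter


def tally(rows, column):
    """Count how often each value appears in the given column."""
    return Counter(row[column] for row in rows if row.get(column))


def busiest_hours(rows, time_column="Time"):
    """Count entries per hour, assuming times look like 'HH:MM'."""
    return Counter(
        row[time_column].split(":")[0] for row in rows if row.get(time_column)
    )


# Made-up sample rows, not real blotter entries.
rows = [
    {"Street": "Example St", "Time": "08:15", "Incident": "Alarm"},
    {"Street": "Example St", "Time": "08:50", "Incident": "Traffic stop"},
    {"Street": "Sample Rd", "Time": "14:40", "Incident": "Alarm"},
]

by_street = tally(rows, "Street")
by_hour = busiest_hours(rows)
print(by_street.most_common(1))
print(by_hour.most_common(1))
```

Raw counts like these would only be a starting point; a busy street is not necessarily a dangerous one, so the incident types would matter too.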

There are some challenges, like converting .pdf documents to .xls pages, filling in missing data and actually making sense of it all.

Meanwhile, I will keep you updated.

P.S. Are you a data journalist reading this post? Could you give me any advice? Maybe I missed a really good resource. Let me know in the comments below, or through Twitter. My handle is @jcksndnl.
