While gathering data, I learned Google doesn’t know everything

'Google Logo in Building43' photo (c) 2010, Robert Scoble - license: http://creativecommons.org/licenses/by/2.0/

Simply collecting the data for this story about the Berlin Police Department is more complicated than I first suspected. 

It’s my first data journalism story and I wanted something challenging — a project where I would learn — but something doable. Studying my hometown police department’s daily blotter for the month of January seemed reasonable and interesting.

In last week’s post, I told you Google’s search engine turned up valid entry after valid entry in its results. At first, it was easy: I went from one .pdf to another, downloading the files to my computer. But after downloading the 13th .pdf, I found out Google had not turned up all the documents.

The last page of the search results had three documents that were irrelevant. I needed something more than a search engine to get all this data.

At first, I thought the problem was on the police department’s end, so I waited a few days before running the search again. But the same search on Jan. 23 returned the same results.

That’s when I decided to manipulate the URL of one of the documents I had already found, in hopes of finding documents not retrieved by Google.

I started with the URL to the daily police blotter for Jan. 7:   http://www.berlinpd.org/images/pdfs/DAILY%20BLOTTER%201-07-2014.pdf

Since the date is in the address, I simply changed the date to a document I didn’t have. The document loaded; I downloaded it and I kept changing the date until I got an error message.

Apparently, the police department did not upload that document to the server at the time that I looked for it.
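For anyone comfortable with a bit of code, the date-swapping I did by hand could be sketched in Python roughly like this. It is only an illustration, not what I actually ran. It assumes every blotter follows the same file name pattern as the Jan. 7 document (month without a leading zero, two-digit day, four-digit year), and the function names are my own invention.

```python
from datetime import date, timedelta
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Folder taken from the Jan. 7 blotter URL.
BASE = "http://www.berlinpd.org/images/pdfs/"


def blotter_url(day):
    """Build the URL for one day's blotter (assumed naming pattern)."""
    filename = f"DAILY BLOTTER {day.month}-{day.day:02d}-{day.year}.pdf"
    # quote() turns each space into %20, as in the original address.
    return BASE + quote(filename)


def blotter_urls(start, end):
    """List the URLs for every day from start to end, inclusive."""
    urls = []
    current = start
    while current <= end:
        urls.append(blotter_url(current))
        current += timedelta(days=1)
    return urls


def fetch_if_present(url):
    """Return the file's bytes, or None if the server reports an error."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except HTTPError:
        # An error page is what I saw for blotters not on the server.
        return None
```

Calling `blotter_urls(date(2014, 1, 1), date(2014, 1, 31))` would list one address per day in January, and `fetch_if_present` could then be tried on each, skipping the days that come back empty.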

Here’s the lesson: A search engine is a good starting point when looking for information, but it has limitations.

The next step was converting the data into something I could use. 

Data locked in a .pdf is hard to work with because .pdfs are designed for reading and publishing. You need it in a spreadsheet format such as .xls, something malleable so that it can be played with, measured and counted.

Reading about data journalism, I learned there are ways to convert .pdfs into something usable, but by the sounds of it, a person needed to know a bit of code.

I instead opted for the easy way out and Googled “convert pdf to excel” and found a few websites that do it for free.

After combining 22 .pdf documents into one with merge.smallpdf.com and converting the 37-page document into an Excel workbook with pdftoexcelonline.com, I had an Excel file I could use.

The only problem? The .pdf converter made each page of the .pdf into a separate sheet in the Excel workbook. After trying to find a quick solution online today, I simply copied and pasted each sheet into a “Master List” in the workbook.
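The copy-and-paste job can also be described in code. Here is a minimal Python sketch of the idea, assuming each sheet has already been read into a list of rows and that each page repeats the same header row. Both are assumptions on my part, and the column names in the example are made up.

```python
def build_master_list(sheets):
    """Stack the rows of every sheet into one list.

    sheets is a list of sheets; each sheet is a list of rows,
    and each row is a list of cell values.
    """
    master = []
    header = None
    for rows in sheets:
        for row in rows:
            # Skip rows where every cell is empty.
            if not any(str(cell).strip() for cell in row):
                continue
            if header is None:
                # The first non-blank row is treated as the header.
                header = row
                master.append(row)
            elif row == header:
                # Drop the header repeated at the top of later sheets.
                continue
            else:
                master.append(row)
    return master


# Hypothetical two-page example.
page_one = [["Date", "Incident"], ["1/7", "Theft"]]
page_two = [["Date", "Incident"], ["", ""], ["1/8", "Alarm"]]
master_list = build_master_list([page_one, page_two])
```

The result keeps one header row followed by the entries from every page in order, which is what the “Master List” sheet holds.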

It probably needs copyediting, but I have three weeks’ worth of data that I can start exploring.
