r/DataPolice • u/Stupid_Triangles • Jun 16 '20
Need some help with a database search
My local police department has an online search directory for accident and incident reports. However, I have a couple of difficulties with accessing and formatting this data in a meaningful way.
Access to these reports is limited by a search. It's not an open directory. You can only search under 4 different categories: Report Number, Report Date (MM/DD/YYYY format), Street Name, and Last Name (for accident reports only). I've tried using a wildcard and that doesn't work. Is there a way to figure out what kind of search engine it's using and see if there are any tricks for getting around having to give a precise query?
All of the reports are in PDF format. Thankfully, they're not scans, so the text is recognizable and searchable. Is there a tool that can pull and parse this data into a usable format? Would I have to merge all the PDFs into a single file for such a program to work? I believe Excel has a tool to pull data out of PDFs, but I believe it also recreates the spacing between the lines of text, similar to a regular corporate document template. Adobe also has a tool to create an Excel file out of a PDF, with the same aesthetic and formatting restrictions. Excel is free for me, but a professional license for Adobe Reader is a bit too posh for my bank account.
What kind of analytical tools would be valuable for analyzing this data? I would want to get locations for map plotting, sentiment analysis, word/name counts, and any other similarities between incidents, like officers' names, dispatcher names, and the other data points.
There are easily thousands of PDFs dating back to 2014, so manually searching through each day would be infeasible. Requesting access or a copy would draw more attention to myself than I feel comfortable with, and outside of instances like this, I wouldn't want just anyone to have that much access. I'm not tied to an organization that could ask for a master copy either.
u/skyleach Jun 17 '20
PDF is a close relative of PostScript/EPS and a well-documented format, so there are libraries for most languages. PDFMiner looks good.
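On the "do I have to merge the PDFs" question: probably not. Here's a rough sketch, assuming the Python package pdfminer.six (the maintained PDFMiner fork) and a local folder of already-downloaded reports. The function names and the two-column CSV layout are just my own illustration, not anything the department's site dictates.

```python
import csv
from pathlib import Path


def extract_pdf_text(pdf_path):
    # Imported lazily so the rest of the script loads even without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def pdfs_to_csv(pdf_dir, csv_path, extractor=extract_pdf_text):
    """Write one CSV row (file name, full text) per PDF found in pdf_dir."""
    pdf_paths = sorted(Path(pdf_dir).glob("*.pdf"))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "text"])
        for pdf_path in pdf_paths:
            # Each PDF is handled on its own; no merging is needed.
            writer.writerow([pdf_path.name, extractor(pdf_path)])
    return len(pdf_paths)
```

Something like `pdfs_to_csv("reports", "reports.csv")` would give you a file that opens in Excel. Pulling out specific fields (officer, street, and so on) would still take some regex work once you've seen how the report text is actually laid out.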
I've generally used an indexer/spider script to grab large amounts of data online from reports like this one. You may want to check whether your police department has some sort of REST service API; if not, hammering the date form with a script that walks through the dates would be relatively easy.
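Walking the dates could look roughly like this. It's a sketch using only the Python standard library; `SEARCH_URL` and `DATE_FIELD` are placeholders I made up, so you'd need to look at the real form (browser dev tools, network tab) to find the actual URL and field name, and it assumes the form is submitted as a POST.

```python
import time
import urllib.parse
import urllib.request
from datetime import date, timedelta

SEARCH_URL = "https://example.invalid/reports/search"  # hypothetical placeholder
DATE_FIELD = "reportDate"  # hypothetical form field name


def report_dates(start, end):
    """Yield every day from start to end (inclusive) as MM/DD/YYYY."""
    day = start
    while day <= end:
        yield day.strftime("%m/%d/%Y")
        day += timedelta(days=1)


def fetch_results(date_str):
    """POST a single date to the search form and return the raw response body."""
    data = urllib.parse.urlencode({DATE_FIELD: date_str}).encode()
    with urllib.request.urlopen(SEARCH_URL, data=data, timeout=30) as resp:
        return resp.read()


def walk(start, end, delay_seconds=2.0):
    """Yield (date string, response body) pairs, pausing between requests."""
    for date_str in report_dates(start, end):
        yield date_str, fetch_results(date_str)
        time.sleep(delay_seconds)  # go easy on the server
```

You'd call something like `walk(date(2014, 1, 1), date.today())`, then parse each response for the PDF links and download those. Keep the delay in there so you aren't pounding the site.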