Follow the policy

Image courtesy of Will Hascall

In an earlier article, “For and by the People,” we discussed several ways you can keep track of what the state legislature is doing. In this installment, we will look at one way to use the information these services provide and at how the services could be improved to increase access for all voters. At times this will get a little technical. In case you are not interested in exactly how things work, I will keep it short.

The computer tools described here run on a low-cost single-board computer running a version of Linux. I use this combination because I know it well as a tinkerer who first started using Unix, the system Linux was modeled on, more than 30 years ago. If you run Windows or macOS, there may be ways to do much of the same on those systems as well.

The first thing I do is download all the state journal pages and every single proposed bill or resolution from the state. It helps to have enough storage on your computer because that is a lot of PDF files. For all proposed resolutions and bills as of the morning of this writing, that is 760 PDFs totaling 80 megabytes (MB) of disk space. To automatically download every PDF, or every PDF that you do not already have, I use the tool wget. You can tell it to download only the files you want, so in my case I tell it to download any file that ends in “pdf” if I do not already have it. These files tend to be small, so this goes really quickly. This morning’s update of 40 files, 2.2 MB, took only 9 seconds.
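As a rough sketch of the download step, the Python snippet below builds a wget command of the kind described above and runs it. The URL and folder name are placeholders rather than the state's real address, and the flags are one reasonable choice assumed for illustration, not necessarily the exact command used for this article.

```python
import subprocess

# Placeholder values (assumptions): substitute the real listing page and a folder you choose.
LISTING_URL = "https://legislature.example/bills/"
DOWNLOAD_DIR = "bills"


def build_wget_command(url, dest_dir):
    """Return a wget command that fetches only new PDF files linked from url."""
    return [
        "wget",
        "--recursive",        # follow the links found on the listing page
        "--level=1",          # but only one level deep
        "--no-parent",        # never climb above the starting folder
        "--no-directories",   # save every file into one flat folder
        "--accept", "pdf",    # keep only files whose names end in "pdf"
        "--no-clobber",       # skip any file that is already on disk
        "--directory-prefix", dest_dir,
        url,
    ]


def download_new_pdfs(url=LISTING_URL, dest_dir=DOWNLOAD_DIR):
    """Run wget (it must be installed) and return its exit code."""
    return subprocess.run(build_wget_command(url, dest_dir)).returncode
```

Running `download_new_pdfs()` again the next morning should fetch only the PDFs that are not already in the folder, which is why a daily update can finish in seconds.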

For this exercise I will focus only on proposed bills and resolutions. At this point you have a bunch of files, which is great if you want quick and easy access to a bill whose designator you happen to know, such as SB134 or AB71. But what if you don’t know which file contains what you are looking for? I do not know of a way to use Adobe Reader to search multiple PDFs, but there are a few tools that make this possible. First, you need to pull the text out of the PDFs and put it into text files. This is often lumped in with Optical Character Recognition (OCR), although when a PDF already carries its text, as these appear to, the text can be extracted directly without recognizing characters from an image. In Linux you can do this easily with a program named Ghostscript. Using a command that loops through all the PDFs, it pulls the text from each page and stores it in a plain text file. This procedure is not perfect because of the way these PDFs are formatted, but it does a pretty good job overall.
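The loop over the PDFs might look something like the Python sketch below. It assumes the PDFs carry embedded text that Ghostscript's `txtwrite` output device can read, that the `gs` program is installed, and that the files sit in a folder named `bills`, which is a placeholder. The function names are made up for this example.

```python
import subprocess
from pathlib import Path


def build_gs_command(pdf_path, txt_path):
    """Return a Ghostscript command that writes the text of pdf_path to txt_path."""
    return [
        "gs",
        "-q",                  # suppress the startup banner
        "-dNOPAUSE",           # do not wait for a keypress between pages
        "-dBATCH",             # exit once the file has been processed
        "-sDEVICE=txtwrite",   # output device that emits plain text
        f"-sOutputFile={txt_path}",
        str(pdf_path),
    ]


def extract_all(pdf_dir="bills"):
    """Create a .txt file next to every PDF that does not have one yet."""
    converted = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        if txt_path.exists():
            continue  # already converted on an earlier run
        subprocess.run(build_gs_command(pdf_path, txt_path), check=True)
        converted.append(txt_path)
    return converted
```

Skipping PDFs that already have a text file keeps the daily run short, in the same spirit as downloading only the files you do not already have.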

A note about the formatting: If you open one of the files I link to above, look at the spacing between words and sometimes even the spacing of letters within words. It is not consistent. Sighted readers can skim past most of these irregularities without losing the flow of the sentence; however, OCR and text-extraction programs are not always so flexible. People who use screen readers, which turn text on the screen into computer-generated speech, will hear strange pauses, single letters, or incomplete words, which can quickly turn even a simple sentence into incomprehensible babble.
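To get a feel for how badly a given line of extracted text is affected, a small check like the one below can help. It is only an illustrative sketch: the function names are invented for this example, and counting lone letters is a crude assumption about what damaged text looks like, not a standard measure.

```python
import re


def collapse_spaces(line):
    """Shrink every run of spaces or tabs to a single space."""
    return re.sub(r"[ \t]+", " ", line).strip()


def stray_letter_ratio(line):
    """Return the share of tokens that are lone letters other than 'a', 'A' and 'I'."""
    tokens = [t for t in collapse_spaces(line).split(" ") if t]
    if not tokens:
        return 0.0
    stray = [
        t for t in tokens
        if len(t) == 1 and t.isalpha() and t not in ("a", "A", "I")
    ]
    return len(stray) / len(tokens)
```

A line such as `"T h e bill"` scores 0.75, while `"The bill was read"` scores 0.0, so lines with a high score are good candidates for a closer look by hand.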

So that this does not get too long, I will stop here. Next time we will explore ways to search the text files and locate interesting trends in everything we downloaded.