The body PoliTech

Illustration by the author.

Hello again! If you managed to read my other articles on civic involvement through searching out and following state policy proposals online, congratulations: you are officially a geek, or at least really, really bored.

If you waited to read the last article just to see the point of all this, that is fine too. I started this project to show you that there are tools we can use to automate some of our research. Even the best-designed web pages are not easy to do research on, and techniques like these can be used on any website that serves up files for download.

In my previous article I talked about a program called Ghostscript and how it did not work well for what I was doing. I have since found another tool that appears to do much better. If something like this happens to you, do not let it bother you. This sort of thing happens to people all the time in the computer world. There are often so many programs that do similar things, or even so many versions of the same program, that it can be difficult to find what works best. You can search the Web one day and find a tool that looks good, then search another day and find three more that work better.

This new program is called pdftohtml and, as its name implies, it converts PDFs into HTML files, which can be viewed in a web browser or, even better, searched with batch search programs. Linux can be a great tool for people who want to do the same thing many times over, which is called batch processing. This is perfect for the analysis that I want to do. I use a simple command to gather the PDFs, and then a batch processing script goes through each PDF and converts it into HTML. This script is a single command line:

for f in *.pdf; do pdftohtml -c "$f"; done

In English, the line reads: for every file (f) that ends in .pdf (*.pdf), run the command pdftohtml on that file ($f), and once every file has been processed, stop (done). The complex flag (-c) tells pdftohtml to keep the text at the same position in the HTML document as it had in the PDF. This will let me process columns of text separately, which is something I will use later.

Before we look at all these files, I run a batch process that removes some special characters. The tool for this is the stream editor (sed), one of the many command line applications that have been around since the early days of Unix. It is similar to Find and Replace in Microsoft Word, but it can be combined with a for loop, like the one shown earlier, to work on any number of files. This process makes the files not only more readable but also more searchable.
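A cleanup pass of this kind might look like the sketch below. I have not listed which special characters I removed, so treat the substitutions here as assumptions for illustration: they turn non-breaking-space entities (&#160; and &nbsp;) into ordinary spaces and collapse runs of spaces into one. The function names clean_file and clean_all are made up for this example.

```shell
#!/bin/sh
# Hypothetical cleanup helpers. The characters replaced here are an
# assumption for illustration, not the exact list used in the project.

# Rewrite one HTML file in place with the substitutions applied.
clean_file() {
    sed -e 's/&#160;/ /g' \
        -e 's/&nbsp;/ /g' \
        -e 's/  */ /g' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"
}

# Apply clean_file to every .html file in the directory given as $1.
clean_all() {
    for f in "$1"/*.html; do
        [ -e "$f" ] || continue   # skip if the directory has no .html files
        clean_file "$f"
    done
}

# Usage (cleans the current directory):
#   clean_all .
```

Writing to a temporary file and then renaming it avoids relying on in-place editing options, which differ between versions of sed.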

What we will be searching are the 132 journal entries from the Wisconsin State Senate, the 99 journal entries from the Wisconsin State Assembly, and the 867 proposed bills and resolutions from both houses of the state legislature, totaling 8,496 pages.

Let us pick on State Representative Krug and State Senator Testin. Using the search tool called grep, I can see that Representative Krug is listed as sponsoring or co-sponsoring 180 proposals while Senator Testin is listed on 129. With only a few lines of typing, we could catalog every proposed bill or resolution in which they are listed. Both Krug and Testin show up in the journal record, amazingly, 92 times each.
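To give a sense of what "a few lines of typing" can mean, here is a minimal sketch of cataloging the files that mention a name. It assumes the converted HTML files sit together in one directory; the directory name bills, the output file name, and the function name catalog_mentions are all hypothetical. It matches the name as plain text, so it lists every file where the name appears, whether or not the legislator is a sponsor.

```shell
#!/bin/sh
# Hypothetical helper: list, in sorted order, every .html file in
# directory $2 that contains the literal text $1.
catalog_mentions() {
    # -l prints only the names of matching files; -F treats the
    # search term as plain text rather than a pattern.
    grep -l -F -- "$1" "$2"/*.html 2>/dev/null | sort
}

# Usage (assuming the converted bills are in a directory named "bills"):
#   catalog_mentions "Krug" bills > krug_bills.txt
#   wc -l < krug_bills.txt     # how many files mention the name
```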

While these statistics are interesting, they are not particularly useful by themselves. So, let us look for some fun words. Searching only in the proposed bills and resolutions, 87 files mention COVID-19 or coronavirus. That is no surprise considering how much the pandemic has affected everything. There were 23 proposals directly relating to the pandemic regarding face coverings and the federal tax mandate. There were 32 proposals related to vehicles, 20 that mentioned the environment, and 16 that mentioned hunting.
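Word searches like these can be batched as well. The sketch below counts, for each search pattern, how many files mention it at least once, ignoring upper and lower case. The patterns in the usage line are only examples and not necessarily the exact searches behind the numbers above; the directory name bills and the function name keyword_report are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for each pattern after the directory argument,
# print the pattern, a tab, and the number of .html files matching it.
keyword_report() {
    dir=$1
    shift
    for pattern in "$@"; do
        # -l lists matching files, -i ignores case, -E allows
        # alternatives such as 'covid-19|coronavirus'.
        n=$(grep -l -i -E -- "$pattern" "$dir"/*.html 2>/dev/null |
            wc -l | tr -d ' ')
        printf '%s\t%s\n' "$pattern" "$n"
    done
}

# Usage:
#   keyword_report bills 'covid-19|coronavirus' 'vehicle' 'environment' 'hunting'
```

Counting files rather than individual matches keeps one long bill that repeats a word many times from skewing the totals.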

Doing random word searches like this can be interesting and informative, but it is really only the tip of the iceberg of what these files can tell us. With a little extra programming you could find out who voted for what, how many days a legislator was on the floor to vote, who proposed what sort of bills, and so much more.

Using computer automation and batch processing, the general public can gather more information about what is being voted on in the halls of the legislature. These tools, once built, can be powerful aids for advocates, lobbyists, candidates and anyone who just wants to keep tabs on what their government is doing. Using tools like these increases accessibility for all voters, and they should be put to use by as many people as possible in every state of the union. Political parties use tools like this on voters all the time, and I think it is only fair for voters to finally turn the tables and use these tools to hold the parties, their candidates, and our elected representatives accountable.