I'm Robin Howlett, Director of Engineering at SnapLogic, the Intelligent Integration Platform. Married to Sarah Howlett. Father of twins. Boulder, Colorado-based. Made in Ireland.

Latest Articles

Parsing structured data within PDF documents with Apache PDFBox

PDF continues to be a popular document publishing format because users see them as the digital equivalent of paper documents. Unlike websites, often what you see on the PDF will be exactly how it will be printed on a physical page, with the added benefits of easily distributable files and near-ubiquitous support of software able to read this format on almost any standard digital device.

However, when information, especially structured data, is contained within a PDF document and one wishes to extract that content, the format becomes quite difficult for developers to interact with.

In this post, I outline a real-world example of parsing a large PDF file that contains repeated tables of data. I show how the raw text can be extracted and then detail much more low-level control over the text characters positioned within the pages. I also touch on the actual mechanics of working through a problem like this - using tools like Excel to explore and analyze both the nature of the PDF, as well as the vagaries of the data itself.

BCBC Results Snippet

Read on →

Solved: When the Maven Deploy Plugin silently fails to deploy

At SnapLogic, we recently noticed that a particular build job that was responsible for deploying build artifacts to a Nexus repository via Maven had suddenly stopped, well, deploying. What was odd was that no error of any kind was being communicated, even in DEBUG mode.

Examining the history of the repository, what had changed was the addition of a new module (“slbugs”) to the existing multi-module Maven build. This module’s reposibility was to run some code health checks using Google’s error-prone static analysis tool to catch some programming mistakes that our developers occassionally made that had a negative effect at runtime.

What was different about this module versus the others was that it did not use the root POM as a parent (as it was sufficiently different from the other more product-focused modules). The other modules were also configured to use the deployAtEnd parameter of the parent’s maven-deploy-plugin plugin configuration.

The problem was that each of the product modules would log that they would be deployed at the end of the build, but after the last module ran its deploy phase, nothing would happen - no uploading of artifacts would be attempted, no warnings or debug messages logged to explain the inaction, and the build would just end with a SUCCESS status.

The solution turned out to be related to the wonderful world of Maven classloaders.

Read on →

Visualizing inefficient multi-ticket horizontal wagering tickets

Between PPs, result charts, and wagering ticket summaries, the way that horse racing information has been presented to regular fans hasn’t changed very much over the years. In this post, I use data visualizations to try to increase comprehension of advanced wagering advice.

It was May of 2017 when, deep into the development of Handycapper, I took a break and browsed Twitter.

I saw a tweet from Darin Zoccali (@atTheTrack7) referencing a column he published at DRF.com titled, “Trust is a must or your handicapping game is a bust” (archive link).

In this column, Darin explained how a well-known wagering professional, @InsideThePylons (ITP), had called out inefficiencies and mistakes in the construction of Pick 4 tickets Darin had posted in late 2016. Darin reported that this advice had helped him wager smarter, increase churn, and win far more when his opinion was correct.

Intrigued, I wanted to find this conversation, and, with a little Twitter Search-fu, I was able to locate the Pick 4 in question.

As I tried to understand the finer points of ticket construction @InsideThePylons was making, I remember making a mental note that this kind of feedback would be a great candidate to potentially benefit from data visualizations to aid understanding, for the average racing fan, of the interaction between Darin and ITP.

dataviz

Read on →

Building a Google Chrome Extension (Keyboard Shortcuts, Copying to the Clipboard, and Notifications)

I recently had the quite enjoyable and productive experience of writing Pipeline Linker, my first Google Chrome Extension.

As part of my work with SnapLogic, an enterprise integration platform-as-a-service (iPaaS) provider, I often have to navigate to Pipelines (hosted graphical representations of integrations) across multiple environments (both internal and customer-facing), client accounts, and project folders.

In turn, as the manager of my team, I regularly direct team members to Pipelines that require attention through email, Hangouts, JIRA, Zendesk, and Slack.

For whatever reason, our product had not provided an easy way of linking directly to these pipelines (you had to switch to a different tab and perform a search, before right-clicking on a table entry and copying the link).

Over a weekend, I was pleasantly surprised by the ease and speed I was able to learn and implement a Chrome Extension that would address this gap, as well as add some features that I, in particular, value.

cropped

Read on →

Everything you ever wanted to know about SSL (but were afraid to ask)

Or perhaps more accurately, “practical things I’ve learned about SSL”. This post (and the companion Spring Boot application) will demonstrate using SSL certificates to validate and authenticate connections to secure endpoints over HTTPS for some common use cases (web servers, browser authentication, unit and integration testing). It shows how to configure Apache HTTP server for two-way SSL, unit testing SSL authentication with Apache’s HttpClient and HttpServer (Java), and integration testing a REST API within a Spring Boot application running on an embedded Tomcat container.

There are lots of ways for a client to authenticate itself against a server, including basic authentication, form-based authentication, and OAuth.

To prevent exposing user credentials over the wire, the client communicates with the server over HTTPS, and the server’s identify is confirmed by validating its SSL certificate. The server doesn’t necessarily care who the client is, just as long as they have the correct credentials.

An even higher level of security can be gained with using SSL certificates for both the client and the server.

Two-way SSL authentication (also known as “mutual authentication”, and “TLS/SSL with client certificates”) refers to two parties authenticating each other through verifying provided digital certificates, so that both parties are assured of the other’s identity.

Read on →