Parsing structured data within PDF documents with Apache PDFBox

PDF remains a popular document publishing format because users see it as the digital equivalent of a paper document. Unlike a web page, what you see in a PDF is usually exactly what will be printed on a physical page, with the added benefits of easily distributable files and near-ubiquitous reader software on almost any standard digital device.

However, when information, especially structured data, is contained within a PDF document and one wishes to extract that content, the format becomes quite difficult for developers to interact with.

In this post, I outline a real-world example of parsing a large PDF file that contains repeated tables of data. I show how the raw text can be extracted, and then how to take much lower-level control over the individual text characters positioned within the pages. I also touch on the actual mechanics of working through a problem like this - using tools like Excel to explore and analyze both the nature of the PDF and the vagaries of the data itself.

BCBC Results Snippet

Read on →

Solved: When the Maven Deploy Plugin silently fails to deploy

At SnapLogic, we recently noticed that a particular build job that was responsible for deploying build artifacts to a Nexus repository via Maven had suddenly stopped, well, deploying. What was odd was that no error of any kind was being communicated, even in DEBUG mode.

Examining the history of the repository, what had changed was the addition of a new module (“slbugs”) to the existing multi-module Maven build. This module’s responsibility was to run some code health checks using Google’s error-prone static analysis tool, catching programming mistakes that our developers occasionally made and that had a negative effect at runtime.

What was different about this module versus the others was that it did not use the root POM as a parent (as it was sufficiently different from the other more product-focused modules). The other modules were also configured to use the deployAtEnd parameter of the parent’s maven-deploy-plugin configuration.

The problem was that each of the product modules would log that they would be deployed at the end of the build, but after the last module ran its deploy phase, nothing would happen - no uploading of artifacts would be attempted, no warnings or debug messages logged to explain the inaction, and the build would just end with a SUCCESS status.

The solution turned out to be related to the wonderful world of Maven classloaders.

Read on →

Building a Google Chrome Extension (Keyboard Shortcuts, Copying to the Clipboard, and Notifications)

I recently had the quite enjoyable and productive experience of writing Pipeline Linker, my first Google Chrome Extension.

As part of my work with SnapLogic, an enterprise integration platform-as-a-service (iPaaS) provider, I often have to navigate to Pipelines (hosted graphical representations of integrations) across multiple environments (both internal and customer-facing), client accounts, and project folders.

In turn, as the manager of my team, I regularly direct team members to Pipelines that require attention through email, Hangouts, JIRA, Zendesk, and Slack.

For whatever reason, our product had not provided an easy way of linking directly to these pipelines (you had to switch to a different tab and perform a search, before right-clicking on a table entry and copying the link).

Over a weekend, I was pleasantly surprised by the ease and speed with which I was able to learn and implement a Chrome Extension that would address this gap, as well as add some features that I, in particular, value.


Read on →

Everything you ever wanted to know about SSL (but were afraid to ask)

Or perhaps more accurately, “practical things I’ve learned about SSL”. This post (and the companion Spring Boot application) will demonstrate using SSL certificates to validate and authenticate connections to secure endpoints over HTTPS for some common use cases (web servers, browser authentication, unit and integration testing). It shows how to configure Apache HTTP server for two-way SSL, unit testing SSL authentication with Apache’s HttpClient and HttpServer (Java), and integration testing a REST API within a Spring Boot application running on an embedded Tomcat container.

There are lots of ways for a client to authenticate itself against a server, including basic authentication, form-based authentication, and OAuth.

To prevent exposing user credentials over the wire, the client communicates with the server over HTTPS, and the server’s identity is confirmed by validating its SSL certificate. The server doesn’t necessarily care who the client is, just as long as they have the correct credentials.
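
To make the "credentials over the wire" point concrete, here is a minimal sketch of how a basic authentication header is built, using only the JDK. The class and method names (`BasicAuthHeader`, `of`) are my own for illustration and are not taken from the companion application. Note that the credentials are only Base64-encoded, not encrypted, which is why the connection itself needs to be protected by HTTPS.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only.
final class BasicAuthHeader {

    private BasicAuthHeader() {
    }

    // Builds the value of the HTTP "Authorization" header for basic authentication:
    // the word "Basic", a space, then Base64("username:password").
    static String of(String username, String password) {
        String credentials = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Anyone who can read this header can decode it back to the credentials.
        System.out.println("Authorization: " + of("Aladdin", "open sesame"));
    }
}
```

Decoding the header value is a one-line operation for anyone who intercepts it, so basic authentication without TLS offers no real confidentiality.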

An even higher level of security can be gained by using SSL certificates for both the client and the server.

Two-way SSL authentication (also known as “mutual authentication”, and “TLS/SSL with client certificates”) refers to two parties authenticating each other through verifying provided digital certificates, so that both parties are assured of the other’s identity.
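
On the server side in Java, the difference between one-way and two-way SSL comes down to whether the server demands a client certificate during the handshake. The sketch below uses only standard `javax.net.ssl` classes. The class and method names (`MutualTlsParameters`, `requireClientCertificate`, `forDefaultContext`) are hypothetical, and a real server would build its `SSLContext` from its own key store and trust store rather than relying on the JVM default as assumed here.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

// Hypothetical helper for illustration only.
final class MutualTlsParameters {

    private MutualTlsParameters() {
    }

    // Copies the context's default parameters and marks client authentication
    // as mandatory, so a handshake without a trusted client certificate fails.
    static SSLParameters requireClientCertificate(SSLContext context) {
        SSLParameters params = context.getDefaultSSLParameters();
        params.setNeedClientAuth(true);
        return params;
    }

    // Convenience variant that assumes the JVM's default SSLContext.
    static SSLParameters forDefaultContext() {
        try {
            return requireClientCertificate(SSLContext.getDefault());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    public static void main(String[] args) {
        SSLParameters params = forDefaultContext();
        System.out.println("needClientAuth = " + params.getNeedClientAuth());
    }
}
```

The resulting parameters can be applied to an `SSLServerSocket` or `SSLEngine` through `setSSLParameters`. The server's trust store then has to contain the certificate authority (or the certificate itself) that vouches for each client.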

Read on →

Spring Social Bootstrap: Create REST API SDKs and CLIs that can Record and Replay HTTP requests

I joined SportsLabs (then still under the Silver Chalice brand) way back in 2011 as one of its earliest employees and the first engineer.

We started work on envisioning and building the Advanced Media Platform - a system to ingest, process, transform, distribute, and stream sports, news, social, and media content to create market leading mobile, web, and social products for clients such as Samsung, the University of Notre Dame, the ACC, the College Football Playoff, IMG College, the Mountain West and Campus Insiders, among others.

Since then, SportsLabs has consumed data from dozens of sources including STATS LLC, Twitter, and Ooyala, but also from proprietary systems that were never foreseen as integration points.

Data providers’ APIs use combinations of JSON, XML and/or CSV. Some are spec-compliant, others are not. Some rely heavily on query parameters, while others favor HTTP headers. Some API providers use OAuth 2.0 plus API rate limits, while others have rolled their own security solutions. Some integrations were with partners willing to work with us on evolving their web services. Others were with competitors who were not motivated to make things easy.

This plethora of ways to configure, consume, learn from, and integrate with APIs led us to create Spring Social Bootstrap, a family of projects intended to aid creating and managing API clients for many of the above scenarios.

Spring Social Bootstrap comprises the following:

Read on →