I’m Old, Part LXXXII: Making Friends

At Atalasoft, we frequently tapped into local colleges to fill paid internships, and we ended up hiring a fair percentage of those students once they had finished their degrees. One of our first interns was Sean McKenna, who came to us from UMass Amherst. At the time he was brought on, the company was barely in single digits and crammed into an office in an old mill building in Northampton.

One day, we were having a conversation about beer pong and other drinking games. I discovered from Sean that at UMass, they called it ‘Beirut’ instead of beer pong. Sean and Dave Cilley started getting into a friendly argument over the point of drinking games in general and beer pong specifically. At one point, Sean said, “Beirut isn’t about drinking; it’s about making friends.” Dave immediately retorted, deadpan, “I don’t need friends.” That put the argument to bed. All of us doubled over in laughter except Dave, who walked away.

This event entered into our collective lore and would come up every year or so.

PDF Redaction

In January 2019, lawyers for Paul Manafort released a “redacted” PDF that allowed one to easily read the redacted text. I will not speculate as to how this could have happened beyond the chestnut, “Never ascribe to malice that which can adequately be explained by incompetence.” Instead, I will go into detail as to why redaction in PDF is a difficult problem to solve properly, even setting pilot error aside.

To begin with, let’s start with the elements of content in a PDF. PDF was designed to represent all that could be represented on a printed page. There are three basic elements that can appear in a PDF: paths, bit-mapped images, and text. Paths are built from straight line segments and Bezier curves (rectangles are just shorthand for a closed path of four line segments). Paths can be stroked, filled, both, or used as a clipping mask that cuts out other elements.

These elements are represented in a postfix program within the file. One technique to display a page is to execute the program and collect all the elements into a display list, which can then be rendered onto a screen or into print.
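To make the postfix idea concrete, here is a minimal sketch of an interpreter that runs a toy content stream and collects a display list. It only understands a handful of real PDF operators (re, f, S, Tj) and a simplified string syntax; the function name build_display_list and the tuple layout of the display list are my own inventions for illustration, not anything from a real PDF library.

```python
import re

# Tokenizer for a tiny subset of content-stream syntax (an assumption of this
# sketch): literal strings in parentheses with no escapes or nesting,
# plain numbers, and alphabetic operators.
TOKEN = re.compile(
    r"\((?P<str>[^()]*)\)|(?P<num>[-+]?\d*\.?\d+)|(?P<op>[A-Za-z*]+)"
)


def build_display_list(stream):
    """Run a postfix content stream and return a list of drawable elements."""
    stack = []          # operands waiting for the operator that follows them
    path = []           # path pieces accumulated until a paint operator
    display_list = []
    for match in TOKEN.finditer(stream):
        if match.group("str") is not None:
            stack.append(match.group("str"))
        elif match.group("num") is not None:
            stack.append(float(match.group("num")))
        else:
            op = match.group("op")
            if op == "re":                  # x y w h re: add a rectangle
                x, y, w, h = stack[-4:]
                path.append(("rect", x, y, w, h))
            elif op in ("f", "S"):          # fill or stroke the current path
                kind = "fill" if op == "f" else "stroke"
                display_list.append((kind, list(path)))
                path.clear()
            elif op == "Tj":                # (string) Tj: show text
                display_list.append(("text", stack[-1]))
            # Every other operator (BT, ET, Td, ...) is ignored here.
            stack.clear()
    return display_list


page = "10 20 100 50 re f BT 72 700 Td (Hello) Tj ET"
elements = build_display_list(page)
```

Operands pile up on the stack until an operator consumes them, which is all "postfix" means here. A real interpreter also tracks graphics state, fonts, transforms, and images, all of which this sketch leaves out.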

The problem here is that each of those three elements can contain something that is visually text. To be able to perfectly redact a PDF, there are two approaches. The first is to turn the page into just an image, paint over the text to be redacted, and create a new document with only the image on the page. This is a cheesy approach and it works, but there is a cost. Any of the text that was actual text is no longer text and can’t be selected or indexed. Furthermore, the final document is likely to be substantially larger than the original.

To do this in a non-cheesy way, you need to handle each of the elements. Why? Because any of the three elements could render as text that is readable to a human. For example, an image could contain text (or a face). A set of paths could be a logo or the shapes of letters. To handle text, you need to be able to iterate over all the text in a page and create an equivalent program on the page that no longer includes any text within the area to be redacted. This is a tricky problem on its own, especially when a text element straddles the edge of a region to be redacted. Next, for images you need to be able to decode all the possible image formats within the PDF: JPG, LZW, CCITT, RLE, JPEG2000, and JBIG2. The latter two are non-trivial to decode. Then you replace the image with a new one with the areas painted over. Finally, there are path items. Ideally, you would remove any paths that intersect redaction areas, but that gets tricky because rewriting paths that are only partially occluded is very hard to do well.
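As a rough illustration of the text case, here is a sketch of splitting a single horizontal text run around a redacted interval. It assumes you already know where the run starts and how wide each glyph is, which in a real PDF means resolving font metrics, the text matrix, and character spacing first. The function redact_run and its parameters are hypothetical names made up for this example.

```python
def redact_run(text, x, advances, redact_left, redact_right):
    """Split one horizontal text run around a redacted x-interval.

    text:      the characters shown by the run
    x:         x position of the first character
    advances:  width of each character, same length as text
    Returns a list of (x, string) runs that survive redaction.
    """
    runs = []
    current = ""    # characters of the surviving run being built
    start = x       # x position where the surviving run begins
    pen = x         # x position of the glyph being examined
    for ch, adv in zip(text, advances):
        glyph_left, glyph_right = pen, pen + adv
        # A glyph is dropped if any part of it lies inside the redaction.
        overlaps = glyph_left < redact_right and glyph_right > redact_left
        if overlaps:
            if current:
                runs.append((start, current))
            current = ""
        else:
            if not current:
                start = glyph_left
            current += ch
        pen = glyph_right
    if current:
        runs.append((start, current))
    return runs


# Twelve glyphs, each 10 units wide; redact the interval covering "555".
pieces = redact_run("call 555 now", 0, [10] * 12, 50, 80)
```

One text-showing operation turns into two, each of which needs its own positioning so the surviving text does not slide into the hole. The important part is that the redacted characters are gone from the output entirely rather than merely covered by a black box.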

In addition to these basic elements, there are others including composite objects, layers, annotations – any of which can be manipulated to appear as text or other information that should be redacted.

When I was working at Atalasoft on PDF tools, I considered the task of redaction and chose to pass on it. It lost us a sale or two, but I would rather lose a sale than get blamed for an incorrectly redacted document.

The commercial version of Adobe Acrobat has redaction tools built in and these do the job quite well. Best to depend on that for the time being.

I’m Old, Part LXXXI: Notes

When I worked for Newfire in the late ’90s, there were a number of things we did that were clearly designed to make the company more enticing for purchase. We had continuous integration, source code control, QA, marketing, etc.

In addition, there was the issue of intellectual property. We were implementing some really interesting things in terms of 3D code and a bunch of it was novel. We had patent attorneys come in and interview us and we applied for a stack of patents on our technology. In addition, my boss Marty gave every engineer a notebook with two very special criteria: they were bound and the pages were numbered. This would make it clear if the notebooks had been altered.

If we came up with any good ideas, we were encouraged to write them down with the date, and if they were particularly good, we had to put the book in front of another engineer who signed and dated the entry as a witness. Although it is possible to forge the books, this was supposed to provide a modicum of protection against intellectual property issues.

My book had a bunch of inventions in it that I really enjoyed. One was a process to turn a polygon into a minimal set of triangles, the goal being to turn text into meshes or indexed face sets. Another was a programming language built on finite-state automata for defining object behaviors in games. It was designed to be easy to read, interoperate with our game engine, and to be trivial to JIT compile.

Years later, at Atalasoft, I carried this over. I ordered a stack of notebooks and whenever we brought in a new engineer, I gave them a notebook with the same instructions I was given.

What I think I really liked about the process was seeing how each engineer used the notebooks. It was an interesting reflection on their working styles. Some were planners and very much filled their notebooks with everything. Some were journalers and just wrote down a few bullet points for the day. Some treated them as save game slots: they wrote just enough to make it easier to pick up the next day. Some didn’t really use the notebooks at all. I was totally OK with all of those approaches.