In January 2019, lawyers for Paul Manafort released a “redacted” PDF in which the redacted text could easily be read. I will not speculate as to how this could have happened beyond the chestnut, “Never ascribe to malice that which can adequately be explained by incompetence.” Instead, I will go into detail as to why redaction in PDF is a difficult problem to solve properly, even setting pilot error aside.
To begin with, let’s look at the elements of PDF content. PDF was designed to represent anything that could be represented on a printed page. There are three basic elements that can appear in a PDF: paths, bit-mapped images, and text. Paths are built from straight line segments and Bezier curves (rectangles are available too, but they are just shorthand for a path made of lines). Paths can be stroked, filled, both, or used as a clipping mask that cuts away parts of other elements.
These elements are represented in a postfix program within the file. One technique to display a page is to execute the program and collect all the elements into a display list, which can then be rendered onto a screen or into print.
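To make the postfix idea concrete, here is a minimal sketch in Python of an interpreter that turns a toy content stream into a display list. It handles only a handful of real PDF operators (`re`, `f`, `S`, `BT`, `Td`, `Tj`) in a heavily simplified form; the tokenizer, the tuple layout of the display list, and the name `build_display_list` are assumptions made for illustration, not how any particular PDF library works.

```python
import re

# Simplified tokenizer: (string literals) without nesting or escapes,
# plain numbers, and operator names. Real PDF syntax is much richer.
TOKEN = re.compile(
    r"\((?P<str>[^()]*)\)|(?P<num>-?\d+(?:\.\d+)?)|(?P<op>[A-Za-z*]+)"
)

def build_display_list(program):
    """Run a toy postfix content stream and return the drawn elements."""
    stack = []             # operands wait here until an operator consumes them
    display_list = []      # elements in paint order
    pending_path = []      # path segments built up before a paint operator
    text_pos = (0.0, 0.0)  # start of the current text line
    for m in TOKEN.finditer(program):
        if m.group("str") is not None:
            stack.append(m.group("str"))
        elif m.group("num") is not None:
            stack.append(float(m.group("num")))
        else:
            op = m.group("op")
            if op == "re":            # x y w h re: add a rectangle to the path
                x, y, w, h = stack[-4:]
                del stack[-4:]
                pending_path.append(("rect", x, y, w, h))
            elif op in ("f", "S"):    # fill or stroke the current path
                mode = "fill" if op == "f" else "stroke"
                display_list.append(("path", mode, pending_path))
                pending_path = []
            elif op == "BT":          # begin text: reset the text position
                text_pos = (0.0, 0.0)
            elif op == "Td":          # tx ty Td: move the text position
                tx, ty = stack[-2:]
                del stack[-2:]
                text_pos = (text_pos[0] + tx, text_pos[1] + ty)
            elif op == "Tj":          # (string) Tj: show a string
                display_list.append(("text", text_pos, stack.pop()))
            # ET and any other operator are ignored in this sketch
    return display_list

# Text drawn first, then a filled rectangle painted on top of it.
page = "BT 10 20 Td (Secret) Tj ET 0 0 100 50 re f"
for element in build_display_list(page):
    print(element)
```

Note what the display list contains for this page: the text element is still there, followed by the rectangle that covers it. Painting over something hides it on screen but does not remove it from the program.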
The problem here is that each of those three elements can contain something that is visually text. There are two approaches to redacting a PDF perfectly. The first is to turn the page into just an image, paint over the text to be redacted, and create a new document with only the image on the page. This is a cheesy approach and it works, but there is a cost. Any of the text that was actual text is no longer text and can’t be selected or indexed. Furthermore, the final document is likely to be substantially larger than the original.
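The painting step of the image-only approach is the easy part. As a rough sketch in Python, assume the page has already been rasterized by some renderer (not shown) into rows of grayscale pixel values; the function name `paint_over` and the box layout are hypothetical choices for this example.

```python
def paint_over(pixels, box, color=0):
    """Return a copy of a rasterized page with the box filled in.

    pixels: list of rows, each a list of grayscale values (0 = black).
    box: (left, top, right, bottom) in pixel coordinates,
         with right and bottom exclusive.
    """
    left, top, right, bottom = box
    redacted = [row[:] for row in pixels]  # leave the input untouched
    # Clamp the box to the page so an oversized box cannot go out of range.
    for y in range(max(top, 0), min(bottom, len(redacted))):
        row = redacted[y]
        for x in range(max(left, 0), min(right, len(row))):
            row[x] = color
    return redacted
```

Because the output is nothing but pixels, whatever was under the box is truly gone, and so is every selectable character on the rest of the page.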
To do this in a non-cheesy way, you need to handle each of the elements. Why? Because any of the three elements could render as text that is readable to a human. For example, an image could contain text (or a face). A set of paths could be a logo or the shapes of letters. To handle text, you need to be able to iterate over all the text in a page and create an equivalent program on the page that no longer includes any text within the area to be redacted. This is a tricky problem on its own, especially when a text element only partly overlaps a region to be redacted. Next, for images you need to be able to decode all the possible image formats within PDF: JPEG, LZW, Flate, CCITT, RLE, JPEG2000, and JBIG2. The latter two are non-trivial to decode. Then you replace each image with a new one that has the redacted areas painted over. Finally, there are path items. Ideally, you would remove any paths that intersect redaction areas, but rewriting paths that are only partially occluded is a very hard problem to do well.
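To show why text that straddles a redaction area is awkward, here is a small Python sketch that splits one text run into the pieces that survive. It assumes every glyph has the same width and height, which is a deliberate simplification: real fonts have per-glyph widths, and real text can be rotated, scaled, or kerned. The function name `split_text_run` and its parameters are hypothetical.

```python
def split_text_run(text, x, y, glyph_width, glyph_height, redact):
    """Split one text run into the sub-runs that survive a redaction box.

    Assumes a fixed advance per glyph (an assumption for this sketch).
    redact is (left, bottom, right, top) in page coordinates.
    Returns a list of (x, y, substring) runs to emit in place of the original.
    """
    left, bottom, right, top = redact
    runs = []
    current = ""
    start_x = x
    for i, ch in enumerate(text):
        gx = x + i * glyph_width
        # Standard rectangle overlap test between the glyph box and the box.
        overlaps = (gx < right and gx + glyph_width > left and
                    y < top and y + glyph_height > bottom)
        if overlaps:
            # Close the run in progress; the redacted glyph is dropped.
            if current:
                runs.append((start_x, y, current))
            current = ""
        else:
            if not current:
                start_x = gx  # a new surviving run starts at this glyph
            current += ch
    if current:
        runs.append((start_x, y, current))
    return runs

print(split_text_run("Name: Alice Smith, age 40", 0, 0, 10, 12, (60, 0, 170, 12)))
```

One run goes in and two come out, each needing its own position in the rewritten page program. Even in this toy version, the redacted characters have to be dropped from the output rather than covered.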
In addition to these basic elements, there are others, including composite objects, layers, and annotations, any of which can be manipulated to appear as text or other information that should be redacted.
When I was working at Atalasoft on PDF tools, I considered the task of redaction and chose to pass on it. It lost us a sale or two, but I would rather lose a sale than get blamed for an incorrectly redacted document.
The commercial version of Adobe Acrobat has redaction tools built in and these do the job quite well. Best to depend on that for the time being.