When I started at Atalasoft, there were a number of features that were in broad use, but needed love and attention to grow. We had a small PdfEncoder class written by Glenn. It could take a single image and turn it into a single page PDF. It had some very simple options that would allow you to set the page size, center the image on the page, fill the entire page with the image, and so on.
It had so many places where it needed to grow. Since I had worked for Adobe and had worked on Acrobat, I was selected to implement these features. Fine. Time to dust off a copy of the PDF spec and get the knowledge out of my long-term swap space. I looked at the code that implemented the PdfEncoder and I was impralled. This is a combination of impressed and appalled. Glenn had done a fine job matching the spec, but the code was essentially just a set of print calls to the output stream, with some code to grab the current output stream position to later build the cross-reference table. It did what it needed to do in a terse, prosaic way, but it was fragile as hell. PDF is a sort of object-oriented file format that depends very heavily on the absolute location of objects in the file. The definition of the page is an object. The contents of the page is an object. The tree that refers to all the pages is an object. Each time Glenn started to write an object, he would store the object number in a private member variable so he could point to it later when he wrote the cross-reference table. Every time a new feature was added that required more objects, the code just got worse and worse. I think I did one or two features this way, then I asked Bill for some time to do this “right”.
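To make the fragility concrete, here is a minimal Python sketch (hypothetical, not the actual C# code) of what every hand-rolled PDF writer ends up doing: recording the absolute byte offset of each object so the cross-reference table at the end of the file can point back at it. Lose track of one offset, or shift one byte, and the file is broken.

```python
import io

def write_pdf_skeleton():
    """Write a minimal one-page PDF, tracking object offsets by hand."""
    out = io.BytesIO()
    out.write(b"%PDF-1.4\n")
    offsets = {}  # object number -> absolute byte offset in the file

    def write_object(num, body):
        offsets[num] = out.tell()  # the absolute position is what matters
        out.write(f"{num} 0 obj\n{body}\nendobj\n".encode("ascii"))

    write_object(1, "<< /Type /Catalog /Pages 2 0 R >>")
    write_object(2, "<< /Type /Pages /Kids [3 0 R] /Count 1 >>")
    write_object(3, "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>")

    # The cross-reference table: one entry per object, offsets
    # zero-padded to 10 digits, plus the mandatory free entry 0.
    xref_start = out.tell()
    out.write(f"xref\n0 {len(offsets) + 1}\n".encode("ascii"))
    out.write(b"0000000000 65535 f \n")
    for num in sorted(offsets):
        out.write(f"{offsets[num]:010d} 00000 n \n".encode("ascii"))
    out.write(f"trailer\n<< /Size {len(offsets) + 1} /Root 1 0 R >>\n"
              f"startxref\n{xref_start}\n%%EOF\n".encode("ascii"))
    return out.getvalue()
```

With three objects this is manageable; with a member variable per object number, every new feature multiplies the bookkeeping, which is exactly the rot described above.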
I started creating a set of objects that represented PDF entities. I did this in two primary ways. The first was to create an interface, IPdfStreamable, which, if an object implemented it, would write the object to a stream as PDF. Then, since C# doesn’t let you apply interfaces to existing types (see extension protocols in Swift), I wrote a general-purpose Write method that took an object and a Stream. If the object was IPdfStreamable, it went to that. If it was one of the built-in types (string, double, float, int, bool, etc.), it would vector to methods to stream those types. I had code to automatically handle the PDF way of referencing objects: you let the cross-reference table know that an object needed to be a reference and it would automatically be written as a reference and not inline.
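The dispatch described above can be sketched in Python (names hypothetical; the original was a C# IPdfStreamable interface and Write method): objects that know how to serialize themselves do so, and built-in types are routed to type-specific writers.

```python
import io
from typing import Any

class PdfName:
    """An object that knows how to stream itself (the IPdfStreamable case)."""
    def __init__(self, name):
        self.name = name
    def pdf_write(self, stream):
        stream.write(f"/{self.name}")

def write_pdf_value(value: Any, stream) -> None:
    """Generic writer: dispatch on the object's own ability, then on type."""
    if hasattr(value, "pdf_write"):        # self-streaming objects first
        value.pdf_write(stream)
    elif isinstance(value, bool):          # bool before int: bool subclasses int
        stream.write("true" if value else "false")
    elif isinstance(value, (int, float)):
        stream.write(repr(value))
    elif isinstance(value, str):
        stream.write(f"({value})")         # PDF literal string syntax
    elif isinstance(value, list):
        stream.write("[")
        for i, item in enumerate(value):
            if i:
                stream.write(" ")
            write_pdf_value(item, stream)  # arrays recurse over their items
        stream.write("]")
    else:
        raise TypeError(f"don't know how to stream {type(value)!r}")
```

For example, `write_pdf_value([PdfName("Type"), PdfName("Page"), 612, True], stream)` produces `[/Type /Page 612 true]`.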
I made a base class to represent the PDF dictionary type and did something super sneaky: I made it mirror C# classes. I put an attribute on every object property which signaled that it was part of the PDF specification. The attribute encapsulated information from the spec, including which version of PDF it belonged to, whether or not it was required, what its default value was (if any), and whether it had to be a reference to an object. Then I implemented IPdfStreamable on it and put in code to handle most of the reasonable defaults. Now I could practically type the PDF spec in as C# code and it was automatically written for me. Sweet.
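A Python sketch of that attribute-mirroring trick (all names invented for illustration): per-property spec metadata drives a generic writer, so "typing in the spec" is most of the work and the base class handles required checks and defaults.

```python
import io

class PdfField:
    """Spec metadata for one dictionary key (analog of the C# attribute)."""
    def __init__(self, key, required=False, default=None, version="1.0"):
        self.key, self.required = key, required
        self.default, self.version = default, version

class PdfDictionary:
    """Base class: a generic writer walks the declared fields."""
    FIELDS = {}  # attribute name -> PdfField, filled in by subclasses

    def pdf_write(self, stream):
        stream.write("<<")
        for attr, spec in self.FIELDS.items():
            value = getattr(self, attr, spec.default)
            if value is None:
                if spec.required:
                    raise ValueError(f"/{spec.key} is required by the spec")
                continue  # optional keys left unset are simply omitted
            stream.write(f" /{spec.key} {value}")
        stream.write(" >>")

class Page(PdfDictionary):
    """'Typing in the spec': declare the keys, inherit the serializer."""
    FIELDS = {
        "type":     PdfField("Type", required=True, default="/Page"),
        "rotation": PdfField("Rotate"),  # optional, omitted if unset
    }
```

Writing `Page()` yields `<< /Type /Page >>`, and a required key with no value fails loudly instead of producing a broken file.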
After 3 weeks of this, I rewrote the existing PdfEncoder in terms of the new objects. Adding new features was way easier. Encode multiple images into a PDF? Easy. Support for more esoteric image types? Done. Full support for document metadata? Got it. ICC color profiles? Tricky, but doable. Pretty soon we had some really nice example code – taking a directory full of images and turning it into a single PDF was now trivial. Or how about reading directly from a scanner to PDF? Not that much different, to tell the truth.
We had an OCR engine that we worked with that had an option for PDF output. It worked, but it looked cruddy. Part of the issue was that the PDF encoding didn’t happen until late in the process, so the image was always 1 bit per pixel no matter what it was scanned at. And they charged a lot for it.
I wrote a new PDF output module for our veneer over the OCR engine, and since we owned the image pipeline, I could make sure that the output image in the PDF matched the quality the end user wanted. And it was all based on the same code used for the PdfEncoder.
Over the years, I grew the functionality of the underlying code to include consuming PDF. My initial cut of public functionality was an object that represented a PDF document in a very coarse sense. You got a collection of flyweight pages, document metadata, and not much else. But you could rip pages out of the document and put them into another document. You could combine multiple documents. You could reorder pages. These were all things that our customers wanted, and we made them easy to use, easy to understand, and performant. It was hell to get right. When you executed the “save” method, the work under the hood was akin to picking up a huge fishing net at one knot, clipping some sections, sewing in others, and setting it down somewhere else perfectly folded. There were bugs. There were problems with crappy PDF files from well-intentioned people. But it worked pretty damn well. I think that release also included support for PDF/A (archival PDF) thanks to Rick Minerich.
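The page-surgery API described here can be imagined along these lines (a hypothetical Python sketch, not the real DotPdf API): because pages are flyweights, splitting, combining, and reordering are cheap list operations, and all the fishing-net work is deferred to save time.

```python
class PdfDocument:
    """Coarse document model: a list of flyweight page handles."""
    def __init__(self, pages=None):
        self.pages = list(pages or [])

    @classmethod
    def combine(cls, *docs):
        """Concatenate several documents' pages into a new document."""
        merged = cls()
        for d in docs:
            merged.pages.extend(d.pages)
        return merged

    def extract(self, start, stop):
        """Rip pages [start, stop) out into a new document."""
        piece = PdfDocument(self.pages[start:stop])
        del self.pages[start:stop]
        return piece

    def reorder(self, order):
        """Reorder pages by index; cheap because pages are flyweights."""
        self.pages = [self.pages[i] for i in order]
```

Only at save time would the underlying objects actually be walked, renumbered, and rewritten, which is where all the difficulty described above lived.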
Our customers were pleased with this. How do I know? Because they kept asking for more features. I prioritized them and added them in. Because I had meta information about the document structure from the PDF spec, I put in code that could find, classify, and automatically repair shit PDF.
All this time, I hid my library from our users. We could have made it all public which is what iText does, but when you do that you require your customers to understand the PDF spec and trust me when I say that that is always a mistake. The burden on support alone would have been egregious.
As the toolset grew, I added in annotation support, and we now had hooks into our DotAnnotate product. We could read and write annotations to and from PDF files and let users view/create/edit them. Nice. I put a veneer over that code into the PDF document model as well.
One thing missing from the picture was generating and editing PDF. This is not a small task, because now we had to have a model of PDF available to our customers that they could work with while still not knowing the PDF spec. I put cushy abstractions onto the core types and created a means of making documents from whole cloth with just about all the PDF elements available. I created a shape abstraction and a bunch of default shapes to make it easy to plop down rectangles, ellipses, Bezier curves, and so on. Kevin Hulse wrote text objects onto these abstractions, including a text formatting engine and code that could run text along an arbitrary path.
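A rough Python sketch of a shape abstraction like the one described (class names hypothetical; `re`, `f`, `m`, `c`, and `S` are real PDF content-stream operators): each shape emits its own drawing operators, and a page's content is just their concatenation.

```python
class Rectangle:
    """Emits 're' (append rectangle to path) followed by 'f' (fill)."""
    def __init__(self, x, y, w, h):
        self.x, self.y, self.w, self.h = x, y, w, h

    def to_operators(self):
        return f"{self.x} {self.y} {self.w} {self.h} re f"

class BezierCurve:
    """Emits 'm' (moveto) and 'c' (curveto), then 'S' (stroke),
    for one cubic segment with control points p1 and p2."""
    def __init__(self, p0, p1, p2, p3):
        self.p0, self.p1, self.p2, self.p3 = p0, p1, p2, p3

    def to_operators(self):
        (x0, y0) = self.p0
        (x1, y1), (x2, y2), (x3, y3) = self.p1, self.p2, self.p3
        return f"{x0} {y0} m {x1} {y1} {x2} {y2} {x3} {y3} c S"

def page_content(shapes):
    """A page's content stream is just the shapes' operators in order."""
    return "\n".join(s.to_operators() for s in shapes)
```

The point of the abstraction is that the customer says "rectangle at (x, y)" and never sees the operator soup underneath.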
To check on performance, I wrote some code that took the text from Moby Dick from Project Gutenberg and formatted it into a book complete with drop caps at the chapters and page numbers. The code was not complicated and it formatted and wrote the book in about a second, if I recall correctly.
I put in better support for annotations including custom appearances and actions. For grins, I wrote some code that created a document with a button annotation labeled “Pull My Finger” that when you clicked on the button played a fart noise. Probably the best use for PDF ever. I sent the document to Elaine and waited for the laughter across the office. I also wrote some code that took an image and created a colored rectangle annotation on the page for every pixel. It was tens of thousands of annotations. It took a few seconds to write and my code could round trip it in similar time. Acrobat took minutes to open the file.
I made it so it was possible to round trip documents created by this code, so now we had limited editing. If a page had been made by my tool, I could recreate the nice, fluffy objects for you. I started putting into place the infrastructure to make all PDF editable, and started exposing that through some internal tools that could now do text extraction. I think it took me a few weeks to do. The original Wordy algorithm by Daryoush Paknad took significantly longer to create, but he was inventing it and he had to work in C. I stood on his shoulders and did the work in C#. For grins, I built the code using a page quadtree subdivision algorithm that had automatic annealing of split words based on font position and similarity. And again, it ran very well.
One internal demo I did was to write an interactive drawing app. It gave you a view of a page and let you sketch out shapes on the page and then write them out. And I wrote it in the worst possible way: when you started to draw a shape, it would translate the mouse movements into PDF, write a PDF document, then use our PDF renderer to draw it in the window. It could do this and the UI felt perfectly fluid until you got about a dozen objects on the page, then it started to feel mushy. Again, I was impralled.
The last bit of technology that went into the code was digital signatures. I had put off that bit of code for a very long time because I was trying to take my own advice to never write time management code, never get in a land war in Asia, and never write digital signature code.
Honestly, it was a hard decision to walk away from this product. It was reasonably easy to use; at the high level it was nearly impossible to write a bad PDF (and there was some preflight code that let you catch errors); it wasn’t particularly huge; and it ran well even though it was all in C#.
In many ways, it felt like the Master Control Program from Tron, but you know, without all the evil overlord stuff. DotPdf started out as a small tool to make my life easier, but grew and invaded the rest of our product line. At the end of my tenure, there were 4 or 5 products that all had significant features rooted in DotPdf or its underlying library. I will always be proud of that code.