Tools for Publishing Source Code via OCR

PREFACE
-------

This book grew out of a project to publish source code for cryptographic software, namely PGP (Pretty Good Privacy), a software package for the encryption of electronic mail and computer files. PGP is the most widely used software in the world for email encryption. Pretty Good Privacy, Inc (or "PGP") has published the source code of PGP for peer review, a long-standing tradition in the history of PGP. The first time a fully implemented cryptographic software package was published in its entirety in book form was "PGP Source Code and Internals," by Philip Zimmermann, published by The MIT Press, 1995, ISBN 0-262-24039-4.

Peer review of the source code is important to get users to trust the software, since any weaknesses can be detected by knowledgeable experts who make the effort to review the code. But peer review cannot be completely effective unless the experts conducting the review can compile and test the software, and verify that it is the same as the software products that are published electronically. To facilitate that, PGP publishes its source code in printed form that can be scanned into a computer via OCR (optical character recognition) technology.

Why not publish the source code in electronic form? As you may know, cryptographic software is subject to U.S. export control laws and regulations. The new 1997 Commerce Department Export Administration Regulations (EAR) explicitly provide that "A printed book or other printed material setting forth encryption source code is not itself subject to the EAR" (see 15 C.F.R. §734.3(b)(2)). PGP, in an overabundance of caution, has only made available its source code in a form that is not subject to those regulations. So, books containing cryptographic source code may be published, and after they are published they may be exported, but only while they are still in printed form.

Electronic commerce on the Internet cannot fully be successful without strong cryptography. Cryptography is important for protecting our privacy, civil liberties, and the security of our personal and business transactions in the information age. The widespread deployment of strong cryptography can help us regain some of the privacy and security that we have lost due to information technology. Further, strong cryptography (in the form of PGP) has already proven itself to be a valuable tool for the protection of human rights in oppressive countries around the world, by keeping those governments from reading the communications of human rights workers.

This book of tools contains no cryptographic software of any kind, nor does it call, connect to, or integrate in any way with cryptographic software. But it does contain tools that make it easy to publish source code in book form, and to scan such source code back in with OCR software rapidly and accurately.

Philip Zimmermann
prz@acm.org
November 1997


INTRODUCTION
------------

This book contains tools for printing computer source code on paper in human-readable form and reconstructing it exactly using automated tools. While standard OCR software can recover most of the graphic characters, non-printing characters like tabs, spaces, newlines and form feeds cause problems. In fact, these tools can print any ASCII text file; it's just that the attention these tools pay to spacing is particularly valuable for computer source code. The two-dimensional indentation structure of source code is very important to its comprehensibility.

In some cases, distinctions between non-printing characters are critical: the standard make utility will not accept spaces where it expects to see a tab character. Producing a byte-for-byte identical copy of the original is also valuable for authentication, as you can verify a checksum.

There are five problems we have addressed:

1. Getting good OCR accuracy.
2. Preserving whitespace.
3. Preserving lines longer than can be printed on the page.
4. Dealing with data that isn't human-readable.
5. Detecting and correcting any residual errors.

The first problem is partly addressed by using a font designed for OCR purposes, OCR-B. OCR-A is a very ugly font that contains only the digits 0 through 9 and a few special punctuation symbols. OCR-B is a very readable monospaced font that contains a full ASCII set, and has been popular as a font on line printers for years because it distinguishes ambiguous characters and is clear even if fuzzy or distorted. The most unusual thing about the OCR-B font is the way that it prints a lower-case letter l, with a small hook on the bottom, something like an upper-case L. This is to distinguish it from the numeral 1. We also made some modifications to the font, to print the numeral 0 with a slash, and to print the vertical bar in a broken form. Both of these are such common variants that they should not present any intelligibility barrier. Finally, we print the underscore character in a distinct manner that is hopefully not visually distracting, but is clearly distinguishable from the minus sign even in the absence of a baseline reference.

The most significant part of getting good OCR accuracy is, however, using the OCR tools well. We've done a lot of testing and experimentation and present here a lot of information on what works and what doesn't.

To preserve whitespace, we added some special symbols to display spaces, tabs, and form feeds. A space is printed as a small triangular dot character, while a hollow rightward-pointing triangle (followed by blank spaces to the right tab stop) signifies a tab. A form feed is printed as a yen symbol, and the printed line is broken after the form feed. Making the dot triangular instead of square helps distinguish it from a period. To reduce the clutter on the page and make the text more readable, the space character is only printed as a small dot if it follows a blank on the page (a tab or another space), or comes immediately before the end of the line. Thus, the reader (human or software) must be able to distinguish one space from no spaces, but can find multiple spaces by counting the dots (and adding one).

The format is designed so that 80 characters, plus checksums, can be printed on one line of an 8.5x11" (or A4) page, the still-common punched-card line length. Longer lines are managed with the simple technique of appending a big ugly black blob to the first part of the line, indicating that the next printed line should be concatenated with the current one with no intervening newline. Hopefully, its use is infrequent.

While ASCII text is by far the most popular form, some source code is not readable in the usual way. It may be an audio clip, a graphic image bitmap, or something else that is manipulated with a specialized editing tool. For printing purposes, these tools just print any such files as a long string of gibberish in a 64-character set designed to be easy to OCR unambiguously. Although the tools recognize such binary data and apply extra consistency checks, that can be considered a separate step.
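
To make the whitespace conventions above concrete, here is a minimal perl sketch of the encoding direction. It is an illustration only, not the munge program itself: it assumes a tab width of 8, uses the Latin-1 stand-ins that the SCANNING chapter later trains OmniPage to produce (0244 for the tab marker, 0267 for a visible space), and leaves out checksums, page headers, form feeds, long-line breaking and binary files entirely.

    #!/usr/bin/perl -w
    # Sketch of the visible-whitespace rules described above.  Not munge:
    # no checksums, no page headers, no form feeds, no long lines, no
    # radix-64; tab width of 8 is assumed.
    use strict;

    my $TABWIDTH = 8;

    while (defined(my $line = <>)) {
        chomp $line;
        my @chars = split //, $line;
        my ($out, $col) = ('', 0);
        for my $i (0 .. $#chars) {
            my $c = $chars[$i];
            if ($c eq "\t") {
                # tab marker, padded with blanks to the next tab stop
                my $pad = $TABWIDTH - ($col % $TABWIDTH);
                $out .= "\xA4" . (' ' x ($pad - 1));
                $col += $pad;
            } elsif ($c eq ' ') {
                # a space is made visible only if it follows another
                # blank or is the last character on the line
                my $prev = $i ? $chars[$i - 1] : '';
                my $last = ($i == $#chars);
                $out .= ($prev eq ' ' || $prev eq "\t" || $last) ? "\xB7" : ' ';
                $col++;
            } else {
                $out .= $c;
                $col++;
            }
        }
        print "$out\n";
    }

Running a file through something like this and comparing it with real munge output is a reasonable way to convince yourself you understand the format, but the output of the actual tools is authoritative.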

Finally, the problem of residual errors arises. OCR software is not perfect, and uses a variety of heuristics and spelling-check dictionaries to clean up any residual errors in human-language text. This isn't reliable enough for source code, so we have added per-page and per-line checksums to the printed material, and a series of tools that use those checksums to correct any remaining errors and convert the scanned text into a series of files again.

This "munged" form is what you see in most of the body of this book. We think it does a good job of presenting source code in a way that can be read easily by both humans and computers.

The tools are command-line oriented and a bit clunky. This has a purpose beyond laziness on the authors' parts: it keeps them small. Keeping them small makes the "bootstrapping" part of scanning this book easier, since you don't have the tools to help you with that.


SCANNING
--------

Our tests were done with OmniPage 7.0 on a Power Macintosh 8500/120 and an HP ScanJet 4c scanner with an automatic document feeder. The first part of this chapter is heavily OmniPage-specific, as that appears to be the most widely available OCR software. The tools here were developed under Linux, and should be generally portable to any Unix platform. Since this book is about printing and scanning source code, we assume the readers have enough programming background to know how to build a program from a Makefile, understand the hazards of CR, LF or CRLF line endings, and handle such minor details without explicit mention.

The first step to getting OmniPage 7 to work well is to set it up with options that disable all of its more advanced features for preserving font changes and formatting. Look in the Settings menu.

· Create a Zone Contents File with all of ASCII in it, plus the extra bullet, currency, yen and pilcrow symbols. Name it "Source Code".
· Create a Source Code style set. Within it, create a Source Code zone style and make it the default.
· Set the font to something fixed-width, like Courier.
· Set a fixed font size (10 point) and plain text, left-aligned.
· Set the tab character to a space.
· Set the text flow to hard line returns.
· Set the margins to their widest.
· The font mapping options are irrelevant.

Go to the settings panel and:

· Under Scanner, set the brightness to manual. With careful setting of the threshold, this generates much better results than either the automatic threshold or the 3D OCR. Around 144 has been a good setting for us; you may want to start there.
· Under OCR, you'll build a training file to use later, but turn off automatic page orientation and select your Source Code style set in the Output Options. Also set a reasonable reject character. (For our tests, we used the pi symbol, which came across from the Macintosh as a weird sequence, but you can use anything as long as you make the appropriate definition in subst.c.)

Do an initial scan of a few pages and create a manual zone encompassing all of the text. Leave some margin for page misalignment, and leave space on the sides for the left-right shift caused by the book binding being in different places on odd and even pages. Set the Zone Contents and the Style Set to the Source Code settings. After setting the Style Set, the Zone Style should be set correctly automatically (since you made Source Code the default). Then save the Zone Template, and in the pop-up menu under the Zone step on the main toolbar you can now select it. Now we're ready to get characters recognized.

The first results will be terrible, with lots of red (unrecognizable) and green (suspicious) text in the recognized window. Some tweaking will improve this enormously.

The first step is setting a good black threshold. Auto brightness sets the threshold too low, making the character outlines bleed and picking up a lot of glitches on mostly-blank pages. Try training OCR on the few pages you've scanned and look at the representative characters. Adjust the threshold so the strokes are clear and distinct, neither so thin that they are broken nor so thick that they smear into each other. The character that bleeds worst is lowercase w, while the underscore and tab symbols have the thinnest lines to worry about. You'll have to re-scan (you can just click the AUTO button) until you get satisfactory results.

The next step is training. You should scan a significant number of pages and teach OmniPage about any characters it has difficulty with. There are several characters that have been printed in unusual ways, which you must teach OmniPage about before it can recognize them reliably. We also have some characters that are unique, which the tools expect to be mapped to specific Latin-1 characters before they can be processed. The characters most in need of training are as follows:

· Zero is printed 'slashed.'
· Lowercase L has a curled tail to distinguish it clearly from other vertical characters like 1 and I.
· The or-bar or pipe symbol '|' is printed "broken", with a gap in the middle, to distinguish it similarly.
· The underscore character has little "serifs" on the ends to distinguish it from a minus sign. We also raised it just a tad higher than the normal underscore character, which was too low in the character cell to be reliably seen by OmniPage.
· Tabs are printed as a hollow right-pointing triangle, followed by blanks to the correct alignment position. If not trained enough, OmniPage guesses this is a capital D. You should train OmniPage to recognize this symbol as a currency symbol (Latin-1 244).
· Any spaces in the original that follow a space, or a blank on the printed page, are printed as a tiny black triangle. You should train OmniPage to recognize this as a center dot or bullet (Latin-1 267). We didn't use a standard center dot because OmniPage confused it with a period.
· Any form feeds in the original are printed as a yen currency symbol (Latin-1 245).
· Lines over 80 columns long are broken after 79 columns by appending a big ugly black block. You should train OmniPage to recognize this as a pilcrow (paragraph symbol, Latin-1 266). We did this because, after deciding something black and visible was suitable, we found out the font we used doesn't have a pilcrow in it.

The zero and the tab character, because of their frequency, deserve special attention. In addition, look for any unrecognized characters (in red) and retrain those pages. If you get an unrecognized character, that character needs training, but Caere says that "good examples" are best to train on, so if the training doesn't recognize a slightly fuzzy K, and there's a nice crisp K available to train on, use that.

Other things that need training:

· ~ (tilde), ^ (caret), ` (backquote) and ' (quote). These get dropped frequently unless you train them.
· i, j and ; (semicolon). These get mixed up.
· 3 and S. These also get mixed up.
· Q can fail to be recognized.
· C and [ can be confused.
· c/C, o/O, p/P, s/S, u/U, v/V, w/W, y/Y and z/Z are often confused. This can be helped by some training.
· r gets confused with c and n.
  I don't understand c, but it happens.
· f gets confused with i.

The OCR training pages have lots of useful examples of troublesome characters. Scan a few pages of material, training each page, then scan a few dozen pages and look for recognition problems. Look for what OmniPage reports as troublesome, and when you have the repair program working, use it to find and report further errors. Train a few pages particularly dense in problems, append the troublesome characters to the training file, then re-recognize the lot.

Double-check your training file for case errors. It's easy to miss the shift key in the middle of a lot of training, and doing so will give terrible results even though OmniPage won't report anything amiss. We spent a while wondering why OmniPage wasn't recognizing capital S or capital W, only to find that OmniPage was just doing what it was trained to do.

We have heard some reports that OmniPage has problems with large training files. We have sometimes observed OmniPage suffering repeatable internal errors after massive training additions, but they were cured by deleting a few training images. Appending more training images to the training file did not cause the problem to re-appear.

Repairing the OCR results

If the only copy of the tools you have is printed in this book, see the next chapter on bootstrapping at this point. Here, we assume that you have the tools and they work.

When you have some reasonable OCR results, delete any directory pages. With no checksum information, they just confuse the postprocessing tools. (The tools will just stop with an error when they get to the "uncorrectable" directory name and you'll have to delete it then, so it's not fatal if you forget.) Copy the data to a machine that has the repair and unmunge utilities on it.

The repair utility attempts automatic table-driven correction of common scanning errors. You have to recompile it to change the tables, but you are encouraged to do so if you find a common problem that it does not correct reliably. If it gets stuck, it will deposit you into your favorite editor on or slightly after the offending line. (The file you will be editing is the unprocessed portion of the input.) After you correct the problem and quit the editor, repair will resume. "Your favorite editor" is taken from the $VISUAL and $EDITOR environment variables, or the -e option to repair.

The repair utility never alters the original input file. It will produce corrected output for file in file.out, and when it has to stop, it writes any remaining uncorrected input back out to file.in (via a temporary file.dump) and lets you edit this file. If you re-run repair on file and file.in exists, repair will restart from there, so you may safely quit and re-run repair as often as you like. (But if you change the input file, you need to delete the .in file for repair to notice the change.) Statistics on repair's work are printed to file.log. This is an excellent place to look to see if any characters require more training.

As it works, repair prints the line it is working on. If you see it make a mistake or get stuck, you can interrupt it (control-C or whatever is appropriate), and it will immediately drop into the editor. If you interrupt it a second time, it will exit rather than invoking the editor. If the editor returns a non-zero result code (fails), repair will also stop. (E.g., :cq in vim.)

One thing that repair fixes without the least trouble is the number of spaces expected after a printing tab character. It's such an omnipresent OCR software error that repair doesn't even log it as a correction.
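
The file and editor conventions just described fit in a few lines of perl. The sketch below only mimics those conventions and does no correcting at all, and the precedence of the -e option over $VISUAL over $EDITOR is an assumption on our part; the text above says only that all three are consulted.

    #!/usr/bin/perl -w
    # Sketch of repair's file and editor conventions, not repair itself.
    use strict;
    use Getopt::Std;

    my %opt;
    getopts('e:', \%opt) or die "usage: $0 [-e editor] file\n";
    my $file = shift or die "usage: $0 [-e editor] file\n";

    # assumed precedence: -e option, then $VISUAL, then $EDITOR
    my $editor = $opt{e} || $ENV{VISUAL} || $ENV{EDITOR} || 'vi';

    # If an earlier run left uncorrected input behind, pick up from
    # there rather than starting over on the original, which is never
    # modified.
    my $input = -e "$file.in" ? "$file.in" : $file;

    print "correcting $input -> $file.out (statistics in $file.log)\n";

    # When a line can't be fixed automatically, hand the leftover input
    # to the editor; a non-zero exit status from the editor stops the
    # run, just as it does with repair.
    system("$editor $input") == 0
        or die "editor reported failure; stopping\n";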

In some cases, repair can miscorrect a line and go on to the next line, possibly even more than once, finally giving up a few lines below the actual error. If you are having trouble spotting the error, one helpful trick is to exit the editor and let repair try to fix the page again, but interrupt it while it is still working on the first line, before it has found the miscorrection.

The Nasty Lines

Some lines of code, particularly those containing long runs of underscore or minus characters, are particularly difficult to scan reliably. The repair program has a special "nasty lines" feature to deal with this. If a file named "nastylines" (or whatever is specified by the -l option) exists, the lines in it are checksummed and are considered total replacements for any input line with the same checksum. So, for example, if you place a blank line in the nastylines file, any scanner noise on blank lines will be ignored. The "nastylines" file is re-read every time repair restarts after an edit, so you can add more lines as the program runs. (The error-correction patterns should be handled this way too, but that will have to wait for the next release.)

Sortpages

If, in the course of scanning, the pages have been split up or have gotten out of order, a perl script called sortpages can restore them to the proper order. It can merge multiple input files, discard duplicates, and warn about any missing pages it encounters. This script requires that the pages have been repaired, so that the page headers can be read reliably. The repair program does not care about the order it works on pages in; it examines each page independently. Unmunge, however, does need the pages in order.

Unmunging

After repair has finished its work, the unmunge program strips out the checksums and, based on the page headers, divides the data up among the various files. Its first argument is the file to unpack. The optional second argument is a manifest file that lists all of the files and the directories they go in. Supplying this (an excellent idea) lets unmunge recreate a directory hierarchy and warn about missing files. When you have unmunged everything and reconstructed the original source code, you are done. Unmunge verifies all of the checksums independently of repair, as a sanity check, and you can have high confidence that the files are exactly the same as the originals that were printed.


BOOTSTRAPPING
-------------

There's a problem with using the postprocessing tools to correct OCR errors when the code being OCRed is the tools themselves. We've tried to provide a reasonably easy way to get the system up and running starting from nothing but a copy of OmniPage.

You could just scan all of the tools in, correct any errors by hand, delete the error-checking information in a text editor, and compile them. But finding all the errors by hand is painful in a body of code that large. With the aid of perl (version 5), which provides a lot of power in very little code, we have provided some utilities to make this process easier.

The first-stage bootstrap is a one-page perl script designed to be as small and simple as possible, because you'll have to hand-correct it. It can verify the checksums on each line, and drop you into the editor on any lines where an error has occurred. It also knows how to strip out the visible spaces and tabs, how to correct spacing errors after visible tab characters, and how to invoke an editor on the erroneous line.
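
As a picture of what "stripping out the visible spaces and tabs" amounts to, here is a minimal perl sketch of that single step, using the same assumed Latin-1 stand-ins as in the earlier sketch. It is not the bootstrap script printed in this book: the real one also verifies the per-line checksums and drops you into an editor, and its exact character mappings are defined by the tools themselves.

    #!/usr/bin/perl -w
    # Whitespace-stripping step only; checksum verification and editor
    # invocation -- the real work of the bootstrap -- are omitted.
    use strict;

    while (defined(my $line = <>)) {
        chomp $line;
        # A tab marker plus whatever padding OCR thought it saw becomes
        # a single real tab, which is also how spacing errors after the
        # marker get corrected for free.
        $line =~ s/\xA4 */\t/g;
        # Visible space dots go back to ordinary spaces.
        $line =~ s/\xB7/ /g;
        print "$line\n";
    }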

Scan in the first-stage bootstrap as carefully as possible, using OmniPage's warnings to guide you to any errors, and either use a text editor or the one-line perl command at the top of the file to remove the checksums and convert any funny printed characters to whitespace form. The first thing to do is try running it on itself, and correct any errors you find this way. Note that the script writes its output to the file named in the page header, so you should name your hand-corrected version differently (or put it in a different directory) to avoid having it overwritten.

The second-stage bootstrap is a much denser one-pager with better error detection; it can detect missing lines and missing pages, and takes an optional second argument of a manifest file which it can use to put files in their proper directories. It's not strictly necessary, but it's only one more (dense) page and you can check it against itself and the original bootstrap.

Both of the bootstrap utilities can correct tab spacing errors in the OCR output. Although this doesn't matter in most source code, it is included in the checksums.

Once you have reached this point, you can scan in the C code for repair and unmunge. The C unmunge is actually less friendly than the bootstrap utilities, because it is only intended to work with the output of repair. It is, however, much faster, since computing CRCs a bit at a time in an interpreted language is painfully slow for large amounts of data. It can also deal with binary files printed in radix-64.


PRINTING
--------

Despite the title of this book, the process of producing a book is not well documented, since it's been evolving up to the moment of publication. There is, however, a very useful working example of how to produce a book (strikingly similar to this book) in the example directory, all controlled by a Makefile.

Briefly, a master perl script called psgen takes three parameters: a file list, a page numbers file to write to, and a volume number (which should always be 1 for a one-volume book). It runs the listed files through the munge utility, wraps them in some simple PostScript, and prepends a prolog that defines the special characters and PostScript functions needed by the text.

The file list also includes per-file flags. The most important is the text/binary marker. Text files can also have a tab width specified, although munge knows how to read Emacs-style tab width settings from the end of a source file (an example of such a setting appears at the end of this chapter).

The prolog is assembled by psgen from various other files and defines, using a simple preprocessor called yapp (Yet Another Preprocessor). This process includes some book-specific information like the page footer.

Producing the final PostScript requires the necessary non-standard fonts (Futura for the footers and OCRB for the code) and the psutils package, which provides the includeres utility used to embed the fonts in the PostScript file. The fonts should go in the books/ps directory, as "Futura.pfa" and the like.

The pagenums file can be used to produce a table of contents. For this book, we generated the front matter (such as this chapter) separately, told psgen to start on the next page after it, and concatenated the resultant PostScript files for printing. The only trick was making the page footers look identical.
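
For reference, an Emacs-style tab width setting is a file-local variable block near the end of the source file. In a perl or shell source file it looks like the block below; this is the standard Emacs convention, and whether munge accepts exactly these forms is determined by the tools themselves rather than by this example.

    # Local Variables:
    # tab-width: 4
    # End: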