ocr-tools/README
2019-05-15 18:32:48 +02:00

478 lines
24 KiB
Plaintext

PREFACE
-------
This book grew out of a project to publish source code for cryptographic
software, namely PGP (Pretty Good Privacy), a software package for the
encryption of electronic mail and computer files. PGP is the most widely
used software in the world for email encryption. Pretty Good Privacy, Inc
(or "PGP") has published the source code of PGP for peer review, a long-
standing tradition in the history of PGP. The first time a fully implemented
cryptographic software package was published in its entirety in book form
was "PGP Source Code and Internals," by Philip Zimmermann, published by The
MIT Press, 1995, ISBN 0-262-24039-4.
Peer review of the source code is important to get users to trust the
software, since any weaknesses can be detected by knowledgeable experts who
make the effort to review the code. But peer review cannot be completely
effective unless the experts conducting the review can compile and test the
software, and verify that it is the same as the software products that are
published electronically. To facilitate that, PGP publishes its source code
in printed form that can be scanned into a computer via OCR (optical
character recognition) technology.
Why not publish the source code in electronic form? As you may know,
cryptographic software is subject to U.S. export control laws and
regulations. The new 1997 Commerce Department Export Administration
Regulations (EAR) explicitly provide that "A printed book or other printed
material setting forth encryption source code is not itself subject to the
EAR." (see 15 C.F.R. §734.3(b)(2)). PGP, in an overabundance of caution,
has only made available its source code in a form that is not subject to
those regulations. So, books containing cryptographic source code may be
published, and after they are published they may be exported, but only
while they are still in printed form.
Electronic commerce on the Internet cannot fully be successful without
strong cryptography. Cryptography is important for protecting our privacy,
civil liberties, and the security of our personal and business transactions
in the information age. The widespread deployment of strong cryptography
can help us regain some of the privacy and security that we have lost due
to information technology. Further, strong cryptography (in the form of
PGP) has already proven itself to be a valuable tool for the protection of
human rights in oppressive countries around the world, by keeping those
governments from reading the communications of human rights workers.
This book of tools contains no cryptographic software of any kind, nor does
it call, connect, nor integrate in any way with cryptographic software. But
it does contain tools that make it easy to publish source code in book form.
And it makes it easy to scan such source code in with OCR software rapidly
and accurately.
Philip Zimmermann
prz@acm.org
November 1997
INTRODUCTION
------------
This book contains tools for printing computer source code on paper in
human-readable form and reconstructing it exactly using automated tools.
While standard OCR software can recover most of the graphic characters,
non-printing characters like tabs, spaces, newlines and form feeds cause
problems.
In fact, these tools can print any ASCII text file; it's just that the
attention these tools pay to spacing is particularly valuable for computer
source code. The two-dimensional indentation structure of source code is
very important to its comprehensibility. In some cases, distinctions
between non-printing characters are critical: the standard make utility
will not accept spaces where it expects to see a tab character.
Producing a byte-for-byte identical copy of the original is also valuable
for authentication, as you can verify a checksum.
There are five problems we have addressed:
1. Getting good OCR accuracy.
2. Preserving whitespace.
3. Preserving lines longer than can be printed on the page.
4. Dealing with data that isn't human-readable.
5. Detecting and correcting any residual errors.
The first problem is partly addressed by using a font designed for OCR
purposes, OCR-B. OCR-A is a very ugly font that contains only the digits 0
through 9 and a few special punctuation symbols. OCR-B is a very readable
monospaced font that contains a full ASCII set, and has been popular as a
font on line printers for years because it distinguishes ambiguous
characters and is clear even if fuzzy or distorted.
The most unusual thing about the OCR-B font is the way that it prints a
lower-case letter 1, with a small hook on the bottom, something like an
upper-case L. This is to distinguish it from the numeral 1. We also made
some modifications to the font, to print the numeral 0 with a slash, and
to print the vertical bar in a broken form. Both of these are such common
variants that they should not present any intelligibility barrier. Finally,
we print the underscore character in a distinct manner that is hopefully
not visually distracting, but is clearly distinguishable from the minus
sign even in the absence of a baseline reference.
The most significant part of getting good OCR accuracy is, however, using
the OCR tools well. We've done a lot of testing and experimentation and
present here a lot of information on what works and what doesn't.
To preserve whitespace, we added some special symbols to display spaces,
tabs, and form feeds. A space is printed as a small triangular dot
character, while a hollow rightward-pointing triangle (followed by blank
spaces to the right tab stop) signifies a tab. A form feed is printed as
a yen symbol, and the printed line is broken after the form feed.
Making the dot triangular instead of square helps distinguish it from a
period. To reduce the clutter on the page and make the text more readable,
the space character is only printed as a small dot if it follows a blank
on the page (a tab or another space), or comes immediately before the end
of the line. Thus, the reader (human or software) must be able to
distinguish one space from no spaces, but can find multiple spaces by
counting the dots (and adding one).
The format is designed so that 80 characters, plus checksums, can be
printed on one line of an 8.5x11" (or A4) page, the still-common punched
card line length. Longer lines are managed with the simple technique of
appending a big ugly black blob to the first part of the line indicating
that the next printed line should be concatenated with the current one
with no intervening newline. Hopefully, its use is infrequent.
While ASCII text is by far the most popular form, some source code is not
readable in the usual way. It may be an audio clip, a graphic image bitmap,
or something else that is manipulated with a specialized editing tool. For
printing purposes, these tools just print any such files as a long string
of gibberish in a 64-character set designed to be easy to OCR unambiguously.
Although the tools recognize such binary data and apply extra consistency
checks, that can be considered a separate step.
Finally, the problem of residual errors arises. OCR software is not perfect,
and uses a variety of heuristics and spelling-check dictionaries to clean up
any residual errors in human-language text. This isn't reliable enough for
source code, so we have added per-page and per-line checksums to the printed
material, and a series of tools to use those checksums to correct any
remaining errors and convert the scanned text into a series of files again.
This "munged" form is what you see in most of the body of this book. We
think it does a good job of presenting source code in a way that can be read
easily by both humans and computers.
The tools are command-line oriented and a bit clunky. This has a purpose
beyond laziness on the authors' parts: it keeps them small. Keeping them
small makes the "bootstrapping" part of scanning this book easier, since you
don't have the tools to help you with that.
SCANNING
--------
Our tests were done with OmniPage 7.0 on a Power Macintosh 8500/120 and an
HP ScanJet 4c scanner with an automatic document feeder. The first part of
this is heavily OmniPage-specific, as that appears to be the most widely
available OCR software.
The tools here were developed under Linux, and should be generally portable
to any Unix platform. Since this book is about printing and scanning source
code, we assume the readers have enough programming background to know how
to build a program from a Makefile, understand the hazards of CR, LF or CRLF
line endings, and such minor details without explicit mention.
The first step to getting OrnniPage 7 to work well is to set it up with
options to disable all of its more advanced features for preserving font
changes and formatting. Look in the Seffings menu.
· Create a Zone Contents File with all of ASCII in it, plus the extra
bullet, currency, yen and pilcrow symbols. Name it "Source Code".
· Create a Source Code style set. Within it, create a Source Code zone style
and make it the default.
· Set the font to something fixed-width, like Courier.
· Set a fixed font size (10 point) and plain text, left-aligned.
· Set the tab character to a space.
· Set the text flow to hard line returns.
· Set the margins to their widest.
· The font mapping options are irrelevant.
Go to the settings panel and:
· Under Scanner, set the brightness to manual. With careful setting of the
threshold, this generates much better results than either the automatic
threshold or the 3D OCR. Around 144 has been a good setting for us; you
may want to start there.
· Under OCR, you'll build a training file to use later, but turn off
automatic page orientation and select your Source Code style set in the
Output Options. Also set a reasonable reject character. (For test, we
used the pi symbol, which came across from the Macintosh as a weird
sequence, but you can use anything as long as you make the appropriate
definition in subst.c.)
Do an initial scan of a few pages and create a manual zone encompassing
all of the text. Leave some margin for page misalignment, and leave space
on the sides for the left-right shift caused by the book binding being in
different places on odd and even pages.
Set the Zone Contents and the Style set to the Source Code settings. After
setting the Style Set, the Zone Style should be automatically set correctly
(since you set Source Code as the default).
Then save the Zone Template, and in the pop-up menu under the Zone step on
the main toolbar you can now select it.
Now we're ready to get characters recognized. The first results will be
terrible, with lots of red (unrecognizable) and green (suspicious) text in
the recognized window. Some tweaking will improve this enormously.
The first step is setting a good black threshold. Auto brightness sets the
threshold too low, making the character outlines bleed and picking up a lot
of glitches on mostly-blank pages. Try training OCR on the few pages you've
scanned and look at the representative characters. Adjust the threshold so
the strokes are clear and distinct, neither so thin they are broken nor so
think they smear into each other. The character that bleeds worst is
lowercase w, while the underscore and tab symbols have the thinnest lines
that need worry.
You'll have to re-scan (you can just click the AUTO button) until you get
satisfactory results.
The next step is training. You should scan a significant number of pages
and teach OmniPage about any characters it has difficulty with. There are
several characters which have been printed in unusual ways which you must
teach OmniPage about before it can recognize them reliably. We also have
some characters that are unique, which the tools expect to be mapped to
specific Latin-1 characters to be processed.
They characters most in need of training are as follows:
· Zero is printed 'slashed.'
· Lowercase L has a curled tail to distinguish it clearly from other
vertical characters like 1 and I.
· The or-bar or pipe symbol '|' is printed "broken" with a gap in the
middle to distinguish it similarly.
· The underscore character has little "serifs" on the end to distinguish
it from a minus sign. We also raised it a just a tad higher than the
normal underscore character, which was too low in the character cell to
be reliably seen by OmniPage.
· Tabs are printed as a hollow right-pointing triangle, followed by blanks
to the correct alignment position. If not trained enough, OmniPage
guesses this is a capital D. You should train OmniPage to recognize this
symbol as a currency symbol (Latin-1 244).
· Any spaces in the original that follow a space, or a blank on the printed
page, are printed as a tiny black triangle. You should train OmniPage to
recognize this as a center dot or bullet (Latin-1 267). We didn't use a
standard center dot because OmniPage confused it with a period.
· Any form feeds in the original are printed as a yen currency symbol
(Latin-1 245).
· Lines over 80 columns long are broken after 79 columns by appending a big
ugly black block. You should train OmniPage to recognize this as a
pilcrow (paragraph symbol, Latin-1 266). We did this because after
deciding something black and visible was suitable, we found out the font
we used doesn't have a pilcrow in it.
The zero and the tab character, because of their frequency, deserve special
attention.
In addition, look for any unrecognized characters (in red) and retrain those
pages. If you get an unrecognized character, that character needs training,
but Caere says that "good examples" are best to train on, so if the training
doesn't recognize a slightly fuzzy K, and there's a nice crisp K available
to train on, use that.
Other things that need training:
· ~ (tilde), ^ (caret), ` (backquote) and ' (quote). These get dropped
frequently unless you train them.
· i, j and; (semicolon). These get mixed up.
· 3 and S. These also get mixed up.
· Q can fail to be recognized.
· C and [ can be confused.
· c/C, o/O, p/P, s/S, u/U, v/V, w/W, y/Y and z/Z are often confused. This
can be helped by some training.
· r gets confused with c and n. I don't understand c, but it happens.
· f gets confused with i.
The OCR training pages have lots of useful examples of troublesome
characters. Scan a few pages of material, training each page, then scan a
few dozen pages and look for recognition problems. Look for what OmniPage
reports as troublesome, and when you have the repair program working, use
it to find and report further errors. Train a few pages particularly dense
in problems and append the troublesome characters to the training file, the
re-recognize the lot.
Double-check your training file for case errors. It's easy to miss the shift
key in the middle of a lot of training and will result in terrible results
even though OmniPage won't report anything amiss. We have spent a while
wondering why OmniPage wasn't recognizing capital S or capital W, only to
find that OmniPage was just doing what it was trained to do.
We have heard some reports that OmniPage has problems with large training
files. We have observed OmniPage suffering repeatable internal errors
sometimes after massive training additions, but they were cured by deleting
a few training images. Appending more training images to the training file
did not cause the problem to re-appear.
Repairing the OCR results
If the only copy of the tools you have is printed in this book, see the next
chapter on bootstrapping at this point. Here, we assume that you have the
tools and they work.
When you have some reasonable OCR results, delete any directory pages. With
no checksum information, they just confuse the postprocessing tools. (The
tools will just stop with an error when they get to the "uncorrectable"
directory name and you'll have to delete it then, so it's not fatal if you
forget.) Copy the data to a machine that you have the repair and unmunge
utilities on.
The repair utility attempts automatic table-driven correction of common
scanning errors. You have to recompile it to change the tables, but are
encouraged to if you find a common problem that it does not correct reliably.
If it gets stuck, it will deposit you into your favorite editor on or
slightly after the offending line. (The file you will be editing is the
unprocessed portion of the input.) After you correct the problem and quit
the editor, repair will resume.
"Your favorite editor" is taken from the $VISUAL and $EDITOR environment
variables, or the -e option to repair.
The repair utility never alters the original input file. It will produce
corrected output for file in file.out, and when it has to stop, it writes
any remaining uncorrected input back out to file.in (via a temporary
file.dump) and lets you edit this file. If you re-run repair on file and
file.in exists, repair will restart from there, so you may safely quit and
re-run repair as often as you like. (But if you change the input file, you
need to delete the .in file for repair to notice the change.)
Statistics on repair's work are printed to file.log. This is an excellent
place to look to see if any characters require more training.
As it works, repair prints the line it is working on. If you see it make a
mistake or get stuck, you can interrupt it (control-C or whatever is
appropriate), and it will immediately drop into the editor. If you interrupt
it a second time, it will exit rather than invoking the editor. If the
editor returns a non-zero result code (fails), repair will also stop. (E.g.
:cq in vim.)
One thing that repair fixes without the least trouble is the number of
spaces expected after a printing tab character. It's such an omnipresent OCR
software error that repair doesn't even log it as a correction.
In some cases, repair can miscorrect a line and go on to the next line,
possibly even more than once, finally giving up a few lines below the actual
error. If you are having trouble spotting the error, one helpful trick is to
exit the editor and let repair try to fix the page again, but interrupt it
while it is still working on the first line, before it has found the
miscorrection.
The Nasty Lines
Some lines of code, particularly those containing long runs of underscore or
minus characters, are particularly difficult to scan reliably. The repair
program has a special "nasty lines" feature to deal with this. If a file
named "nastylines" (or as specified by the -l option) exists, they are
checksummed and are considered as total replacements for any input line with
the same checksum. So, for example, if you place a blank line in the
nastylines file, any scanner noise on blank lines will be ignored.
The "nastylines" file is re-read every time repair restarts after an edit,
so you can add more lines as the program runs. (The error-correction patterns
should be done this way, too, but that'll have to wait for the next release.)
Sortpages
If, in the course of scanning, the pages have been split up or have gotten
out of order, a perl script called sortpages can restore them to the proper
order. It can merge multiple input files, discard duplicates, and warns about
any missing pages it encounters. This script requires that the pages have
been repaired, so that the page headers can be read reliably. The repair
program does not care about the order it works on pages in; it examines each
page independently. Unmunge, however, does need the pages in order.
Unmunging
After repair has finished its work, the unmunge program strips out the
checksums and, based on the page headers, divides the data up among various
files. Its first argument is the file to unpack. The optional second argument
is a manifest file that lists all of the files and the directories they go
in. Supplying this (an excellent idea) lets unmunge recreate a directory
hierarchy and warn about missing files.
When you have unmunged everything and reconstructed the original source code,
you are done. Unmunge verifies all of the checksums independently of repair,
as a sanity check, and you can have high confidence that the files are
exactly the same as the originals that were printed.
BOOTSTRAPPING
-------------
There's a problem using the postprocessing tools to correct OCR errors, when
the code being OCRed is the tools themselves. We've tried to provide a
reasonably easy way to get the system up and running starting from nothing
but a copy of OmniPage.
You could just scan all of the tools in, correct any errors by hand, delete
the error-checking information in a text editor, and compile them. But
finding all the errors by hand is painful in a body of code that large.
With the aid of perl (version 5), which provides a lot of power in very
little code, we have provided some utilities to make this process easier.
The first-stage bootstrap is a one-page perl script designed to be as small
and simple as possible, because you'll have to hand-correct it. It can verify
the checksums on each line, and drop you into the editor on any lines where
an error has occurred. It also knows how to strip out the visible spaces and
tabs, how to correct spacing errors after visible tab characters, and how to
invoke an editor on the erroneous line.
Scan in the first-stage bootstrap as carefully as possible, using OmniPage's
warnings to guide you to any errors, and either use a text editor or the
one-line perl command at the top of the file to remove the checksums and
convert any funny printed characters to whitespace form.
The first thing to do is try running it on itself, and correct any errors you
find this way. Note that the script writes its output to the file named in
the page header, so you should name your hand-corrected version differently
(or put it in a different directory) to avoid having it overwritten.
The second-stage bootstrap is a much denser one-pager, with better error
detection; it can detect missing lines and missing pages, and takes an
optional second argument of a manifest file which it can use to put files
in their proper directories. It's not strictly necessary, but it's only one
more (dense) page and you can check it against itself and the original
bootstrap.
Both of the botstrap utilities can correct tab spacing errors in the OCR
output. Although this doesn't matter in most source code, it is included
in the checksums.
Once you have reached this point, you can scan in the C code for repair and
unmunge. The C unmunge is actually less friendly than the bootstrap
utilities, because it is only intended to work with the output of repair.
It is, however, much faster, since computing CRCs a bit at a time in an
interpreted language is painfully slow for large amounts of data. It can
also deal with binary files printed in radix-64.
PRINTING
--------
Despite the title of this book, this process of producing a book is not well
documented, since it's been evolving up to the moment of publication. There,
is, however, a very useful working example of how to produce a book
(strikingly similar to this book) in the example directory, all controlled
by a Makefile.
Briefly, a master perl script called psgen takes three parameters: a file
list, a page numbers file to write to, and a volume number (which should
always be 1 for a one-volume book). It runs the listed files through the
munge utility, wraps them in some simple PostScript, and prepends a prolog
that defines the special characters and PostScript functions needed by the
text.
The file list also includes per-file flags. The most important is the
text/binary marker. Text files can also have a tab width specified, although
munge knows how to read Emacs-style tab width settings from the end of a
source file.
The prolog is assembled from various other files and defines by psgen using
a simple preprocessor called yapp (Yet Another Preprocessor). This process
includes some book-specific information like the page footer.
Producing the final PostScript requires the necessary non-standard fonts
(Futura for the footers and OCRB for the code) and the psutils package,
which provides the includeres utility used to embed the fonts in the
PostScript file. The fonts should go in the books/ps directory, as
"Futura.pfa" and the like.
The pagenums file can be used to produce a table of contents. For this book,
we generated the front matter (such as this chapter) separately, told psgen
to start on the next page after this, and concatenated the resultant
PostScript files for printing. The only trick was making the page footers
look identical.