commit 60052b2f16264130264720e9cc3c576b51431e7e Author: rnhmjoj Date: Wed May 15 16:55:03 2019 +0200 initial commit diff --git a/MANIFEST b/MANIFEST new file mode 100644 index 0000000..231015b --- /dev/null +++ b/MANIFEST @@ -0,0 +1,32 @@ +1 test-file +2 MANIFEST +D books/ +D books/tools/ +3 bootstrap +4 bootstrap2 +5 sortpages +6 Makefile +7 heap.c +8 heap.h +9 mempool.c +10 mempool.h +11 util.c +12 util.h +13 repair.c +14 subst.c +15 subst.h +16 unmunge.c +17 munge.c +18 yapp.doc +19 yapp +20 psgen +21 makemanifest +D books/ps/ +22 prolog.ps +23 charmap.ps +D books/example/ +24 Makefile +25 .cvsignore +26 filelist +27 footer.ps +28 us-constitution.gz diff --git a/README b/README new file mode 100644 index 0000000..8abf4a8 --- /dev/null +++ b/README @@ -0,0 +1,477 @@ +PREFACE +------- + +This book grew out of a project to publish source code for cryptographic +software, namely PGP (Pretty Good Privacy), a software package for the +encryption of electronic mail and computer files. PGP is the most widely +used software in the world for email encryption. Pretty Good Privacy, Inc +(or "PGP") has published the source code of PGP for peer review, a long- +standing tradition in the history of PGP. The first time a fully implemented +cryptographic software package was published in its entirety in book form +was "PGP Source Code and Internals," by Philip Zimmermann, published by The +MIT Press, 1995, ISBN 0-262-24039-4. + +Peer review of the source code is important to get users to trust the +software, since any weaknesses can be detected by knowledgeable experts who +make the effort to review the code. But peer review cannot be completely +effective unless the experts conducting the review can compile and test the +software, and verify that it is the same as the software products that are +published electronically. To facilitate that, PGP publishes its source code +in printed form that can be scanned into a computer via OCR (optical +character recognition) technology. 
+ +Why not publish the source code in electronic form? As you may know, +cryptographic software is subject to U.S. export control laws and +regulations. The new 1997 Commerce Department Export Administration +Regulations (EAR) explicitly provide that "A printed book or other printed +material setting forth encryption source code is not itself subject to the +EAR." (see 15 C.F.R. §734.3(b)(2)). PGP, in an overabundance of caution, +has only made available its source code in a form that is not subject to +those regulations. So, books containing cryptographic source code may be +published, and after they are published they may be exported, but only +while they are still in printed form. + +Electronic commerce on the Internet cannot fully be successful without +strong cryptography. Cryptography is important for protecting our privacy, +civil liberties, and the security of our personal and business transactions +in the information age. The widespread deployment of strong cryptography +can help us regain some of the privacy and security that we have lost due +to information technology. Further, strong cryptography (in the form of +PGP) has already proven itself to be a valuable tool for the protection of +human rights in oppressive countries around the world, by keeping those +governments from reading the communications of human rights workers. + +This book of tools contains no cryptographic software of any kind, nor does +it call, connect, nor integrate in any way with cryptographic software. But +it does contain tools that make it easy to publish source code in book form. +And it makes it easy to scan such source code in with OCR software rapidly +and accurately. + +Philip Zimmermann +prz@acm.org + +November 1997 + + + +INTRODUCTION +------------ + +This book contains tools for printing computer source code on paper in +human-readable form and reconstructing it exactly using automated tools. 
+While standard OCR software can recover most of the graphic characters, +non-printing characters like tabs, spaces, newlines and form feeds cause +problems. + +In fact, these tools can print any ASCII text file; it's just that the +attention these tools pay to spacing is particularly valuable for computer +source code. The two-dimensional indentation structure of source code is +very important to its comprehensibility. In some cases, distinctions +between non-printing characters are critical: the standard make utility +will not accept spaces where it expects to see a tab character. + +Producing a byte-for-byte identical copy of the original is also valuable +for authentication, as you can verify a checksum. + +There are five problems we have addressed: + +1. Getting good OCR accuracy. +2. Preserving whitespace. +3. Preserving lines longer than can be printed on the page. +4. Dealing with data that isn't human-readable. +5. Detecting and correcting any residual errors. + +The first problem is partly addressed by using a font designed for OCR +purposes, OCR-B. OCR-A is a very ugly font that contains only the digits 0 +through 9 and a few special punctuation symbols. OCR-B is a very readable +monospaced font that contains a full ASCII set, and has been popular as a +font on line printers for years because it distinguishes ambiguous +characters and is clear even if fuzzy or distorted. + +The most unusual thing about the OCR-B font is the way that it prints a +lower-case letter l, with a small hook on the bottom, something like an +upper-case L. This is to distinguish it from the numeral 1. We also made +some modifications to the font, to print the numeral 0 with a slash, and +to print the vertical bar in a broken form. Both of these are such common +variants that they should not present any intelligibility barrier. 
Finally, +we print the underscore character in a distinct manner that is hopefully +not visually distracting, but is clearly distinguishable from the minus +sign even in the absence of a baseline reference. + +The most significant part of getting good OCR accuracy is, however, using +the OCR tools well. We've done a lot of testing and experimentation and +present here a lot of information on what works and what doesn't. + +To preserve whitespace, we added some special symbols to display spaces, +tabs, and form feeds. A space is printed as a small triangular dot +character, while a hollow rightward-pointing triangle (followed by blank +spaces to the right tab stop) signifies a tab. A form feed is printed as +a yen symbol, and the printed line is broken after the form feed. + +Making the dot triangular instead of square helps distinguish it from a +period. To reduce the clutter on the page and make the text more readable, +the space character is only printed as a small dot if it follows a blank +on the page (a tab or another space), or comes immediately before the end +of the line. Thus, the reader (human or software) must be able to +distinguish one space from no spaces, but can find multiple spaces by +counting the dots (and adding one). + +The format is designed so that 80 characters, plus checksums, can be +printed on one line of an 8.5x11" (or A4) page, the still-common punched +card line length. Longer lines are managed with the simple technique of +appending a big ugly black blob to the first part of the line indicating +that the next printed line should be concatenated with the current one +with no intervening newline. Hopefully, its use is infrequent. + +While ASCII text is by far the most popular form, some source code is not +readable in the usual way. It may be an audio clip, a graphic image bitmap, +or something else that is manipulated with a specialized editing tool. 
For +printing purposes, these tools just print any such files as a long string +of gibberish in a 64-character set designed to be easy to OCR unambiguously. +Although the tools recognize such binary data and apply extra consistency +checks, that can be considered a separate step. + +Finally, the problem of residual errors arises. OCR software is not perfect, +and uses a variety of heuristics and spelling-check dictionaries to clean up +any residual errors in human-language text. This isn't reliable enough for +source code, so we have added per-page and per-line checksums to the printed +material, and a series of tools to use those checksums to correct any +remaining errors and convert the scanned text into a series of files again. + +This "munged" form is what you see in most of the body of this book. We +think it does a good job of presenting source code in a way that can be read +easily by both humans and computers. + +The tools are command-line oriented and a bit clunky. This has a purpose +beyond laziness on the authors' parts: it keeps them small. Keeping them +small makes the "bootstrapping" part of scanning this book easier, since you +don't have the tools to help you with that. + + + +SCANNING +-------- + +Our tests were done with OmniPage 7.0 on a Power Macintosh 8500/120 and an +HP ScanJet 4c scanner with an automatic document feeder. The first part of +this is heavily OmniPage-specific, as that appears to be the most widely +available OCR software. + +The tools here were developed under Linux, and should be generally portable +to any Unix platform. Since this book is about printing and scanning source +code, we assume the readers have enough programming background to know how +to build a program from a Makefile, understand the hazards of CR, LF or CRLF +line endings, and such minor details without explicit mention. 
+ +The first step to getting OmniPage 7 to work well is to set it up with +options to disable all of its more advanced features for preserving font +changes and formatting. Look in the Settings menu. + +· Create a Zone Contents File with all of ASCII in it, plus the extra + bullet, currency, yen and pilcrow symbols. Name it "Source Code". +· Create a Source Code style set. Within it, create a Source Code zone style + and make it the default. +· Set the font to something fixed-width, like Courier. +· Set a fixed font size (10 point) and plain text, left-aligned. +· Set the tab character to a space. +· Set the text flow to hard line returns. +· Set the margins to their widest. +· The font mapping options are irrelevant. + +Go to the settings panel and: + +· Under Scanner, set the brightness to manual. With careful setting of the + threshold, this generates much better results than either the automatic + threshold or the 3D OCR. Around 144 has been a good setting for us; you + may want to start there. +· Under OCR, you'll build a training file to use later, but turn off + automatic page orientation and select your Source Code style set in the + Output Options. Also set a reasonable reject character. (For testing, we + used the pi symbol, which came across from the Macintosh as a weird + sequence, but you can use anything as long as you make the appropriate + definition in subst.c.) + +Do an initial scan of a few pages and create a manual zone encompassing +all of the text. Leave some margin for page misalignment, and leave space +on the sides for the left-right shift caused by the book binding being in +different places on odd and even pages. + +Set the Zone Contents and the Style set to the Source Code settings. After +setting the Style Set, the Zone Style should be automatically set correctly +(since you set Source Code as the default). + +Then save the Zone Template, and in the pop-up menu under the Zone step on +the main toolbar you can now select it. 
+ +Now we're ready to get characters recognized. The first results will be +terrible, with lots of red (unrecognizable) and green (suspicious) text in +the recognized window. Some tweaking will improve this enormously. + +The first step is setting a good black threshold. Auto brightness sets the +threshold too low, making the character outlines bleed and picking up a lot +of glitches on mostly-blank pages. Try training OCR on the few pages you've +scanned and look at the representative characters. Adjust the threshold so +the strokes are clear and distinct, neither so thin they are broken nor so +thick they smear into each other. The character that bleeds worst is +lowercase w, while the underscore and tab symbols have the thinnest lines +that need worry. + +You'll have to re-scan (you can just click the AUTO button) until you get +satisfactory results. + +The next step is training. You should scan a significant number of pages +and teach OmniPage about any characters it has difficulty with. There are +several characters which have been printed in unusual ways which you must +teach OmniPage about before it can recognize them reliably. We also have +some characters that are unique, which the tools expect to be mapped to +specific Latin-1 characters to be processed. + +The characters most in need of training are as follows: + +· Zero is printed 'slashed.' +· Lowercase L has a curled tail to distinguish it clearly from other + vertical characters like 1 and I. +· The or-bar or pipe symbol '|' is printed "broken" with a gap in the + middle to distinguish it similarly. +· The underscore character has little "serifs" on the end to distinguish + it from a minus sign. We also raised it just a tad higher than the + normal underscore character, which was too low in the character cell to + be reliably seen by OmniPage. +· Tabs are printed as a hollow right-pointing triangle, followed by blanks + to the correct alignment position. 
If not trained enough, OmniPage + guesses this is a capital D. You should train OmniPage to recognize this + symbol as a currency symbol (Latin-1 octal 244). +· Any spaces in the original that follow a space, or a blank on the printed + page, are printed as a tiny black triangle. You should train OmniPage to + recognize this as a center dot or bullet (Latin-1 octal 267). We didn't use a + standard center dot because OmniPage confused it with a period. +· Any form feeds in the original are printed as a yen currency symbol + (Latin-1 octal 245). +· Lines over 80 columns long are broken after 79 columns by appending a big + ugly black block. You should train OmniPage to recognize this as a + pilcrow (paragraph symbol, Latin-1 octal 266). We did this because after + deciding something black and visible was suitable, we found out the font + we used doesn't have a pilcrow in it. + +The zero and the tab character, because of their frequency, deserve special +attention. + +In addition, look for any unrecognized characters (in red) and retrain those +pages. If you get an unrecognized character, that character needs training, +but Caere says that "good examples" are best to train on, so if the training +doesn't recognize a slightly fuzzy K, and there's a nice crisp K available +to train on, use that. + +Other things that need training: + +· ~ (tilde), ^ (caret), ` (backquote) and ' (quote). These get dropped + frequently unless you train them. +· i, j and ; (semicolon). These get mixed up. +· 3 and S. These also get mixed up. +· Q can fail to be recognized. +· C and [ can be confused. +· c/C, o/O, p/P, s/S, u/U, v/V, w/W, y/Y and z/Z are often confused. This + can be helped by some training. +· r gets confused with c and n. I don't understand c, but it happens. +· f gets confused with i. + +The OCR training pages have lots of useful examples of troublesome +characters. Scan a few pages of material, training each page, then scan a +few dozen pages and look for recognition problems. 
Look for what OmniPage +reports as troublesome, and when you have the repair program working, use +it to find and report further errors. Train a few pages particularly dense +in problems and append the troublesome characters to the training file, then +re-recognize the lot. + +Double-check your training file for case errors. It's easy to miss the shift +key in the middle of a lot of training, and doing so produces terrible results +even though OmniPage won't report anything amiss. We have spent a while +wondering why OmniPage wasn't recognizing capital S or capital W, only to +find that OmniPage was just doing what it was trained to do. + +We have heard some reports that OmniPage has problems with large training +files. We have observed OmniPage suffering repeatable internal errors +sometimes after massive training additions, but they were cured by deleting +a few training images. Appending more training images to the training file +did not cause the problem to re-appear. + +Repairing the OCR results + +If the only copy of the tools you have is printed in this book, see the next +chapter on bootstrapping at this point. Here, we assume that you have the +tools and they work. + +When you have some reasonable OCR results, delete any directory pages. With +no checksum information, they just confuse the postprocessing tools. (The +tools will just stop with an error when they get to the "uncorrectable" +directory name and you'll have to delete it then, so it's not fatal if you +forget.) Copy the data to a machine that you have the repair and unmunge +utilities on. + +The repair utility attempts automatic table-driven correction of common +scanning errors. You have to recompile it to change the tables, but are +encouraged to if you find a common problem that it does not correct reliably. +If it gets stuck, it will deposit you into your favorite editor on or +slightly after the offending line. (The file you will be editing is the +unprocessed portion of the input.) 
After you correct the problem and quit +the editor, repair will resume. + +"Your favorite editor" is taken from the $VISUAL and $EDITOR environment +variables, or the -e option to repair. + +The repair utility never alters the original input file. It will produce +corrected output for file in file.out, and when it has to stop, it writes +any remaining uncorrected input back out to file.in (via a temporary +file.dump) and lets you edit this file. If you re-run repair on file and +file.in exists, repair will restart from there, so you may safely quit and +re-run repair as often as you like. (But if you change the input file, you +need to delete the .in file for repair to notice the change.) + +Statistics on repair's work are printed to file.log. This is an excellent +place to look to see if any characters require more training. + +As it works, repair prints the line it is working on. If you see it make a +mistake or get stuck, you can interrupt it (control-C or whatever is +appropriate), and it will immediately drop into the editor. If you interrupt +it a second time, it will exit rather than invoking the editor. If the +editor returns a non-zero result code (fails), repair will also stop. (E.g. +:cq in vim.) + +One thing that repair fixes without the least trouble is the number of +spaces expected after a printing tab character. It's such an omnipresent OCR +software error that repair doesn't even log it as a correction. + +In some cases, repair can miscorrect a line and go on to the next line, +possibly even more than once, finally giving up a few lines below the actual +error. If you are having trouble spotting the error, one helpful trick is to +exit the editor and let repair try to fix the page again, but interrupt it +while it is still working on the first line, before it has found the +miscorrection. + +The Nasty Lines + +Some lines of code, particularly those containing long runs of underscore or +minus characters, are particularly difficult to scan reliably. 
The repair +program has a special "nasty lines" feature to deal with this. If a file +named "nastylines" (or as specified by the -l option) exists, they are +checksummed and are considered as total replacements for any input line with +the same checksum. So, for example, if you place a blank line in the +nastylines file, any scanner noise on blank lines will be ignored. + +The "nastylines" file is re-read every time repair restarts after an edit, +so you can add more lines as the program runs. (The error-correction patterns +should be done this way, too, but that'll have to wait for the next release.) + +Sortpages + +If, in the course of scanning, the pages have been split up or have gotten +out of order, a perl script called sortpages can restore them to the proper +order. It can merge multiple input files, discard duplicates, and warns about +any missing pages it encounters. This script requires that the pages have +been repaired, so that the page headers can be read reliably. The repair +program does not care about the order it works on pages in; it examines each +page independently. Unmunge, however, does need the pages in order. + +Unmunging + +After repair has finished its work, the unmunge program strips out the +checksums and, based on the page headers, divides the data up among various +files. Its first argument is the file to unpack. The optional second argument +is a manifest file that lists all of the files and the directories they go +in. Supplying this (an excellent idea) lets unmunge recreate a directory +hierarchy and warn about missing files. + +When you have unmunged everything and reconstructed the original source code, +you are done. Unmunge verifies all of the checksums independently of repair, +as a sanity check, and you can have high confidence that the files are +exactly the same as the originals that were printed. 
+ + + +BOOTSTRAPPING +------------- + +There's a problem using the postprocessing tools to correct OCR errors, when +the code being OCRed is the tools themselves. We've tried to provide a +reasonably easy way to get the system up and running starting from nothing +but a copy of OmniPage. + +You could just scan all of the tools in, correct any errors by hand, delete +the error-checking information in a text editor, and compile them. But +finding all the errors by hand is painful in a body of code that large. +With the aid of perl (version 5), which provides a lot of power in very +little code, we have provided some utilities to make this process easier. + +The first-stage bootstrap is a one-page perl script designed to be as small +and simple as possible, because you'll have to hand-correct it. It can verify +the checksums on each line, and drop you into the editor on any lines where +an error has occurred. It also knows how to strip out the visible spaces and +tabs, how to correct spacing errors after visible tab characters, and how to +invoke an editor on the erroneous line. + +Scan in the first-stage bootstrap as carefully as possible, using OmniPage's +warnings to guide you to any errors, and either use a text editor or the +one-line perl command at the top of the file to remove the checksums and +convert any funny printed characters to whitespace form. + +The first thing to do is try running it on itself, and correct any errors you +find this way. Note that the script writes its output to the file named in +the page header, so you should name your hand-corrected version differently +(or put it in a different directory) to avoid having it overwritten. + +The second-stage bootstrap is a much denser one-pager, with better error +detection; it can detect missing lines and missing pages, and takes an +optional second argument of a manifest file which it can use to put files +in their proper directories. 
It's not strictly necessary, but it's only one +more (dense) page and you can check it against itself and the original +bootstrap. + +Both of the bootstrap utilities can correct tab spacing errors in the OCR +output. Although this doesn't matter in most source code, it is included +in the checksums. + +Once you have reached this point, you can scan in the C code for repair and +unmunge. The C unmunge is actually less friendly than the bootstrap +utilities, because it is only intended to work with the output of repair. +It is, however, much faster, since computing CRCs a bit at a time in an +interpreted language is painfully slow for large amounts of data. It can +also deal with binary files printed in radix-64. + + + +PRINTING +-------- + +Despite the title of this book, this process of producing a book is not well +documented, since it's been evolving up to the moment of publication. There +is, however, a very useful working example of how to produce a book +(strikingly similar to this book) in the example directory, all controlled +by a Makefile. + +Briefly, a master perl script called psgen takes three parameters: a file +list, a page numbers file to write to, and a volume number (which should +always be 1 for a one-volume book). It runs the listed files through the +munge utility, wraps them in some simple PostScript, and prepends a prolog +that defines the special characters and PostScript functions needed by the +text. + +The file list also includes per-file flags. The most important is the +text/binary marker. Text files can also have a tab width specified, although +munge knows how to read Emacs-style tab width settings from the end of a +source file. + +The prolog is assembled from various other files and defined by psgen using +a simple preprocessor called yapp (Yet Another Preprocessor). This process +includes some book-specific information like the page footer. 
+ +Producing the final PostScript requires the necessary non-standard fonts +(Futura for the footers and OCRB for the code) and the psutils package, +which provides the includeres utility used to embed the fonts in the +PostScript file. The fonts should go in the books/ps directory, as +"Futura.pfa" and the like. + +The pagenums file can be used to produce a table of contents. For this book, +we generated the front matter (such as this chapter) separately, told psgen +to start on the next page after this, and concatenated the resultant +PostScript files for printing. The only trick was making the page footers +look identical. diff --git a/example/.cvsignore b/example/.cvsignore new file mode 100644 index 0000000..076540c --- /dev/null +++ b/example/.cvsignore @@ -0,0 +1,3 @@ +pagenums +MANIFEST +code.ps diff --git a/example/Makefile b/example/Makefile new file mode 100644 index 0000000..a3e3a82 --- /dev/null +++ b/example/Makefile @@ -0,0 +1,23 @@ +BOOKROOT=.. +TOOLSDIR=$(BOOKROOT)/tools +PSDIR=$(BOOKROOT)/ps +YAPP=$(TOOLSDIR)/yapp +MAKEMANIFEST=$(TOOLSDIR)/makemanifest +PSGEN=BOOKROOT=$(BOOKROOT) $(TOOLSDIR)/psgen +INCLUDERES=(cd $(PSDIR); includeres) + +code.ps pagenums: filelist footer.ps MANIFEST books + $(PSGEN) -P2 -l3 -DfooterFile=footer.ps filelist pagenums 1 \ + | $(INCLUDERES) > code.ps + +books: + ln -s $(BOOKROOT) books + +MANIFEST: filelist + $(MAKEMANIFEST) $< > $@ + +clean: + rm -f `cat .cvsignore` + +gv%: %.ps + gv $< diff --git a/example/filelist b/example/filelist new file mode 100644 index 0000000..887c718 --- /dev/null +++ b/example/filelist @@ -0,0 +1,32 @@ +V 1 8 +T MANIFEST +D books/ +D books/tools/ +T books/tools/bootstrap +T books/tools/bootstrap2 +T4 books/tools/sortpages +T books/tools/Makefile +T books/tools/heap.c +T books/tools/heap.h +T books/tools/mempool.c +T books/tools/mempool.h +T books/tools/util.c +T books/tools/util.h +T books/tools/repair.c +T books/tools/subst.c +T books/tools/subst.h +T books/tools/unmunge.c +T 
books/tools/munge.c +T books/tools/yapp.doc +T4 books/tools/yapp +T4 books/tools/psgen +T4 books/tools/makemanifest +D books/ps/ +T books/ps/prolog.ps +T books/ps/charmap.ps +D books/example/ +T books/example/Makefile +T books/example/.cvsignore +T books/example/filelist +T books/example/footer.ps +B books/example/us-constitution.gz diff --git a/example/footer.ps b/example/footer.ps new file mode 100644 index 0000000..52f6b7b --- /dev/null +++ b/example/footer.ps @@ -0,0 +1,5 @@ +% A program to print the page footer, using the magic P function, +% which takes a string and a font. +(Tools for Publishing Source Code via OCR ) /Futura P +(\343) /Symbol P % Copyright symbol +( 1997 Pretty Good Privacy, Inc.) /Futura P diff --git a/example/us-constitution.gz b/example/us-constitution.gz new file mode 100644 index 0000000..1a058ca Binary files /dev/null and b/example/us-constitution.gz differ diff --git a/ps/charmap.ps b/ps/charmap.ps new file mode 100644 index 0000000..1602072 --- /dev/null +++ b/ps/charmap.ps @@ -0,0 +1,68 @@ +%%BeginResource: procset Latin1-vec 0 0 +/Latin1-vec [ +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/space /exclam /quotedbl /numbersign +/dollar /percent /ampersand /${rightQuoteGlyph} +/parenleft /parenright /asterisk /plus +/comma /hyphen /period /slash +/${zeroGlyph} /one /two /three +/four /five /six /seven +/eight /nine /colon /semicolon +/less /equal /greater /question +/at /A /B /C +/D /E /F /G +/H /I /J /K +/L /M /N /O +/P /Q /R /S +/T /U /V /W +/X /Y /Z /bracketleft +/backslash /bracketright /asciicircum /${underscoreGlyph} +/${leftQuoteGlyph} /a /b /c +/d /e /f /g +/h /i /j /k +/l /m /n /o +/p /q /r /s +/t /u /v /w +/x /y /z /braceleft +/${barGlyph} /braceright /tilde /.notdef +/.notdef 
/.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/.notdef /.notdef /.notdef /.notdef +/space /exclamdown /cent /sterling +/${tabGlyph} /yen /brokenbar /section +/dieresis /copyright /ordfeminine /guillemotleft +/logicalnot /hyphen /registered /macron +/degree /plusminus /twosuperior /threesuperior +/acute /mu /${pilcrowGlyph} /${bulletGlyph} +/cedilla /dotlessi /ordmasculine /guillemotright +/onequarter /onehalf /threequarters /questiondown +/Agrave /Aacute /Acircumflex /Atilde +/Adieresis /Aring /AE /Ccedilla +/Egrave /Eacute /Ecircumflex /Edieresis +/Igrave /Iacute /Icircumflex /Idieresis +/Eth /Ntilde /Ograve /Oacute +/Ocircumflex /Otilde /Odieresis /multiply +/Oslash /Ugrave /Uacute /Ucircumflex +/Udieresis /Yacute /Thorn /germandbls +/agrave /aacute /acircumflex /atilde +/adieresis /aring /ae /ccedilla +/egrave /eacute /ecircumflex /edieresis +/igrave /iacute /icircumflex /idieresis +/eth /ntilde /ograve /oacute +/ocircumflex /otilde /odieresis /divide +/oslash /ugrave /uacute /ucircumflex +/udieresis /yacute /thorn /ydieresis +]def +%%EndResource diff --git a/ps/prolog.ps b/ps/prolog.ps new file mode 100644 index 0000000..d3bf8c9 --- /dev/null +++ b/ps/prolog.ps @@ -0,0 +1,306 @@ +##set pageNumFont="Futura" +##set dirNameFont="Futura-Heavy" +##set fontsNeeded="${font} Symbol Futura Futura-Heavy" +##set includeFontComments=<<"END" +%%IncludeResource: font ${font} +%%IncludeResource: font Symbol +%%IncludeResource: font Futura +%%IncludeResource: font Futura-Heavy +END +##if ${font} eq Courier +##set charShrinkFactor=0.93 +##set zeroGlyph=Oslash +##set underscoreGlyph=underscore +##set bulletGlyph=bullet +##set tabGlyph=currency +##set leftQuoteGlyph=quoteleft +##set rightQuoteGlyph=quoteright +##set pilcrowGlyph=paragraph +##set barGlyph=bar +##else +##set 
charShrinkFactor=1 +##set zeroGlyph=Oslash +##set underscoreGlyph=underscore2 +##set bulletGlyph=bullet2 +##set tabGlyph=tabsym +##set leftQuoteGlyph=grave +##set rightQuoteGlyph=quoteright ### was "acute" +##set pilcrowGlyph=erase +##set barGlyph=orsym +##set do_custom_chars=1 +##endif +%!PS-Adobe-3.0 +%%Orientation: Portrait +%%Pages: (atend) +%%DocumentNeededResources: font ${fontsNeeded} +%%DocumentMedia: Letter 612 792 74 white () +%%EndComments +%%BeginDefaults +%%PageMedia: Letter +%%PageResources: font ${fontsNeeded} +%%EndDefaults +%%BeginProlog +%%BeginResource: procset Custom-Preamble 0 0 +% +% Document definitions +% (Upper case to avoid collisions) +% + +% 8.5x11 paper is 612x792 points, but 24 points near the edge or so +% shouldn't be used. +/Topmargin 770 def +/Leftmargin 30 def +/Rightmargin 612 Leftmargin sub def +/Botmargin 22 def +/Bindoffset 40 def + +/Lineskip -10 def +% How much to shrink characters by? +/Factor ${charShrinkFactor} def +/Fontsize 9.5 Factor mul def +% (1000 units is std height, so Courier at 6/10 aspect ratio is 600. +% Widen to make up for scaling loss. 
+/Charwidth + Rightmargin Leftmargin sub Bindoffset sub 87 div Fontsize div 1000 mul +def + +% Print a header (expects page number on stack) +/OddPageStart +{ save exch /MyFont findfont Fontsize scalefont setfont + /CurrentLeft Leftmargin Bindoffset add def + /CurrentRight Rightmargin def + CurrentLeft Topmargin moveto } def + +/EvenPageStart +{ save exch /MyFont findfont Fontsize scalefont setfont + /CurrentLeft Leftmargin def + /CurrentRight Rightmargin Bindoffset sub def + CurrentLeft Topmargin moveto } def + +% /MyFont findfont [Fontsize 0 0 Fontsize 0 0] makefont setfont + +% Print the name of the directory in a large font +/DirPage +{ + /${dirNameFont} findfont 14 scalefont setfont + 0 -10 rmoveto (Directory) show + CurrentLeft 30 add currentpoint exch pop 20 sub moveto show +} def + +% Advance a line +/L {show CurrentLeft currentpoint exch pop Lineskip add moveto} bind def + +% Print the "inside" footer line using P (string font => ) +% We do some magic involving redefining P to first measure the +% width of this string and then print it, so you must use it +% to do all printing. 
+/Foot { +##ifdef footerFile +##include "${footerFile}" +##endif +} def + +% /P is defined in the Setup section + +% Print an odd footer +/OddPageEnd + { CurrentLeft Botmargin moveto CurrentRight Botmargin lineto + 1 setlinewidth stroke + CurrentLeft Botmargin 10 sub moveto + Foot + 10 string cvs dup stringwidth + pop CurrentRight exch sub currentpoint exch pop moveto + /${pageNumFont} P + showpage + restore +} def + +% Print an even footer +/EvenPageEnd + { CurrentLeft Botmargin moveto CurrentRight Botmargin lineto + 1 setlinewidth stroke + Leftmargin Botmargin 10 sub moveto + /${pageNumFont} P + CurrentRight FootWidth sub currentpoint exch pop moveto + Foot + showpage + restore +} def + +##ifdef do_custom_chars +% A 1000-point OCRB discunderline consists of: +% 111.45 -173.688 moveto +% 609.356 -173.688 lineto +% 609.356 -70.9227 lineto +% 111.45 -70.9227 lineto +% closepath +% 720.0 -0.0 moveto +% Line thickness is +% 102.7653 pts. + +% This would suggest the following values: +/underleft 111.45 def +/underright 609.356 def +/underthick 102.7643 def +/underup underthick def +/underdown 0 def +/underserif 25 def + +% These look better in GhostScript, but not on a real Adobe rasterizer +%/underright 600 def +%/underleft 100 def +%/underthick 75 def + +171 +211 +36081 +% The default bullet character is +% 254.0 341.0 moveto +% 254.0 170.0 lineto +% 465.0 170.0 lineto +% 465.0 341.0 lineto +% closepath +% Our modified version is based on: +/bullwid 204 def +/bullht 176.75 def +/bullleft 254 341 add bullwid sub 2 div def +/bullright 254 341 add bullwid add 2 div def +/bullbot 254 def +/bulltop bullbot bullht add def + +% And a custom-created tab symbol +/tableft 250 def +/tabright 550 def +/tabtop 550 def +/tabbot 50 def +/tablinewidth 35 def + +% Let's try a vertical bar +% OCRB defines (|) +% 411.062 -173.688 moveto +% 411.062 741.043 lineto +% 308.297 741.043 lineto +% 308.297 -173.688 lineto +% closepath +% 720.0 -0.0 moveto +/orleft 308.297 def +/orright 411.062 
def +/orbot -173.688 def +/ortop 741.043 def +/orbreak 150 def % Width of break +/orbbot ortop orbot add orbreak sub 2 div def % Bottom of break +/orbtop ortop orbot add orbreak add 2 div def % Top of break +##endif + +% newfontname encoding-vec fontname -> - make a new encoded font +/MF2 { + % Make a dict for the new font, with room for the /Metrics + findfont dup length 1 add dict begin + % Copy everything except the FID entry + {1 index /FID eq {pop pop} {def} ifelse} forall + % Set the encoding vector + /Encoding exch def + +##ifdef do_custom_chars + % Create a new expanded CharStrings dictionary + CharStrings dup length 5 add dict + begin { def } forall + % Create a custom underscore character + /underscore2 { + pop + //Charwidth 0 % width, bounding box follows + //underleft //underdown neg //underright //underthick //underup add + setcachedevice + //underleft //underthick //underup add moveto + //underleft //underserif add //underthick //underup add lineto + //underleft //underserif add //underthick lineto + //underright //underserif sub //underthick lineto + //underright //underserif sub //underthick //underup add lineto + //underright //underthick //underup add lineto + //underright //underdown neg lineto + //underright //underserif sub //underdown neg lineto + //underright //underserif sub 0 lineto + //underleft //underserif add 0 lineto + //underleft //underserif add //underdown neg lineto + //underleft //underdown neg lineto + closepath fill + } bind def + % Create a custom bullet character. + /bullet2 { + pop + //Charwidth 0 % width, bounding box follows + //bullleft //bullbot //bullright //bulltop + setcachedevice + //bullleft //bullbot moveto + //bullleft bullright add 2 div bulltop lineto + //bullright //bullbot lineto + closepath fill + } bind def + % Create a custom tab character. 
+ /tabsym { + pop + //Charwidth 0 % width, bounding box follows + //tableft //tablinewidth sub //tabbot //tablinewidth sub + //tabright //tablinewidth add //tabtop //tablinewidth add + setcachedevice + //tablinewidth setlinewidth + true setstrokeadjust + 0 setlinejoin + //tableft //tabbot moveto + //tabright //tabtop //tabbot add 2 div lineto + //tableft //tabtop lineto + closepath stroke + } bind def + /orsym { + pop + //Charwidth 0 % width, bounding box follows + //orleft //orbot //orright //ortop + setcachedevice + //orleft //orbot moveto + //orleft //orbbot lineto + //orright //orbbot lineto + //orright //orbot lineto + closepath + //orleft //ortop moveto + //orleft //orbtop lineto + //orright //orbtop lineto + //orright //ortop lineto + closepath fill + } bind def + /CharStrings currentdict end def +##endif + + % Create a new dict to be the /Metrics values + CharStrings dup length dict + % Now fill in the metrics dict with the desired width + begin { pop Charwidth def } forall /Metrics currentdict end def + % End of definitions + currentdict end + % Define the font + definefont pop +} bind def + +% Check PostScript language level. +/gs_languagelevel /languagelevel where { pop languagelevel } { 1 } ifelse def + +%%EndResource +##include "charmap.ps" +${includeFontComments} +%%EndProlog + + +%%BeginSetup + +/MyFont Latin1-vec /${font} MF2 +/#copies 1 def + +% Compute the width of the /Foot string, by defining P to +% add up the x-width of the characters. 
+/P { findfont 9 scalefont setfont stringwidth pop add } def +/FootWidth 0 Foot def +% Redefine P to print, as usual +/P { findfont 9 scalefont setfont show } def +%%BeginResource: procset foo 0 0 +% This is an example +%%EndResource +%%EndSetup diff --git a/tools/Makefile b/tools/Makefile new file mode 100644 index 0000000..138d5b7 --- /dev/null +++ b/tools/Makefile @@ -0,0 +1,30 @@ +all: unmunge repair munge + +OPT = -g -O -W -Wall +COMMON_OBJS = util.o + +UNMUNGE_OBJS = $(COMMON_OBJS) unmunge.o +MUNGE_OBJS = $(COMMON_OBJS) munge.o +REPAIR_OBJS = $(COMMON_OBJS) heap.o mempool.o subst.o repair.o + +unmunge: $(UNMUNGE_OBJS) + $(CC) $(OPT) -o $@ $(UNMUNGE_OBJS) + +munge: $(MUNGE_OBJS) + $(CC) $(OPT) -o $@ $(MUNGE_OBJS) + +repair: $(REPAIR_OBJS) + $(CC) $(OPT) -o $@ $(REPAIR_OBJS) + +.c.o: + $(CC) $(OPT) -o $@ -c $< + +clean: + -rm -f *.o munge unmunge repair core *.core + +unmunge.o: util.h +munge.o: util.h +repair.o: heap.h mempool.h util.h subst.h +heap.o: heap.h +mempool.o: mempool.h +subst.o: subst.h diff --git a/tools/bootstrap b/tools/bootstrap new file mode 100644 index 0000000..768aae5 --- /dev/null +++ b/tools/bootstrap @@ -0,0 +1,68 @@ +#!/usr/bin/perl -s +# +# bootstrap -- Simpler version of unmunge for bootstrapping +# +# Unmunge this file using: +# perl -ne 'if (s/^ *[^-\s]\S{4,6} ?//) { s/[\244\245\267]/ /g; print; }' +# +# $Id: bootstrap,v 1.15 1997/11/14 03:52:53 mhw Exp $ + +sub Fatal { print STDERR @_; exit(1); } +sub Max { my ($a, $b) = @_; ($a > $b) ? 
$a : $b; } +sub TabSkip { $tabWidth - 1 - (length($_[0]) % $tabWidth); } + +($tab,$yen,$pilc,$cdot,$tmp1,$tmp2)=("\244","\245","\266","\267","\377","\376"); +$editor = $ENV{'VISUAL'} || $ENV{'EDITOR'} || 'vi'; +$inFile = $ARGV[0]; +doFile: { + open(IN, "<$inFile") || die; + for ($lineNum = 1; ($_ = <IN>); $lineNum++) { + s/^\s+//; s/\s+$//; # Strip leading and trailing spaces + next if (/^$/); # Ignore blank lines + ($prefix, $seenCRCStr, $dummy, $_) = /^(\S{2})(\S{4})( (.*))?/; + + # Correct the number of spaces after each tab + while (s/$tab( *)/$tmp1 . ($tmp2 x &Max(length($1), &TabSkip($`)))/e) {} + s/ ( +)/" " . ($cdot x length($1))/eg; # Correct center dots + s/$tmp1/$tab/g; s/$tmp2/ /g; # Restore tabs and spaces from correction + s/\s*$/\n/; # Strip trailing spaces, and add a newline + + $crc = $seenCRC = 0; # Calculate CRC + for ($data = $_; $data ne ""; $data = substr($data, 1)) { + $crc ^= ord($data); + for (1..8) { + $crc = ($crc >> 1) ^ (($crc & 1) ? 0x8408 : 0); + } + } + if ($crc != hex($seenCRCStr)) { # CRC mismatch + close(IN); close(OUT); + unlink(@filesCreated); + @filesCreated = (); + @oldStat = stat($inFile); + system($editor, "+$lineNum", $inFile); + @newStat = stat($inFile); + redo doFile if ($oldStat[9] != $newStat[9]); # Check mod date + &Fatal("Line $lineNum invalid: $_"); + } + + if ($prefix eq '--') { # Process header line + ($code, $pageNum, $file) = /^(\S{19}) Page (\d+) of (.*)/; + $tabWidth = hex(substr($code, 11, 1)); + if ($file ne $lastFile) { + print "$file\n"; + &Fatal("$file: already exists\n") if (!$f && (-e $file)); + close(OUT); + open(OUT, ">$file") || &Fatal("$file: $!\n"); + push(@filesCreated, ($lastFile = $file)); + } + } else { # Unmunge normal line + s/$tab( *)/"\t".(" " x (length($1) - &TabSkip($`)))/eg; + s/$yen\n/\f/; # Handle form feeds + s/$pilc\n//; # Handle continuation lines + s/$cdot/ /g; # Center dots -> spaces + + print OUT; + } + } + close(IN); close(OUT); +} diff --git a/tools/bootstrap2 b/tools/bootstrap2 new
file mode 100644 index 0000000..4bba127 --- /dev/null +++ b/tools/bootstrap2 @@ -0,0 +1,72 @@ +#!/usr/bin/perl -s +# +# bootstrap2 -- Second stage bootstrapper, a version of unmunge +# +# $Id: bootstrap2,v 1.4 1997/11/14 03:52:54 mhw Exp $ + +sub Cleanup { close(IN); close(OUT); unlink(@files); @files = (); } +sub Fatal { &Cleanup(); print STDERR @_; exit(1); } +sub TabSkip { $tabWidth - 1 - (length($_[0]) % $tabWidth); } +sub TabFix { my ($needed, $actual) = (&TabSkip($_[0]), length($_[1])); + $tmp1 . ($tmp2 x $needed) . (" " x ($actual - $needed)); } +sub HumanEdit { my ($file, $line, @message) = ($inFile, @_); &Cleanup(); + @old = stat($file); system($editor, "+$line", $file); @new = stat($file); + redo doFile if ($old[9] != $new[9]); # Check mod date + &Fatal("Line $line, ", @message); } + +($tab,$yen,$pilc,$cdot,$tmp1,$tmp2)=("\244","\245","\266","\267","\377","\376"); +$editor = $ENV{'VISUAL'} || $ENV{'EDITOR'} || 'vi'; +($inFile, $manifest, @rest) = @ARGV; +if ($manifest ne "") { # Read manifest file + open(MANIFEST, "<$manifest") || &Fatal("$manifest: $!\n"); + while (<MANIFEST>) { $dir = $1 if /^D\s+(.*)$/; + $index[$1] = $dir . $2 if /^(\d+)\s+(.*)$/; } +} +doFile: { + $seenPCRC = $pcrc1 = 0; $lastFlags = 1; $lastFileNum = 0; + open(IN, "<$inFile") || &Fatal("$inFile: $!\n"); + for ($line = 1; ($_ = <IN>); $line++) { + s/^\s+//; s/\s+$//; # Strip leading and trailing spaces + next if (/^$/); # Ignore blank lines + ($prefix, $seenCRCStr, $dummy, $_) = /^(\S{2})(\S{4})( (.*))?/; + while (s/$tab( *)/&TabFix($`, $1)/eo) {} # Correct spaces after tabs + s/($tmp2| )( +)/$1 . ($cdot x length($2))/ego; # Correct center dots + s/$tmp1/$tab/go; s/$tmp2/ /go; # Restore tabs/spaces from correction + s/\s*$/\n/; # Strip trailing spaces, and add a newline + + $crc = 0; $pcrc = $pcrc1; # Calculate CRCs + for ($data = $_; $data ne ""; $data = substr($data, 1)) { + $crc ^= ord($data); $pcrc1 ^= ord($data); + for (1..8) { $crc = ($crc >> 1) ^ (($crc & 1) ?
0x8408 : 0); + $pcrc1 = ($pcrc1 >> 1) ^ (($pcrc1 & 1) ? 0xedb88320 : 0); } + } + ($seenPLCRC, $seenCRC) = map { hex($_) } ($prefix, $seenCRCStr); + &HumanEdit($line, "CRC failed: $_") if $crc != $seenCRC; + if ($prefix eq '--') { # Process header line + &HumanEdit($line - 1, "Page CRC failed") if $pcrc != $seenPCRC; + ($humanHdr, $pageNum, $file) = /^\S{19} (Page (\d+) of (.*))/; + ($vers, $flags, $seenPCRC, $tabWidth, $prodNum, $fileNum) = + map { hex($_) } /^(\S)(\S\S)(\S{8})(\S)(\S{3})(\S{4})/; + if ($fileNum != $lastFileNum) { + print STDERR "MISSING files\n" if $fileNum != $lastFileNum + 1; + &Fatal("Missing pages\n") if $pageNum != 1 || !($lastFlags & 1); + if ($manifest ne "") { + ($_ = $index[$fileNum]) =~ m%([^/]*)$%; + &Fatal("Manifest mismatch\n") if ($file ne $1); + ($file = $_) =~ s|/+|mkdir($`, 0777), "/"|eg; # mkdir -p + } + &Fatal("$file: already exists\n") if (!$f && (-e $file)); + close(OUT); open(OUT, ">$file") || &Fatal("$file: $!\n"); + push(@files, $file); print "$fileNum $file\n"; + } else { + &Fatal("MISSING pages\n") if ($pageNum != $lastPageNum + 1); + } + ($lastFlags,$lastFileNum,$lastPageNum) = ($flags,$fileNum,$pageNum); + $pcrc1 = 0; + } else { # Unmunge normal line + &HumanEdit($line, "CRC failed: $_") if ($pcrc1 >> 24) != $seenPLCRC; + s/$tab( *)/"\t".(" " x (length($1) - &TabSkip($`)))/ego; + s/$yen\n/\f/o; s/$pilc\n//o; s/$cdot/ /go; print OUT; + } + } +} diff --git a/tools/heap.c b/tools/heap.c new file mode 100644 index 0000000..6d0474c --- /dev/null +++ b/tools/heap.c @@ -0,0 +1,144 @@ +/* + * heap.c -- Simple priority queue. Takes pointers to cost values + * (presumably the first field in a larger structure) and returns + * them in increasing order of cost. + * + * Copyright (C) 1997 Pretty Good Privacy, Inc. + * + * Written by Colin Plumb and Mark H. Weaver + * + * $Id: heap.c,v 1.2 1997/07/05 02:55:23 colin Exp $ + */ + +#include <stdio.h> /* For fprintf(stderr, "Out of memory") */ +#include <stdlib.h> /* For malloc() & co.
*/ + +#include "heap.h" + +#define HeapParent(i) ((i) / 2) +#define HeapLeftChild(i) ((i) * 2) +#define HeapRightChild(i) ((i) * 2 + 1) +#define HeapElem(h, i) (h)->elems[i] +#define HeapMinElem(h) HeapElem(h, 1) +#define HeapElemCost(e) (*(e)) +#define HeapCost(h, i) HeapElemCost(HeapElem(h, i)) +#define HeapSize(h) ((h)->numElems) + +static void +SiftDown(Heap const *heap, HeapCost *e) +{ + HeapIndex size = HeapSize(heap), parent = 1, child; + HeapCost cparent = HeapElemCost(e), cchild; + + for (;;) { + child = 2*parent; + if (child > size) + break; + cchild = HeapCost(heap, child); + if (child < size && cchild > HeapCost(heap, child+1)) { + cchild = HeapCost(heap, child+1); + child++; + } + if (cparent <= cchild) + break; /* Stop sifting down */ + HeapElem(heap, parent) = HeapElem(heap, child); + parent = child; + } + HeapElem(heap, parent) = e; +} + +/* Debug tool: verify heap property */ +void +HeapVerify(Heap *heap) +{ + HeapIndex i; + + for (i = 2; i <= HeapSize(heap); i++) + if (HeapCost(heap, i) < HeapCost(heap, HeapParent(i))) + fprintf(stderr, "DEBUG: VerifyHeap failed at elem %d\n", i); +} + +/* Remove and return the minimum cost from the heap. 
*/ +HeapCost * +HeapGetMin(Heap *heap) +{ + HeapIndex lastElem = HeapSize(heap); + HeapCost *retval; + + if (!lastElem) + return NULL; + retval = HeapMinElem(heap); + HeapSize(heap) = lastElem-1; + SiftDown(heap, HeapElem(heap, lastElem)); + return retval; +} + +/* Helper - set heap size, reallocating if needed */ +static void +HeapResize(Heap *heap, HeapIndex newNumElems) +{ + if (newNumElems >= heap->elemsAllocated) { + HeapIndex newAllocSize = heap->elemsAllocated * 2; + + if (newAllocSize <= newNumElems) + newAllocSize = newNumElems + 1; + heap->elems = (HeapCost **)realloc((void *)heap->elems, + sizeof(*heap->elems) * newAllocSize); + if (heap->elems == NULL) { + fprintf(stderr, "Fatal error: Out of memory growing heap\n"); + exit(1); + } + heap->elemsAllocated = newAllocSize; + } + heap->numElems = newNumElems; +} + +/* Add an element to the heap */ +void +HeapInsert(Heap *heap, HeapCost *newElem) +{ + HeapIndex parent, i = ++HeapSize(heap); + HeapCost cost = HeapElemCost(newElem); + + HeapResize(heap, i); + /* Sift up until parent = 0 */ + while ((parent = HeapParent(i)) && HeapCost(heap, parent) > cost) { + HeapElem(heap, i) = HeapElem(heap, parent); + i = parent; + } + heap->elems[i] = newElem; +} + +/* Initialize a new heap */ +void +HeapInit(Heap *heap, HeapIndex initSize) +{ + initSize++; /* Add one for temporary element */ + if (initSize < 1) + initSize = 1; + heap->elems = (HeapCost **)malloc(initSize * sizeof(*heap->elems)); + if (heap->elems == NULL) { + fprintf(stderr, "Fatal error: Out of memory creating heap\n"); + exit(1); + } + heap->elemsAllocated = initSize; + heap->numElems = 0; +} + +/* Free up a heap's resources. 
*/ +void +HeapDestroy(Heap *heap) +{ + free((void *)heap->elems); + heap->elemsAllocated = 0; + heap->numElems = 0; + heap->elems = NULL; +} + +/* + * Local Variables: + * tab-width: 4 + * End: + * vi: ts=4 sw=4 + * vim: si + */ diff --git a/tools/heap.h b/tools/heap.h new file mode 100644 index 0000000..36e8782 --- /dev/null +++ b/tools/heap.h @@ -0,0 +1,43 @@ +/* + * heap.h -- Simple priority queue. Takes pointers to cost values + * (presumably the first field in a larger structure) and returns + * them in increasing order of cost. + * + * Copyright (C) 1997 Pretty Good Privacy, Inc. + * + * Written by Colin Plumb and Mark H. Weaver + * + * $Id: heap.h,v 1.6 1997/10/31 04:22:46 mhw Exp $ + */ + +#ifndef HEAP_H +#define HEAP_H 1 + +#include +#include +#include + +typedef int HeapCost; +#define COST_INFINITY INT_MAX +typedef unsigned HeapIndex; + +typedef struct Heap { + HeapCost **elems; + HeapIndex numElems, elemsAllocated; +} Heap; + +void HeapInit(Heap *heap, HeapIndex initSize); +void HeapDestroy(Heap *heap); +void HeapInsert(Heap *heap, HeapCost *newElem); +HeapCost *HeapGetMin(Heap *heap); +void HeapVerify(Heap *heap); + +#endif + +/* + * Local Variables: + * tab-width: 4 + * End: + * vi: ts=4 sw=4 + * vim: si + */ diff --git a/tools/makemanifest b/tools/makemanifest new file mode 100644 index 0000000..4e8dcc8 --- /dev/null +++ b/tools/makemanifest @@ -0,0 +1,31 @@ +#!/usr/bin/perl + +$fileNum = 0; +while(<>) +{ + /^([VDTB])(\S*)\s+(.*)/ || die("Bad filelist, line $."); + ($type, $options, $name) = ($1, $2, $3); + + if ($type eq "D") + { + $dir = $name; + print "D $dir\n"; + } + elsif ($type eq "V") + { + # Do nothing + } + else + { + $fileNum++; + $tail = $name; + $tail =~ s|^.*/||; + die("Bad filelist, line $.") if $name ne $dir . 
$tail; + print "$fileNum $tail\n"; + } +} + +# +# vi: ai ts=4 +# vim: si +# diff --git a/tools/mempool.c b/tools/mempool.c new file mode 100644 index 0000000..40e3104 --- /dev/null +++ b/tools/mempool.c @@ -0,0 +1,137 @@ +/* + * mempool.c - Pooled memory allocation, similar to GNU obstacks. + * + * $Id: mempool.c,v 1.5 1997/11/13 23:53:08 colin Exp $ + */ +#include +#include +#include +#include /* For malloc() & free() */ + +#include "mempool.h" + +/* + * The memory pool allocation functions + * + * These are based on a linked list of memory blocks, usually of uniform + * size. New memory is allocated from the tail of the current block, + * until that is inadequate, then a new block is allocated. + * The entire pool can be freed at once by calling memPoolFree(). + */ +struct PoolBuf { + struct PoolBuf *next; + unsigned size; + /* Data follows */ +}; + +/* The prototype empty pool, including the default allocation size. */ +static struct MemPool EmptyPool = { 0, 0, 0, 4096, 0 , 0, 0}; + +/* Initialize the pool for first use */ +void +memPoolInit(struct MemPool *pool) +{ + *pool = EmptyPool; +} + +/* Set the pool's purge function */ +void +memPoolSetPurge(struct MemPool *pool, int (*purge)(void *), void *arg) +{ + pool->purge = purge; + pool->purgearg = arg; +} + +/* Free all the memory in the pool */ +void +memPoolEmpty(struct MemPool *pool) +{ + struct PoolBuf *buf; + + while ((buf = pool->head) != 0) { + pool->head = buf->next; + free(buf); + } + pool->freespace = 0; + pool->totalsize = 0; +} + + +/* + * Restore a pool to a marked position, freeing subsequently allocated + * memory. + */ +void +memPoolCutBack(struct MemPool *pool, struct MemPool const *cutback) +{ + struct PoolBuf *buf; + + assert(pool); + assert(cutback); + assert(pool->totalsize >= cutback->totalsize); + + while((buf = pool->head) != cutback->head) { + pool->head = buf->next; + free(buf); + } + *pool = *cutback; +} + +/* + * Allocate a chunk of memory for a structure. 
Alignment is assumed to be + * a power of 2. It could be generalized, if that ever becomes relevant. + * Note that alignment is from the beginning of an allocated chunk, which + * is guaranteed by ANSI to be as aligned as can possibly matter. + */ +void * +memPoolAlloc(struct MemPool *pool, unsigned len, unsigned alignment) +{ + char *p; + unsigned t; + + /* Where to allocate next object */ + p = pool->freeptr; + /* How far it is from the beginning of the chunk. */ + t = p - (char *)pool->head; + /* How much to round up freeptr to make alignment */ + t = -t & --alignment; + + /* Okay, does it fit? */ + if (pool->freespace >= len+t) { + pool->freespace -= len+t; + p += t; + pool->freeptr = p + len; + return p; + } + + /* It does not fit in the current chunk. Go for a bigger chunk. */ + + /* First, figure out how much to skip at the beginning of the chunk */ + alignment &= -(unsigned)sizeof(struct PoolBuf); + alignment += sizeof(struct PoolBuf); + /* Then, figure out a chunk size that will fit */ + t = pool->chunksize; + assert(t); + while (len + alignment > t) + t *= 2; + while ((p = malloc(t)) == 0) { + /* If that didn't work, try purging or smaller allocations */ + if (!pool->purge || !pool->purge(pool->purgearg)) { + t /= 2; + if (len + alignment > t) { + fputs("Out of memory!\n", stderr); + exit (1); } /* Failed: even the smallest usable chunk cannot be allocated */ + } + } + + /* Update the various pointers.
*/ + pool->totalsize += t; + ((struct PoolBuf *)p)->next = pool->head; + ((struct PoolBuf *)p)->size = t; + pool->head = (struct PoolBuf *)p; + pool->freespace = t - len - alignment; + p += alignment; + pool->freeptr = p + len; + + return p; +} diff --git a/tools/mempool.h b/tools/mempool.h new file mode 100644 index 0000000..1732a77 --- /dev/null +++ b/tools/mempool.h @@ -0,0 +1,36 @@ +/* $Id: mempool.h,v 1.2 1997/11/13 23:53:09 colin Exp $ */ + +#ifndef MEMPOOL_H +#define MEMPOOL_H + +typedef struct MemPool { + struct PoolBuf *head; + char *freeptr; + unsigned freespace; + unsigned chunksize; /* Default starting point */ + unsigned long totalsize; + int (*purge)(void *); /* Return non-zero to retry alloc */ + void *purgearg; +} MemPool; + +/* A global pool for miscellaneous stuff. */ +extern struct MemPool MiscPool; + +/* + * Nice clean interfaces + */ +void memPoolInit(struct MemPool *pool); +void memPoolSetPurge(struct MemPool *pool, int (*purge)(void *), void *arg); +void memPoolEmpty(struct MemPool *pool); +void memPoolCutBack(struct MemPool *dest, struct MemPool const *cutback); +void *memPoolAlloc(struct MemPool *pool, unsigned len, unsigned alignment); +#ifdef DEADCODE +char const *memPoolStore(struct MemPool *pool, char const *str); +#endif + +/* Lookie here! An ANSI-compliant alignment finder! */ +#define alignof(type) (sizeof(struct{type _x; char _y;}) - sizeof(type)) + +#define memPoolNew(pool, type) memPoolAlloc(pool, sizeof(type), alignof(type)) + +#endif /* MEMPOOL_H */ diff --git a/tools/munge.c b/tools/munge.c new file mode 100644 index 0000000..965e25a --- /dev/null +++ b/tools/munge.c @@ -0,0 +1,543 @@ +/* + * munge.c -- Program to convert a text file into "munged" form, + * suitable for reconstruction from printed form. Tabs are + * made visible and checksums are added to each line and each + * page to protect against transcription errors. + * + * Copyright (C) 1997 Pretty Good Privacy, Inc. + * + * Designed by Colin Plumb, Mark H.
Weaver, and Philip R. Zimmermann + * Written by Mark H. Weaver + * + * $Id: munge.c,v 1.32 1997/11/12 23:28:53 mhw Exp $ + */ + +#include +#include +#include +#include +#include + +#include "util.h" + +/* + * The file is divided into pages, and the format of each page is + * +--f414 000b2dc79af40010002 Page 1 of munge.c + +bc38e5 /* +40a838 * munge.c -- Program to convert a text file into munged form +647222 * +193f28 * Copyright (C) 1997 Pretty Good Privacy, Inc. +827222 * +699025 * Designed by Colin Plumb, Mark H. Weaver, and Philip R. Zimmermann +0d050c * Written by Mark H. Weaver + * + * Where the first 2 columns are the high 8 bits (in hex) of a running + * CRC-32 of the page (the string "--", unlikely to be confused with + * any digits, indicates a page header line) and the next 4 columns + * are a CRC-16 of the rest of the line. Then a space (not counted in + * the CRC), and the line of text. Tabs are printed as the currency + * symbol (ISO Latin 1 character 164) followed by the appropriate number + * of spaces, and any form feeds are printed as a yen symbol (Latin 1 165). + * The CRC is computed on the transformed line, including the trailing + * newline. No trailing whitespace is permitted. + * + * The header line contains a (hex) number of the form 0ffcccccccctpppnnnn, + * where the digit 0 is a version number, ff are flags, ccccccc is the CRC-32 + * of the page, t is the tab size (usually 4 or 8; 0 for binary files that + * are sent in radix-64), ppp is the product number (usually 1, different + * for different books), and nnnn is the file number (sequential from 1). + * + * This is followed by " Page %u of " and the file name. 
+ */ + +typedef struct MungeState +{ + EncodeFormat const * fmt; + EncodeFormat const * hFmt; + int binaryMode, tabWidth; + long origLineNumber; + long productNumber, fileNumber, pageNumber, lineNumber; + unsigned long fileOffset; + CRC pageCRC; + char const * fileName; + char const * fileNameTail; + char * pageBuffer; /* Buffer large enough to hold one page */ + char * pagePos; /* Current position in pageBuffer */ + word16 hdrFlags; + FILE * file; + FILE * out; +} MungeState; + + +void ChecksumLine(EncodeFormat const *fmt, char const *line, size_t length, + char *prefix, CRC *pageCRC) +{ + CRC lineCRC; + CRC runCRCPart = 0; + + lineCRC = CalculateCRC(fmt->lineCRC, 0, (byte const *)line, length); + if (pageCRC != NULL) + { + *pageCRC = CalculateCRC(fmt->pageCRC, *pageCRC, + (byte const *)line, length); + runCRCPart = RunningCRCFromPageCRC(fmt, *pageCRC); + } + + prefix += EncodeCheckDigits(fmt, runCRCPart, fmt->runningCRCBits, prefix); + prefix += EncodeCheckDigits(fmt, lineCRC, fmt->lineCRC->bits, prefix); + + *prefix++ = ' '; /* Write a space over the null byte */ +} + +/* Returns 1 for convenience */ +int PrintFileError(MungeState *state, char const *message) +{ + fprintf(stderr, "%s in %s %s %lu\n", message, state->fileName, + state->binaryMode ? "offset" : "line", + state->binaryMode ? 
state->fileOffset : state->origLineNumber); + return 1; +} + +int MungeLine(MungeState *state, char *buffer, int length, + char *line, int *bufferUsed) +{ + int i = 0, j = 0, jOld = 0; + char ch; + + for (i = 0; i < length && j < LINE_LENGTH; i++) + { + jOld = j; + ch = buffer[i]; + if (ch == '\t') + { + line[j++] = TAB_CHAR; + if (state->tabWidth < 1) + return PrintFileError(state, + "ERROR: Tab found in radix64 stream"); + else + while (j % state->tabWidth && j < LINE_LENGTH) + line[j++] = TAB_PAD_CHAR; + } + else if (ch == '\n') + { + if (i + 1 < length) + return PrintFileError(state, + "UNEXPECTED ERROR: fgets read past newline!?"); + break; + } + else if (ch == '\f') + { + break; + } + else if (ch == ' ' && (j <= 0 || line[j-1] == ' ' || + line[j-1] == SPACE_CHAR || + i+1 >= length || buffer[i+1] == '\n')) + { + line[j++] = SPACE_CHAR; + } + else if (ch >= ' ' && ch <= '~') + line[j++] = ch; + else + return PrintFileError(state, "ERROR: Non-ASCII char"); + } + + if (i < length && buffer[i] == '\n') + { + i++; + state->origLineNumber++; + } + else if (i < length && buffer[i] == '\f' && j < LINE_LENGTH) + { + i++; + line[j++] = FORMFEED_CHAR; + } + else + { + /* If there's no newline, we need to add the continuation marker */ + if (i > 0 && j >= LINE_LENGTH) + { + /* Remove the last character if we're out of room */ + i--; + j = jOld; + } + line[j++] = CONTIN_CHAR; + } + + /* Strip trailing spaces */ + while (j > 0 && isspace((unsigned char)line[j - 1])) + j--; + + if (j > LINE_LENGTH) /* This should never happen */ + return PrintFileError(state, "ERROR: Internal error, line too long"); + + /* Add trailing newline and NULL */ + line[j++] = '\n'; + line[j++] = '\0'; + + /* Return number of chars used from buffer */ + *bufferUsed = i; + + return 0; +} + +static void +Encode3(byte const src[3], char dest[4]) +{ + dest[0] = radix64Digits[ (src[0]>>2 & 0x3f)]; + dest[1] = radix64Digits[(src[0]<<4 & 0x30) | (src[1]>>4 & 0x0f)]; + dest[2] = radix64Digits[(src[1]<<2 & 
0x3c) | (src[2]>>6 & 0x03)]; + dest[3] = radix64Digits[(src[2] & 0x3f)]; +} + +static int +EncodeLine(byte const *src, int srcLen, char *dest) +{ + char * destp = dest; + byte tempSrc[3]; + + for (; srcLen >= 3; srcLen -= 3) + { + Encode3(src, destp); + src += 3; destp += 4; + } + + if (srcLen > 0) + { + memset(tempSrc, 0, sizeof(tempSrc)); + memcpy(tempSrc, src, srcLen); + Encode3(tempSrc, destp); + src += 3; destp += 4; srcLen -= 3; + while (srcLen < 0) + destp[srcLen++] = RADIX64_END_CHAR; + } + + return destp - dest; +} + +static int +MungeBinaryLine(MungeState *state, byte const *buffer, int length, char *line) +{ + char binLine[128]; + int binLength; /* Destination length */ + int used; + + binLength = EncodeLine(buffer, length, binLine); + + /* Append newline */ + binLine[binLength++] = '\n'; + binLine[binLength] = '\0'; + + return MungeLine(state, binLine, binLength, line, &used); +} + +int MaybePageBreak(MungeState *state) +{ + EncodeFormat const * fmt = state->fmt; + EncodeFormat const * hFmt = state->hFmt; + + if (state->lineNumber >= LINES_PER_PAGE) + { + char line[512]; + char * lineData = line + PREFIX_LENGTH; + char * p = lineData; + + p += EncodeCheckDigits(hFmt, 0, HDR_VERSION_BITS, p); + p += EncodeCheckDigits(hFmt, state->hdrFlags, HDR_FLAG_BITS, p); + p += EncodeCheckDigits(hFmt, state->pageCRC, fmt->pageCRC->bits, p); + p += EncodeCheckDigits(hFmt, state->tabWidth, HDR_TABWIDTH_BITS, p); + p += EncodeCheckDigits(hFmt, state->productNumber, HDR_PRODNUM_BITS, p); + p += EncodeCheckDigits(hFmt, state->fileNumber, HDR_FILENUM_BITS, p); + + sprintf(p, " Page %ld of %s\n", state->pageNumber + 1, + state->fileNameTail); + + if (strlen(lineData) > LINE_LENGTH + 1) + { + PrintFileError(state, "ERROR: Header line too long"); + fprintf(stderr, "> %s", lineData); + return -1; + } + + /* Compute checksums and prefix them to line */ + ChecksumLine(fmt, lineData, strlen(lineData), line, NULL); + + fprintf(state->out, "%c%c%s\n%s\f", HDR_PREFIX_CHAR, +
fmt->headerTypeChar, line + 2, state->pageBuffer); + + state->pageNumber++; + state->lineNumber = 0; + state->pageCRC = 0; + state->pagePos = state->pageBuffer; /* Clear page buffer */ + } + return 0; +} + +/* + * Search for Emacs "tab-width: " marker in file. + * Emacs is stricter about the format, but this will do. + */ +int FindTabWidth(MungeState *state) +{ + char const * const tabWidthMarker = " tab-width: "; + char buffer[512]; + char * p; + int length; + int tabWidth = 0; + + fseek(state->file, -(sizeof(buffer) - 1), SEEK_END); + length = fread(buffer, 1, sizeof(buffer) - 1, state->file); + buffer[length] = '\0'; + p = strstr(buffer, tabWidthMarker); + if (p != NULL) + { + p += strlen(tabWidthMarker); + while (*p != '\0' && *p != '\n' && isspace(*p)) + p++; + tabWidth = strtol(p, &p, 10); + while (*p != '\0' && *p != '\n' && isspace(*p)) + p++; + if (*p != '\n' || tabWidth < 2) + tabWidth = 0; + else if (tabWidth > 16) + fprintf(stderr, "WARNING: Weird tab-width (%d), %s\n", + tabWidth, state->fileName); + } + return tabWidth; +} + +/* + * Open the given source file and send the munged output to the + * FILE *, with the given options.
+ */ +int MungeFile(char const *fileName, FILE *out, EncodeFormat const *fmt, + int binaryMode, int defaultTabWidth, + long productNumber, long fileNumber) +{ + MungeState * state; + int length, used; + char line[PREFIX_LENGTH + LINE_LENGTH + 10]; + char * lineData = line + PREFIX_LENGTH; + char buffer[128]; + int result = 0; + + state = (MungeState *)calloc(1, sizeof(*state)); + state->fmt = fmt; + state->hFmt = &hexFormat; + state->origLineNumber = 1; + state->fileName = fileName; + state->pageCRC = 0; + state->productNumber = productNumber; + state->fileNumber = fileNumber; + state->pageNumber = 0; + state->lineNumber = 0; + state->fileOffset = 0; + state->binaryMode = binaryMode; + state->pageBuffer = malloc(PAGE_BUFFER_SIZE); + state->pageBuffer[0] = '\0'; + state->pagePos = state->pageBuffer; + state->hdrFlags = 0; + state->out = out; + + state->fileNameTail = strrchr(state->fileName, '/'); + if (state->fileNameTail == NULL) + state->fileNameTail = state->fileName; + else + state->fileNameTail++; + + state->file = fopen(state->fileName, binaryMode ? 
"rb" : "r"); + if (state->file == NULL) + { + result = errno; + fprintf(stderr, "ERROR opening %s: %s\n", + state->fileName, strerror(result)); + goto error; + } + + if (state->binaryMode) + { + state->tabWidth = 0; + } + else + { + state->tabWidth = FindTabWidth(state); + if (state->tabWidth == 0) + state->tabWidth = defaultTabWidth; + rewind(state->file); + } + + while (!feof(state->file)) + { + if (state->binaryMode) + { + length = fread(buffer, 1, BYTES_PER_LINE, state->file); + if (length < 1) + { + if (feof(state->file)) + break; + goto fileError; + } + if ((result = MaybePageBreak(state))) + goto error; + if ((result = MungeBinaryLine(state, buffer, length, lineData))) + goto error; + state->fileOffset += length; + } + else + { + if (fgets(buffer, sizeof(buffer), state->file) == NULL) + { + if (feof(state->file)) + break; + goto fileError; + } + length = strlen(buffer); + if ((result = MaybePageBreak(state))) + goto error; + if ((result = MungeLine(state, buffer, length, lineData, &used))) + goto error; + + if (used < length) + if (fseek(state->file, used - length, SEEK_CUR)) + goto fileError; + } + + /* Compute checksums and prefix them to the line */ + ChecksumLine(fmt, lineData, strlen(lineData), line, &state->pageCRC); + + strcpy(state->pagePos, line); + length = strlen(state->pagePos); + /* Suppress trailing whitespace on blank lines */ + if (length == PREFIX_LENGTH+1 && state->pagePos[length-1] == '\n') { + state->pagePos[--length-1] = '\n'; + state->pagePos[length] = '\0'; + } + state->pagePos += length; + + state->lineNumber++; + } + + if (state->lineNumber > 0) + { + /* Force a final page break */ + state->lineNumber = LINES_PER_PAGE; + state->hdrFlags |= HDR_FLAG_LASTPAGE; + if ((result = MaybePageBreak(state))) + goto error; + } + + result = 0; + goto done; + +fileError: + result = ferror(state->file); + +error: +done: + if (state != NULL) + { + if (state->file != NULL) + fclose(state->file); + free(state); + } + return result; +} + +int main(int 
argc, char *argv[]) +{ + int result = 0; + int i, j; + int defaultTabWidth = 4; + int binaryMode = 0; + long productNumber = 1; + long fileNumber = 1; + char * endOfNumber; + EncodeFormat const * fmt = NULL; + + InitUtil(); + + for (i = 1; i < argc && argv[i][0] == '-'; i++) + { + if (0 == strcmp(argv[i], "--")) + { + i++; + break; + } + for (j = 1; argv[i][j] != '\0'; j++) + { + if (isdigit(argv[i][j])) + { + defaultTabWidth = argv[i][j] - '0'; + if (defaultTabWidth < 2 || defaultTabWidth > 9) + fprintf(stderr, "WARNING: Weird default tab-width (%d)\n", + defaultTabWidth); + } + else if (argv[i][j] == 'b') + { + binaryMode = 1; + } + else if (argv[i][j] == 'F') + { + fmt = FindFormat(argv[i][j+1]); + if (!fmt || argv[i][j+2] != '\0') + { + fprintf(stderr, "ERROR: Invalid format char\n"); + exit(1); + } + break; + } + else if (argv[i][j] == 'p') + { + productNumber = strtol(&argv[i][j+1], &endOfNumber, 10); + if (*endOfNumber != '\0') + { + fprintf(stderr, "ERROR: Invalid product number\n"); + exit(1); + } + break; + } + else if (argv[i][j] == 'f') + { + fileNumber = strtol(&argv[i][j+1], &endOfNumber, 10); + if (*endOfNumber != '\0') + { + fprintf(stderr, "ERROR: Invalid file number\n"); + exit(1); + } + break; + } + else + { + fprintf(stderr, "ERROR: Unrecognized option -%c\n", argv[i][j]); + exit(1); + } + } + } + if (!fmt) + fmt = binaryMode ? 
&radix64Format : &hexFormat; + + for (; i < argc; i++) + { + if ((result = MungeFile(argv[i], stdout, fmt, binaryMode, + defaultTabWidth, productNumber, + fileNumber)) != 0) + { + /* If result > 0, message should have already been printed */ + if (result < 0) + fprintf(stderr, "ERROR: %s\n", strerror(result)); + exit(1); + } + fileNumber++; + } + + return 0; +} + +/* + * Local Variables: + * tab-width: 4 + * End: + * vi: ts=4 sw=4 + * vim: si + */ diff --git a/tools/psgen b/tools/psgen new file mode 100644 index 0000000..2848390 --- /dev/null +++ b/tools/psgen @@ -0,0 +1,324 @@ +#!/usr/bin/perl +# +# psgen -- Postscript generator for code portion of source books +# +# Reads in a list of files/dirs from , runs munge on each of +# them, and generates a single postscript file to stdout. The page numbers +# for each file/dir are put into the file . +# +# usage: psgen [ options... ] > foo.ps +# -l +# -p +# -f +# -D (passed to yapp) +# -P +# -o +# -e (auto edit errors) +# +# $Id: psgen,v 1.18 1997/11/13 21:44:16 colin Exp $ +# + +$bookRoot = $ENV{"BOOKROOT"} || "."; +$toolsDir = "$bookRoot/tools"; +$psDir = "$bookRoot/ps"; +$editor = $ENV{"EDITOR"} || "vi"; + +# Configuration settings - external file names +$mungeProg = "$toolsDir/munge"; +$yappProg = "$toolsDir/yapp"; +$preambleFile = "$psDir/prolog.ps"; +$tempFile = "/tmp/psgen-$$"; + +# Parse arguments +$firstLogPage = $firstPhysPage = 0; +$productNumber = 1; +$font = "OCRB"; +$autoEdit = 0; +while ($#ARGV >= 0 && $ARGV[0] =~ /^-/) +{ + $_ = shift @ARGV; + if (/^--$/) + { + last; + } + elsif (/^-l(\d+)$/) + { + $firstLogPage = $1; + } + elsif (/^-p(\d+)$/) + { + $firstPhysPage = $1; + } + elsif (/^-f(.+)$/) + { + $font = $1; + } + elsif (/^-D(.+)$/) + { + $yappDefs .= " " . 
$_; + } + elsif (/^-P(\d+)$/) + { + $productNumber = $1; + } + elsif (/^-o(.+)$/) + { + $mungedOutFile = $1; + } + elsif (/^-e$/) + { + $autoEdit = 1; + } + else + { + &Error("Unrecognized option: '$_'"); + } +} +$fileListFile = shift @ARGV || die "Missing file list argument (arg 1)"; +$pageNumFile = shift @ARGV || die "Missing page number file argument (arg 2)"; +$volume = shift @ARGV || die "Missing volume number argument (arg 3)"; + +# Determine initial page numbers +{ + my $nextLogPage = 1; + my $nextPhysPage = 3; + my $volNum = 0; # Which volume's page numbers we're reading + + if ($volume > 1) + { + open(OLDPAGENUMS, "<$pageNumFile") || die; + while (<OLDPAGENUMS>) + { + if (/^Volume\s+(\d+)$/) + { + $volNum = $1; + } + elsif (/^Next:\s+(\d+)\s*$/ && $volNum == $volume - 1) + { + $nextLogPage = $1; + } + } + close(OLDPAGENUMS); + } + else + { + unlink($pageNumFile); + } + $firstLogPage = $nextLogPage if ($firstLogPage == 0); + $firstPhysPage = $nextPhysPage if ($firstPhysPage == 0); +} + +# Names of PostScript operators invoked. These are the interface +# between this file and the $preambleFile. +$oddPageStartPS = "OddPageStart"; +$evenPageStartPS = "EvenPageStart"; +$oddPageEndPS = "OddPageEnd"; +$evenPageEndPS = "EvenPageEnd"; +$dirPagePS = "DirPage"; +# This is short because it's emitted every line +$linePS = "L"; + +# Handle an error from munge.
+# A result of 0 means to retry, 1 means to exit +sub MungeError +{ + my $result = 1; + + open(FILEH, "<$tempFile") || die; + while (<FILEH>) + { + print STDERR; + if (/ in (.*) line (\d+)$/) + { + my ($fileName, $lineNumber) = ($1, $2); + + if ($autoEdit) + { + my @statResult = stat($fileName); + my $oldMTime = $statResult[9]; + + system("'$editor' '+$lineNumber' '$fileName' 1>&2"); + @statResult = stat($fileName); + $result = ($statResult[9] == $oldMTime); + last; + } + } + } + close(FILEH); + unlink($tempFile) || die "Couldn't unlink $tempFile"; + return $result; +} + +sub CopyFileToPS +{ + local $fileName = $_[0]; + local $args = "'-I$psDir' '-Dfont=$font'"; + local $_; + + $args .= $yappDefs; + open(FILEH, "$yappProg $args '$fileName' |") || die; + while (<FILEH>) + { + print PSOUT $_; + } + close(FILEH) || exit(1); + 1; +} + +# Wrap a string in parens as required by PostScript, with proper quoting. +sub StringPS +{ + local $str = $_[0]; + + $str =~ s/([\\()])/\\$1/g; + "(" . $str . ")"; +} + +# Emit a start of page. The Postscript DSC %%Page: header +# (followed by logical page number, then physical) and +# the top-of-page function (which is passed the page number as a string) +sub PageStartPS +{ + local $pageNum = $_[0]; + + "%%Page: " . ($pageNum + $firstLogPage) . " " . + ($pageNum + $firstPhysPage) . "\n" . + &StringPS($pageNum + $firstLogPage) . + ((($pageNum + $firstLogPage) % 2) ? $oddPageStartPS + : $evenPageStartPS) . "\n"; +} + +sub PageEndPS +{ + local $pageNum = $_[0]; + + ((($pageNum + $firstLogPage) % 2) ? $oddPageEndPS : $evenPageEndPS) . "\n"; +} + +# Save the page number to a table-of-contents file +sub SavePageNum +{ + local ($fileName, $pageNum) = @_; + + print PAGENUMS ($pageNum + $firstLogPage), ": $fileName\n"; +} + +# The main code.
+ +open(PSOUT, ">-") || die; +open(FILELIST, "<$fileListFile") || die; +open(PAGENUMS, ">>$pageNumFile") || die; +if ($mungedOutFile ne "") +{ + open(MUNGEDOUT, ">$mungedOutFile") || die; +} + +print PAGENUMS "Volume $volume\n"; + +&CopyFileToPS($preambleFile); + +$fileNumber = 0; +$pageNum = 0; # This is 0-based, since it is added to $first{Log,Phys}Page +$enable = 0; + +while (<FILELIST>) +{ + /^([VDTB])(\S*)\s+(.*)/ || die "Illegal file list line $."; + + local ($fileType, $options, $arg) = ($1, $2, $3); + + if ($fileType eq "V") + { + @args = split(/\s+/, $arg); + if ($enable = ($args[0] == $volume)) + { + $defaultTabWidth = int($args[1]); + } + } + elsif ($fileType eq "D") + { + next unless $enable; # Do nothing if we're in the wrong volume + $dirName = $arg; + &SavePageNum($dirName, $pageNum); + print PSOUT &PageStartPS($pageNum); + print PSOUT &StringPS($dirName), $dirPagePS, "\n"; + print PSOUT &PageEndPS($pageNum); + $pageNum++; + } + else + { + my $done = 0; + + $fileNumber++; + $fileName = $arg; + next unless $enable; # Do nothing if we're in the wrong volume + &SavePageNum($fileName, $pageNum); + $quotedFileName = $fileName; + $quotedFileName =~ s/'/\\'/g; + $tabWidth = ($options =~ /(\d)/) ? $1 : $defaultTabWidth; + $args = ($fileType eq "B") ?
"-b" : ""; + $args .= " -$tabWidth -p$productNumber -f$fileNumber"; + while (!$done) + { + if (open(FILE, "$mungeProg $args '$quotedFileName' 2>$tempFile |")) + { + $line = ; + print MUNGEDOUT $line; + + while ($line ne "") + { + print PSOUT &PageStartPS($pageNum); + + while ($line ne "" and $line !~ /^\f/) + { + chop $line; + print PSOUT &StringPS($line), $linePS, "\n"; + $line = ; + print MUNGEDOUT $line; + } + $line =~ s/^\f//; + + print PSOUT &PageEndPS($pageNum); + $pageNum++; + } + + if (close(FILE)) + { + $done = 2; + } + else + { + $done = &MungeError(); + } + } + else + { + $done = &MungeError(); + } + } + if ($done == 1) + { + die; + } + } +} + +# Print PostScript DSC trailer with the correct number of pages +print PSOUT "%%Trailer\n%%Pages: ", $pageNum, "\n%%EOF\n"; + +print PAGENUMS "Pages: ", $pageNum, "\n"; +print PAGENUMS "Next: ", ((($pageNum+1) & ~1) + $firstLogPage), "\n"; + +close(PAGENUMS) || die; +close(FILELIST) || die; +close(PSOUT) || die; + +if ($mungedOutFile ne "") +{ + close(MUNGEDOUT) || die; +} + +# +# vi: ai ts=4 +# vim: si +# diff --git a/tools/repair.c b/tools/repair.c new file mode 100644 index 0000000..2cced13 --- /dev/null +++ b/tools/repair.c @@ -0,0 +1,1851 @@ +/* + * repair.c -- Program which reconstructs scanned source, locates errors, + * and tries to fix most of them automatically. If it + * can't, it drops you into an editor on the appropriate + * line for manual correction. + * + * Given a file "foo", this appends corrected output to "foo.out" + * and copies remaining uncorrected input in "foo.in". If "foo.in" + * exists initially, "foo" is ignored and only "foo.in" is processed. + * Thus, re-running it repeatedly, possibly with other correction + * techniques in between, will result in correct output in "foo.out" + * and an empty "foo.in" file. + * + * This can automatically invoke an editor for you on the .in file + * and re-run itself. 
The editor is chosen in the first available way: + * - The -e command-line argument takes a printf() format string to + * format the editor invocation command line with the line number and + * filename. E.g. "emacs +%u %s". %u and %s must appear, in that order. + * - Failing that, the default is "$VISUAL +%u %s" + * - Failing that, the default is "$EDITOR +%u %s" + * - Failing that, the program prints the error location and exits. + * Specifying -e- forces this behaviour. + * + * Copyright (C) 1997 Pretty Good Privacy, Inc. + * + * Designed by Colin Plumb, Mark H. Weaver, and Philip R. Zimmermann + * Written by Colin Plumb + * + * $Id: repair.c,v 1.37 1997/11/14 08:39:40 mhw Exp $ + */ + +#include +#include +#include +#include +#include +#include + +#include "util.h" +#include "heap.h" +#include "mempool.h" +#include "subst.h" + +/* + * The internal form of a substitution. These are stored on + * lists indexed by the first character of the input substitution. + */ +typedef struct Substitution { + struct Substitution *next; + char const *input, *output; + size_t inlen, outlen; + HeapCost cost, cost2; + FilterFunc *filter; + unsigned int index; /* Consecutive serial numbers */ +} Substitution; + +struct Substitution const substNull = { NULL, "", "", 0, 0, 0, 0, 0 }; + +/* + * This might get increased later to support multiple classes of + * substitutions, for different contexts. Currently, only one + * is used. + */ +#define SUBST_CLASSES 1 + +/* List of substitutions, indexed by first character, plus a catch-all */ +Substitution *substitutions[SUBST_CLASSES][0x101]; + +/* + * The pool of Substitution structures. Remains alive for the entire + * execution of the program. 
+ */ +static MemPool substPool; +static Substitution *substFree; +static unsigned int substCount = 1; /* Preallocate 0 to substNull */ +static unsigned int substFirstDynamic; +#define SubstIsDynamic(s) ((s)->index >= substFirstDynamic) +/* Adjust the substitution based on n occurrences this page */ +#define SubstAdjust(s,n) ((s)->cost = (s)->cost2) +/* Is this a nasty-line substitution? */ +#define SubstIsNasty(s) ((s)->cost2 == COST_INFINITY) + +/* Every possible single-character string */ +static char substChars[512]; +#define SubstString(c) (substChars+2*((c)&255)) + +/* Set the list of substitutions to empty */ +static void +SubstInit(void) +{ + unsigned int i, j; + + memPoolInit(&substPool); + substFree = 0; + substCount = 1; /* Number zero is reserved for uncounted substitutions */ + for (i = 0; i < elemsof(substitutions); i++) + for (j = 0; j < elemsof(*substitutions); j++) + substitutions[i][j] = NULL; + + for (i = 0; i < 256; i++) { + substChars[2*i] = (char)i; + substChars[2*i+1] = 0; + } +} + +/* + * For dynamically allocated substitutions, we maintain a free list. + * Each substitution has a unique serial number. These are retained + * if a substitution goes on the free list, to keep substCount from + * ratcheting upwards indefinitely while still guaranteeing uniqueness.
+ */ +static Substitution * +SubstAlloc(void) +{ + struct Substitution *subst = substFree; + + if (subst) { + substFree = subst->next; + } else { + subst = memPoolNew(&substPool, Substitution); + subst->index = substCount++; + } + return subst; +} + +static void +SubstFree(Substitution *subst) +{ + subst->next = substFree; + substFree = subst; +} + +static Substitution * +MakeSubst(char const *input, char const *output, HeapCost cost, HeapCost cost2, + FilterFunc *filter, int class) +{ + struct Substitution *subst, **head; + + subst = SubstAlloc(); + subst->input = input; + subst->output = output; + subst->inlen = strlen(input); + subst->outlen = strlen(output); + subst->cost = cost; + subst->cost2 = cost2; + subst->filter = filter; + + /* + * Ignore certain substitutions when printing stats. + * Identity substitutions, and the tab/space tweaking. + */ + if (strcmp(input, output) == 0 || strcmp(input, TAB_STRING) == 0 || + (input[0] == ' ' && input[1] == 0 && output[0] == 0)) { + if (subst->index == substCount-1) + substCount--; + subst->index = 0; /* Evil hack */ + } + + head = &substitutions[class][input[class] & 255]; + subst->next = *head; + *head = subst; + return subst; +} + +/* + * For each entry in the raw array, turn { "abc", "def", 5 } + * into cost-5 mappings of "a"->"d", "b"->"e" and "c"->"f". + * If the output string is NULL, the characters are deleted. + * An input string of NULL is the end of table delimiter. + */ +static void +SubstSingle(struct RawSubst const *raw, int class) +{ + char const *input, *output; + int i, o; + + while (raw->input) { + input = raw->input; + output = raw->output; + assert(!output || strlen(input) == strlen(output)); + + while (*input) { + i = *input++; + o = output ? *output++ : 0; + (void)MakeSubst(SubstString(i), SubstString(o), + raw->cost, raw->cost2, raw->filter, class); + } + raw++; + } +} + +/* + * For each entry in the raw array, turn { "abc", "def", 5 }
+ * into a cost-5 mapping of "abc"->"def". + * An input string of NULL is the end of table delimiter. + */ +static void +SubstMultiple(struct RawSubst const *raw, int class) +{ + while (raw->input) { + (void)MakeSubst(raw->input, raw->output, raw->cost, raw->cost2, + raw->filter, class); + raw++; + } +} + +/* Build the substitutions table */ +static void +SubstBuild(void) +{ + SubstInit(); + SubstSingle(substSingles, 0); + SubstMultiple(substMultiples, 0); + substFirstDynamic = substCount; +} + +/* + * See if the desired substitution already exists + */ +static Substitution const * +SubstSearch(char const *in, size_t inlen, char const *out, size_t outlen, + int class) +{ + Substitution *subst = substitutions[class][in[0] & 255]; + + for (; subst; subst = subst->next) { + if (subst->inlen == inlen && subst->outlen == outlen && + memcmp(subst->input, in, inlen) == 0 && + memcmp(subst->output, out, outlen) == 0) + return subst; /* Already exists */ + } + return NULL; +} + + +/* + * Create a new dynamic substitution. First search to make + * sure it doesn't already exist. + */ +static Substitution const * +SubstDynamic(char const *in, char const *out, int class) +{ + Substitution const *subst; + + subst = SubstSearch(in, strlen(in), out, strlen(out), class); + return subst ? subst : MakeSubst(in, out, COST_INFINITY, + DYNAMIC_COST_LEARNED, NULL, class); +} + +/* + * Search for the substitution, allocating one if not found. + * The input string is not null-terminated and needs to be copied to + * an allocated buffer. The output string can just be pointer-copied.
+ */ +static Substitution const * +SubstNasty(char const *in, size_t inlen, char const *out, int class) +{ + Substitution const *subst; + char *string; + + if ((subst = SubstSearch(in, inlen, out, strlen(out), class)) != NULL) + return subst; + + if (!(string = malloc(inlen+1))) { + fputs("Out of memory!\n", stderr); + exit(1); + } + memcpy(string, in, inlen); + string[inlen] = 0; + return MakeSubst(string, out, COST_INFINITY, COST_INFINITY, NULL, class); +} + +/* + * The state of the parser. + * Note that this is updated when a ParseNode is *removed* from the heap; + * ParseNodes that are in the heap have ParseStates that reflect the + * state before the substitution has been parsed; this is a copy of the + * parents' state, which is after the parsing. + */ +typedef struct ParseState { + CRC page_crc; /* Computed per-page CRC */ + word16 flags; /* Flags; see below */ + unsigned char pos; /* Position on the line */ +} ParseState; /* 7 bytes, rounded to 8 */ + +/* Flags values */ +#define PS_MASK_PAGENUM 0xC000 /* Digits in header page number (1..3) */ +#define PS_SHIFT_PAGENUM 14 /* Shift for the above */ +#define PS_FLAG_EOL 512 /* Expect \n next */ +#define PS_FLAG_SPACE 256 /* Was last char a space? */ +#define PS_FLAG_TAB 128 /* Tabbing over a column */ +#define PS_FLAG_INHEADER 64 /* Current line is a header */ +#define PS_FLAG_PASTHEADER 32 /* A previous line was a header */ +#define PS_FLAG_BINWS 16 /* In whitespace after binary data */ +#define PS_FLAG_BINEND 8 /* End of binary data */ +#define PS_FLAG_DYNAMIC 4 /* Have used ECC this line */ +#define PS_MASK_FORMAT 3 /* The encoding format (max of 3, for now) */ +#define PS_SHIFT_FORMAT 0 /* Shift for the above */ + +/* Have we started on a second page? Used to force flushing of the first. 
*/ +#define InSecondHeader(ps) \ + ((~(ps)->flags & (PS_FLAG_INHEADER | PS_FLAG_PASTHEADER)) == 0) + +#define PageNumDigits(pn) (((pn)->ps.flags & PS_MASK_PAGENUM) >> PS_SHIFT_PAGENUM) +#define PageNumDigitsIncrement(pn) ((pn)->ps.flags += 1<flags & PS_MASK_FORMAT)>>PS_SHIFT_FORMAT] +#define pnFormat(pn) psFormat(&(pn)->ps) +#define psSetFormat(ps, i) \ + ((ps)->flags = ((ps)->flags & ~PS_MASK_FORMAT) | i << PS_SHIFT_FORMAT) + +typedef struct ParseNode { + HeapCost cost; + unsigned int refcnt; + struct ParseNode *parent; + char const *input; + struct Substitution const *subst; + struct ParseState ps; +} ParseNode; /* 32 bytes */ + +/* A handle for walking backwards through the output stream */ +typedef struct OutputHandle { + ParseNode const *node; + char const *output; + unsigned int pos; +} OutputHandle; + +/* Initialize the handle to point to a node (optionally, a position therein) */ +static void +OutputInit(OutputHandle *oh, ParseNode const *node, char const *p) +{ + oh->node = node; + oh->output = p ? p : node->subst->output + node->subst->outlen; + oh->pos = 0; +} + +/* Get the *previous* byte */ +static int +OutputGetPrev(OutputHandle *oh) +{ + if (!oh->node) + return -1; + for (;;) { + if (oh->output != oh->node->subst->output) { + oh->pos++; + return *--oh->output & 255; + } + oh->node = oh->node->parent; + if (!oh->node) + break; + oh->output = oh->node->subst->output + oh->node->subst->outlen; + } + return -1; +} + +/* Return the character just before the node - trivial handy wrapper */ +static int +OutputPrevChar(ParseNode const *node) +{ + OutputHandle oh; + + OutputInit(&oh, node, NULL); + return OutputGetPrev(&oh); +} + +/* + * Unget the last retrieved character (and return it), or + * -1 if that is impossible. At least one character is + * always ungettable, but after that you're on your own. 
+ */ +static int +OutputUnget(OutputHandle *oh) +{ + if (oh->node && *oh->output) { + oh->pos--; + return *oh->output++ & 255; + } + return -1; +} + +/* The position is useful for comparing two OutputHandles. */ +#define OutputPos(oh) ((oh)->pos) + +/* + * Fill backwards from bufend until you hit the given char. + * Use -1 to get the whole buffer. + */ +static char * +OutputGetUntil(OutputHandle oh, char *bufend, int end) +{ + int c; + + while ((c = OutputGetPrev(&oh)) != -1 && c != end) + *--bufend = (char)c; + return bufend; +} + +/* + * The per-page structure. This is actually global, but describes + * the values kept for each page processed. + */ +typedef struct PerPage { + CRC page_check; + char const *maxpos, *minpos; + unsigned int tabsize; /* Zero means this is a binary page */ + unsigned int lines; + unsigned int retries; /* How many retries since last progress? */ + unsigned int max_retries; /* Maximum number of retries needed. */ +} PerPage; + +PerPage perpage; /* The global */ + +static void +PerPageInit(char const *buf) +{ + perpage.maxpos = perpage.minpos = buf; + perpage.page_check = 0; + perpage.tabsize = 4; /* The default */ + perpage.lines = perpage.retries = perpage.max_retries = 0; +} + +/* + * Is the tab substitution being looked at acceptable? + * It is if its length is exactly what is needed to make the tab + * width come out right. Otherwise, it's junk. + */ +HeapCost +TabFilter(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + int c, tabpos; + OutputHandle oh; + + (void)limit; + if (!perpage.tabsize) + return COST_INFINITY; /* No interest */ + + /* How wide should the tab be?
*/ + tabpos = (int)((parent->ps.pos-PREFIX_LENGTH) % perpage.tabsize); + if ((int)subst->outlen != (int)perpage.tabsize - tabpos) + return COST_INFINITY; + /* The right number - cost if likely, cost2 if unlikely */ + if (subst->cost == subst->cost2) + return subst->cost; + OutputInit(&oh, parent, NULL); + do { + c = OutputGetPrev(&oh); + } while (c == ' '); + return (c == TAB_CHAR) ? subst->cost : subst->cost2; +} + +/* + * Return cost if near blanks (including end-of-line), cost2 if not, and + * the average if there is a blank on one side. There are additional + * versions for upper- and lower-case. _ is considered upper-case, + * as it's often used in macro identifiers. + */ +HeapCost +FilterNearBlanks(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + int c = OutputPrevChar(parent), score = (isspace(c) != 0); + char const *p = parent->input + parent->subst->inlen; + + score += p == limit || isspace((unsigned char)*p) != 0; + return (subst->cost*score + subst->cost2*(2-score))/2; +} + +HeapCost +FilterNearUpper(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + int c = OutputPrevChar(parent), score = (isupper(c) != 0 || c == '_'); + char const *p = parent->input + subst->inlen; + + score += p != limit && (isupper((unsigned char)*p) != 0 || *p == '_'); + return (subst->cost*score + subst->cost2*(2-score))/2; +} + +HeapCost +FilterNearXDigit(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + int c = OutputPrevChar(parent), score = (isxdigit(c) != 0); + char const *p = parent->input + subst->inlen; + + score += p != limit && (isxdigit((unsigned char)*p) != 0); + return (subst->cost*score + subst->cost2*(2-score))/2; +} + +HeapCost +FilterNearLower(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + int c = OutputPrevChar(parent), score = (islower(c) != 0); + char const *p = parent->input + subst->inlen; + + score += p != limit &&
(islower((unsigned char)*p) != 0); + return (subst->cost*score + subst->cost2*(2-score))/2; +} + +/* + * cost2 unless previous character was a space (' ' or SPACE_CHAR). + * Note the & 255, necessary since chars might be signed and SPACE_CHAR + * is in the high (negative) half, but c is an int in the range -1..255. + */ +HeapCost +FilterFollowsSpace(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + int c = OutputPrevChar(parent); + (void)limit; + return (c == ' ' || c == (SPACE_CHAR & 255)) ? subst->cost : subst->cost2; +} + +/* cost2 unless previous character was duplicate of this one */ +HeapCost +FilterAfterRepeat(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + int c = OutputPrevChar(parent); + (void)limit; + return (c == subst->output[0]) ? subst->cost : subst->cost2; +} + +/* cost2 unless probably the closing quote in a char constant */ +HeapCost +FilterCharConst(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + OutputHandle oh; + int c; + + (void)limit; + OutputInit(&oh, parent, NULL); + c = OutputGetPrev(&oh); + c = OutputGetPrev(&oh); + if (c == '\\') + c = OutputGetPrev(&oh); + return (c == '\'') ? subst->cost : subst->cost2; +} + +/* + * If the identifier leading up to the current position contains + * an underscore, then it's likely the current position is an underscore + * as well; return cost. If it does not, it's less likely; return cost2. 
+ */ +HeapCost +FilterLikelyUnderscore(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + OutputHandle oh; + int c; + + (void)limit; + OutputInit(&oh, parent, NULL); + for (;;) { + c = OutputGetPrev(&oh); + if (c == '_') + return subst->cost; + if (!isalnum(c)) + return subst->cost2; + } +} + +/* cost2 unless the following chars seem to be a checksum */ +HeapCost +FilterChecksumFollows(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + int i, score = 0; + char const *p = parent->input + subst->inlen; + + if (limit - p < PREFIX_LENGTH) + return subst->cost2; + if (!isspace((unsigned char)p[PREFIX_LENGTH-1])) + return subst->cost2; + for (i = 0; i < PREFIX_LENGTH-1; i++) + score += (p[i] >= '0' && p[i] <= '9') + (p[i] >= 'a' && p[i] <= 'f'); + i = (score >= PREFIX_LENGTH-2 ? subst->cost : subst->cost2); + /* Magic, since this function is perfect on binary files */ + if (i < COST_INFINITY && perpage.tabsize == 0) + i = 0; + return i; +} + +/* Manage a *big* pool of ParseNodes */ + +struct MemPool nodePool; +struct ParseNode *nodeFreeList = 0; + +/* Prepare for node allocations */ +static void +NodePoolInit(void) +{ + memPoolInit(&nodePool); + nodeFreeList = NULL; +} + +/* Free all nodes in one swell foop */ +static void +NodePoolCleanup(void) +{ + nodeFreeList = NULL; + memPoolEmpty(&nodePool); +} + +/* Allocate a new (uninitialized) node */ +static struct ParseNode * +NodeAlloc(void) +{ + struct ParseNode *node; + + node = nodeFreeList; + if (node) { + nodeFreeList = node->parent; + return node; + } + return memPoolNew(&nodePool, ParseNode); +} + +/* Free a node for reallocation */ +static void +NodeFree(struct ParseNode *node) +{ + node->parent = nodeFreeList; + nodeFreeList = node; +} + +/* + * Decrement a node's reference count, freeing it and + * recursively decrementing its parent's if the count + * goes to zero.
+ */ +static void +NodeRelease(struct ParseNode *node) +{ + struct ParseNode *parent; + assert(node->refcnt); + + while (!--node->refcnt) { + parent = node->parent; + NodeFree(node); + if (!parent) + break; + node = parent; + } +} + +/* Add nodes to the substitution tree */ + +/* Create a child of the given node, with the given properties. */ +static ParseNode * +AddChild(ParseNode *parent, Substitution const *subst, HeapCost cost) +{ + ParseNode *child; + + if (cost == COST_INFINITY) + return 0; + + cost += parent->cost; + child = NodeAlloc(); + *child = *parent; + /* Child is just like parent, except... */ + child->cost = cost; + child->refcnt = 1; /* The heap */ + child->input += subst->inlen; + child->subst = subst; + child->parent = parent; + parent->refcnt++; + return child; +} + +/* Hash table of nasty lines, indexed by per-line CRC */ +struct NastyLine { + struct NastyLine *next; + char const *line; + CRC crc; +}; + +#define NASTY_HASH_SIZE 256 +static struct NastyLine *nastyHash[NASTY_HASH_SIZE]; /* All zero */ + +struct MemPool nastyStrings, nastyStructs; +static CRCPoly const *nastyPoly = &crcCCITTPoly; +/* + * Create a new NastyString entry if it doesn't already exist. 
+ * Note that this expects the string passed to end in a newline which + * IS hashed but NOT stored + */ +static struct NastyLine * +AddNasty(char const *string) +{ + size_t len = strlen(string); /* Including newline */ + CRC crc = CalculateCRC(nastyPoly, 0, (byte const *)string, len); + struct NastyLine *nasty, **nastyp = nastyHash + (crc % NASTY_HASH_SIZE); + char *line; + + /* Search for an existing copy */ + while ((nasty = *nastyp) != NULL) { + if (nasty->crc == crc && + memcmp(nasty->line, string, len-1) == 0 && + nasty->line[len-1] == 0) + return nasty; + nastyp = &nasty->next; + } + /* Create a new structure */ + *nastyp = nasty = memPoolNew(&nastyStructs, struct NastyLine); + nasty->next = NULL; + nasty->line = line = memPoolAlloc(&nastyStrings, len, 1); + nasty->crc = crc; + memcpy(line, string, len-1); + line[len-1] = 0; + return nasty; +} + +static void +RehashNasties(CRCPoly const *poly) +{ + struct NastyLine *cur, *head; + CRC crc; + int i; + size_t len; + + /* Put everything into one list and clear the hash table */ + head = NULL; + for (i = 0; i < (int)elemsof(nastyHash); i++) { + while ((cur = nastyHash[i]) != NULL) { + nastyHash[i] = cur->next; + cur->next = head; + head = cur; + } + } + /* Recompute CRCs for the list and redistribute them among the buckets */ + while (head) { + cur = head; + head = head->next; + len = strlen(cur->line); + crc = CalculateCRC(poly, 0, (byte const *)cur->line, len); + crc = AdvanceCRC(poly, crc, '\n'); + cur->crc = crc; + cur->next = nastyHash[crc % NASTY_HASH_SIZE]; + nastyHash[crc % NASTY_HASH_SIZE] = cur; + } + nastyPoly = poly; +} + +/* Read in the nastylines file */ +static void +ReadNasties(FILE *f) +{ + char buf[128]; + + while (fgets(buf, sizeof(buf)-1, f)) + AddNasty(buf); +} + +/* + * Convert an encoded string to binary. + * No error checking is performed. 
+ */ +static word32 +GetWord32(EncodeFormat const *format, char const *buf, int len) +{ + word32 w = 0; + + while (len--) + w = (w<<format->bitsPerDigit) + DecodeDigit(format, *buf++); + return w; +} + +/* Attempt nasty line substitutions */ +static void +TryNasty(struct ParseNode *parent, Heap *heap, char const *limit) +{ + struct NastyLine const *nasty; + struct Substitution const *subst; + struct ParseNode *child; + char const *end; + EncodeFormat const *format = pnFormat(parent); + OutputHandle oh; + char buf[4]; + CRC check; + int i; + + /* Make sure the lines are hashed properly */ + if (nastyPoly != format->lineCRC) + RehashNasties(format->lineCRC); + + /* Get the line to be replaced */ + assert(parent->ps.pos == PREFIX_LENGTH); + end = memchr(parent->input, '\n', limit - parent->input); + if (!end) + end = limit; + /* Get the line's check value */ + OutputInit(&oh, parent, NULL); + (void)OutputGetPrev(&oh); + i = 4; + while (--i) + buf[i] = OutputGetPrev(&oh); + check = GetWord32(format, buf, 4); + /* Find the matches */ + nasty = nastyHash[check % NASTY_HASH_SIZE]; + for (; nasty; nasty = nasty->next) { + if (nasty->crc == check) { + subst = SubstNasty(parent->input, end-parent->input, + nasty->line, 0); + if (subst) { + child = AddChild(parent, subst, NASTY_COST); + if (child) { + child->ps.flags |= PS_FLAG_DYNAMIC; + HeapInsert(heap, &child->cost); + } + } + } + } +} + +/* + * Form all of a ParseNode's children and add them to the heap. + * Limit is the limit of allowable lookahead.
+ */ +static void +AddChildren(ParseNode *parent, Heap *heap, char const *limit) +{ + char c = parent->input[0]; + Substitution *subst = substitutions[0][c & 255]; + ParseNode *child; + HeapCost cost; + +/* If you want to make pure insertion substitutions, do that here */ + + assert(parent->input < limit); /* We always have at least one char */ + + while (subst) { + if (subst->inlen == 1 || /* Easy case */ + ((size_t)(limit-parent->input) >= subst->inlen && + memcmp(subst->input, parent->input, subst->inlen) == 0)) + { + cost = subst->cost; + if (subst->filter) + cost = subst->filter(parent, limit, subst); + child = AddChild(parent, subst, cost); + if (child) + HeapInsert(heap, &child->cost); + } + subst = subst->next; + } + + /* Whole-line substitutions */ + if (parent->ps.pos == PREFIX_LENGTH) + TryNasty(parent, heap, limit); +} + + +/* cost if this line has a dynamic substitution, otherwise cost2 */ +HeapCost +FilterIsDynamic(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + (void)limit; + return (parent->ps.flags & PS_FLAG_DYNAMIC) ? subst->cost : subst->cost2; +} + +/* cost if the current page is binary mode, else cost2 */ +HeapCost +FilterIsBinary(struct ParseNode *parent, char const *limit, + struct Substitution const *subst) +{ + (void)parent; (void)limit; + return perpage.tabsize ? subst->cost2 : subst->cost; +} + +/* Debugging utility */ +#define DEBUG 1 /* Set to 1 to print every line considered */ + +static size_t lastlen = 0; + +static void +OverstrikeLine(char const *line, size_t len) +{ + static size_t lastlen = 0; + int blanklen; + + if (!line) { + if (lastlen) + putchar('\n'); + lastlen = 0; + } else if (len || lastlen) { + if (len > 79) + len = 79; + blanklen = (lastlen > len) ? 
(int)lastlen - len : 0; + printf("%.*s%*s\r", (int)len, line, blanklen, ""); + fflush(stdout); + lastlen = len; + } +} + +/* Print everything, for debugging */ +static void +PrintLine(char const *line, size_t len) +{ + if (line) { + printf("%.*s\n", (int)len, line); + lastlen = 0; + } +} + +static HeapCost ParseAdvanceString(Heap *heap, ParseNode *pn); + +/* + * Copy the parsechain from tail up to root, and hang it off of + * newroot, adjusting the costs and parse state accordingly. Returns + * NULL if it is unable to (invalid parse, too expensive, etc.) + * Note that as per the convention, ParseAdvanceString is *not* called + * on the new tail node (but is called on all its parents). + */ +static ParseNode * +CopyParse(ParseNode const *tail, ParseNode const *root, ParseNode *newroot) +{ + ParseNode *newtail, *parent; + + if (tail == root) + return newroot; + parent = CopyParse(tail->parent, root, newroot); + if (!parent) + return NULL; + newtail = AddChild(parent, tail->subst, ParseAdvanceString(NULL, parent)); + NodeRelease(parent); + return newtail; +} + +/* + * Replace oldnode with a dynamic substitution for newchar, if possible, + * and fill in the chain down to "tail" just like before, but with no branches. + * Add the resultant ParseNode to the heap. + */ +static void +AddDynamic(Heap *heap, ParseNode const *oldnode, ParseNode const *tail, + int newchar) +{ + Substitution const *subst = oldnode->subst; + ParseNode *newnode; + + /* Only replace one-character substitutions */ + if (subst->outlen != 1) + return; + + subst = SubstDynamic(oldnode->subst->input, SubstString(newchar), 0); + newnode = AddChild(oldnode->parent, subst, -1); /* Try it immediately */ + if (newnode) { + newnode->ps.flags |= PS_FLAG_DYNAMIC; + newnode = CopyParse(tail, oldnode, newnode); + if (newnode) + HeapInsert(heap, &newnode->cost); + } +} + +/* + * Do the same, at a given (1-based) position on the line. 
Owing to + * a minor glitch, we must never count the tail node, as this has not + * been parsed yet, so its oldnode->ps.pos field is inaccurate. + */ +static void +AddDynamicAt(Heap *heap, int position, ParseNode const *tail, int newchar) +{ + ParseNode const *oldnode = tail; + + do { + oldnode = oldnode->parent; + } while (oldnode->ps.pos > position); + + if (oldnode->ps.pos == position) + AddDynamic(heap, oldnode, tail, newchar); +} + +/* + * Given the computed and input check fields, correct the header field + * that *ends* at the given pos. This can be used for both the line and + * page CRC errors by just changing the pos. (It relies on the fact + * that the page CRC fragment fits into the LineCRC type.) + * It also relies on the fact that the CRC is at most 4 digits. + */ +static void +ErrorCorrectHeader(Heap *heap, ParseNode const *tail, int pos, + EncodeFormat const *format, CRC crc, CRC check) +{ + CRC syndrome = crc ^ check; + + /* Find the position and the crc digit at that position */ + while (syndrome >= (CRC)format->radix) { + if (syndrome & (CRC)(format->radix - 1)) + return; /* uncorrectable */ + pos--; + crc >>= format->bitsPerDigit; + syndrome >>= format->bitsPerDigit; + } + /* Paste in the correct digit */ + AddDynamicAt(heap, pos, tail, EncodeDigit(format, crc & (format->radix-1))); +} + +/* + * This function walks back through the line, and if the line CRC could be + * made correct by changing a character to another legal character, + * the change is added (on probation) to the substitution table. 
+ */ +static void +ErrorCorrect(Heap *heap, OutputHandle oh, EncodeFormat const *format, + CRC syndrome) +{ + ParseNode const *tail = oh.node; + int c; + + syndrome = ReverseCRC(format->lineCRC, syndrome, 0); + while (oh.node->ps.pos > PREFIX_LENGTH) { + c = OutputGetPrev(&oh); + if (c == '\n' || c == -1) { /* Can't happen */ + printf("Line ended at pos %d\n", oh.node->ps.pos); + return; + } + syndrome = ReverseCRC(format->lineCRC, syndrome, 0); + if (syndrome >= 0x100 || !substitutions[0][c^syndrome] || + oh.node->subst->outlen != 1) + continue; + AddDynamic(heap, oh.node, tail, c^syndrome); + } +} + +/* + * Parsing operations. This is a rather ugly and ad-hoc parser that + * knows a lot about the fixed-field format produced by the munge + * utility. The main state variable is the position in + * the line, which controls the expected header, the position of + * tab stops, and the maximum permissible line length. + */ +#define OCCASIONALLY 100 + +/* Set up a ParseState to top-of-page */ +static void +ParseStateInit(ParseState *ps) +{ + static struct ParseState const parseNull = { 0, 0, 0 }; + *ps = parseNull; +} + +/* + * Try to accept a newline, checking CRCs and even doing error-correction + * as appropriate. 
+ */ +static int +ParseNewline(Heap *heap, ParseNode *pn, char const *string) +{ + OutputHandle oh; + int c; + char debugbuf[PREFIX_LENGTH+LINE_LENGTH+10]; + char *header, *body, *end; + int pos, width; + CRC crc, check; + ParseNode *temp; + static int occasionally = OCCASIONALLY; + EncodeFormat const *format = pnFormat(pn); + EncodeFormat const *headerFormat = &hexFormat; + + /* Get the line into a buffer for analysis */ + OutputInit(&oh, pn, string); + end = debugbuf + sizeof(debugbuf)-1; + header = OutputGetUntil(oh, end, '\n'); + /* Strip leading and trailing whitespace */ + while (header < end && isspace((unsigned char)header[0])) + header++; + while (header < end && isspace((unsigned char)end[-1])) + end--; + *end++ = '\n'; + + /* Start of checksummed area */ + body = header + PREFIX_LENGTH; + /* Blank lines are missing the trailing space from the prefix */ + if (body >= end) + body = end-1; + + crc = CalculateCRC(format->lineCRC, 0, body, end-body); + check = GetWord32(format, header+2, 4); + if (crc != check) { + if (!--occasionally) { + OverstrikeLine(header, end-header-1); + occasionally = OCCASIONALLY; + } + /* Try ECC on the line */ + /* If we haven't already tried ECC on the line... 
*/ + if (!(pn->ps.flags & PS_FLAG_DYNAMIC)) { + ErrorCorrectHeader(heap, pn, PREFIX_LENGTH-1, format, crc, check); + ErrorCorrect(heap, oh, format, crc ^ check); + } + return COST_INFINITY; + } + /* Good enough that we always print it */ + OverstrikeLine(header, end-header-1); + + /* Okay, now there are two cases - header line or running CRC */ + if (pn->ps.flags & PS_FLAG_INHEADER) { + /* Do things for first header */ + if (!(pn->ps.flags & PS_FLAG_PASTHEADER)) { + /* Check version number */ + width = EncodedLength(headerFormat, HDR_VERSION_BITS); + c = (int)GetWord32(&hexFormat, body, width); + if (c != 0) { + fputs("Fatal: you need a newer version of repair" + " to process this file\n", stderr); + exit(1); + } + /* Suck in page CRC, after version & flags */ + pos = width + EncodedLength(headerFormat, HDR_FLAG_BITS); + width = EncodedLength(headerFormat, format->pageCRC->bits); + perpage.page_check = GetWord32(&hexFormat, body+pos, width); + /* Get tab size */ + pos += width; + width = EncodedLength(headerFormat, HDR_TABWIDTH_BITS); + perpage.tabsize = GetWord32(&hexFormat, body+pos, width); + + /* Once we have the header, don't reconsider */ + if (!(pn->ps.flags & PS_FLAG_PASTHEADER)) + while ((temp = (ParseNode *)HeapGetMin(heap)) != NULL) + NodeRelease(temp); + pn->ps.page_crc = 0; /* Clear for top of page */ + } + } else { + /* Check the CRC-32 */ + crc = CalculateCRC(format->pageCRC, pn->ps.page_crc, body, end-body); + pn->ps.page_crc = crc; + crc = RunningCRCFromPageCRC(format, crc); + check = GetWord32(format, header, 2); + if (crc != check) { + if (!(pn->ps.flags & PS_FLAG_DYNAMIC)) + ErrorCorrectHeader(heap, pn, 2, format, crc, check); + return COST_INFINITY; + } + } + + /* Hey, it's correct! 
+ */ + PrintLine(header, end-header-1); + + /* Start next line */ + pn->ps.pos = 0; + /* Clear most other flags, but we *have* got a header */ + c = pn->ps.flags & PS_FLAG_DYNAMIC; + pn->ps.flags &= PS_FLAG_BINEND | PS_MASK_FORMAT; + pn->ps.flags |= PS_FLAG_PASTHEADER; + /* + * Give a bonus to the next line for having completed this one, + * less if it was dynamically fixed. + */ + return c ? COST_LINE : COST_LINE*2/3; +} + +/* + * Advance the parse state with pointed-to character. Returns + * COST_INFINITY if an impossible state is reached, otherwise returns a + * cost value. (Normally 0, this can be increased to penalize unlikely + * output combinations to nudge the correction in a certain direction.) + */ +static HeapCost +ParseAdvance(Heap *heap, ParseNode *pn, char const *string) +{ + int i, retval = 0; + char c = *string; + EncodeFormat const *format = pnFormat(pn); + + /* + * Insist on spaces being correctly converted to SPACE_CHAR. + * There's a little irregularity just before EOL. + * Line continuation and formfeed are also only legal at EOL. 
+ */ + if (c == ' ') { + if (pn->ps.flags & PS_FLAG_SPACE && !(pn->ps.flags & PS_FLAG_TAB)) + pn->ps.flags |= PS_FLAG_EOL; + pn->ps.flags |= PS_FLAG_SPACE; + } else if (pn->ps.flags & PS_FLAG_EOL) { + if (c != '\n') + return COST_INFINITY; + } else if (c == SPACE_CHAR) { + if (!(pn->ps.flags & PS_FLAG_SPACE)) + pn->ps.flags |= PS_FLAG_EOL; + } else if (c == CONTIN_CHAR || c == FORMFEED_CHAR) { + pn->ps.flags |= PS_FLAG_EOL; + } else { + pn->ps.flags &= ~PS_FLAG_SPACE; + } + + switch (pn->ps.pos) { + case 0: + if (c == ' ' || c == '\n') { + break; /* Ignore ws and blank lines completely */ + } else if (c == '\f' || c == HDR_PREFIX_CHAR) { + /* Start of a new page */ + pn->ps.flags |= PS_FLAG_INHEADER; /* Expect header next */ + if (c == '\f') + break; + /* And fall through to increment pos */ + } else if (pn->ps.flags & PS_FLAG_INHEADER || + pn->ps.flags & PS_FLAG_BINEND || + !(pn->ps.flags & PS_FLAG_PASTHEADER) || + DecodeDigit(format, c) < 0) { + return COST_INFINITY; /* Various illegal cases */ + } + pn->ps.pos++; + break; + case 1: + if ((pn->ps.flags & PS_FLAG_INHEADER)) { + format = FindFormat(c); /* Second char of header */ + if (!format) + return COST_INFINITY; + i = registerFormat(format); + psSetFormat(&pn->ps, i); + pn->ps.pos++; + break; + } + if (DecodeDigit(format, c) < 0) + return COST_INFINITY; /* Illegal */ + pn->ps.pos++; + break; + case 2: + case 3: + case 4: +#if PREFIX_LENGTH != 7 +#error fix this code +#endif + case PREFIX_LENGTH-2: + if (DecodeDigit(format, c) < 0) + return COST_INFINITY; /* Illegal */ + pn->ps.pos++; + break; + case PREFIX_LENGTH-1: + if (c == ' ') { + pn->ps.pos++; + break; + } else if (c != '\n') { + return COST_INFINITY; + } + /* Blank lines may be missing this space char */ + /*FALLTHROUGH*/ + /* The normal line starts here, at position 7 */ + default: + if (pn->ps.flags & PS_FLAG_INHEADER) { /* Header line */ + /* Format is "--abcd 0123456789abcdef012 Page %u of %s" */ + int off = pn->ps.pos - 
(PREFIX_LENGTH+HDR_ENC_LENGTH); + /* Offset relative to end of hex header */ + if (off < 0) { + if (HexDigitValue(c & 255) < 0) + return COST_INFINITY; + } else if (off < 6) { + if (c != " Page "[off]) /* Yes, this is legal C */ + return COST_INFINITY; + } else if (off == 6) { + if (c < '1' || c > '9') /* First digit of page no. */ + return COST_INFINITY; + } else { + /* Re-base to end of scanned part of page number */ + off -= 7 + PageNumDigits(pn); + if (off == 0) { + if (c >= '0' && c <= '9' && PageNumDigits(pn) < 3) + PageNumDigitsIncrement(pn); + else if (c != ' ') + return COST_INFINITY; + } else if (off < 4) { + if (c != " of "[off]) + return COST_INFINITY; + } else if (off == 4) { + if (!isgraph(c)) + return COST_INFINITY; + } else if (c < ' ' || (c & 255) > '~') { + if (c != '\n') + return COST_INFINITY; + return ParseNewline(heap, pn, string); + } + } + } else if (!perpage.tabsize) { /* Radix-64 line */ + /* Line is "RlNFVF9UQU== \n" */ + if (isspace(c & 255)) { + if (!(pn->ps.flags & PS_FLAG_BINWS)) { + if ((pn->ps.pos - PREFIX_LENGTH) % 4 != 0) + return COST_INFINITY; + pn->ps.flags |= PS_FLAG_BINWS; + if (pn->ps.pos - PREFIX_LENGTH < BYTES_PER_LINE*4/3) + pn->ps.flags |= PS_FLAG_BINEND; + } + if (c == '\n') + return ParseNewline(heap, pn, string); + } else if (pn->ps.flags & PS_FLAG_BINWS) { + return COST_INFINITY; + } else if (c == RADIX64_END_CHAR) { + if ((pn->ps.pos - PREFIX_LENGTH) % 4 < 2) + return COST_INFINITY; + pn->ps.flags |= PS_FLAG_BINEND; + } else if (pn->ps.flags & PS_FLAG_BINEND) { + return COST_INFINITY; + } else if (Radix64DigitValue(c) < 0) { + return COST_INFINITY; + } + } else { /* Normal line */ + /* Make sure tab stops come out right */ + if (pn->ps.flags & PS_FLAG_TAB) { + if (((pn->ps.pos - PREFIX_LENGTH) % perpage.tabsize) == 0) + pn->ps.flags &= ~PS_FLAG_TAB; + else if (c != TAB_PAD_CHAR && c != '\n') { + return COST_INFINITY; /* Illegal */ + } + } + /* + * Yes, this code has hard-coded ASCII assumptions + * It knows that the 
acceptable range of '\n', ' '..'~', + * TAB_CHAR, FORMFEED_CHAR is in that order. + * Signed char machines have it backwards, to be confusing. + */ + if ((c & 255) < ' ') { + /* Newline! (Or something illegal) */ + if (c != '\n') + return COST_INFINITY; + return ParseNewline(heap, pn, string); + } + /* A normal character */ + if ((c & 255) > '~') { + if (pn->ps.flags & PS_FLAG_INHEADER) + return COST_INFINITY; /* Illegal */ + if (c == TAB_CHAR) + pn->ps.flags |= PS_FLAG_TAB; + else if (c != FORMFEED_CHAR && c != SPACE_CHAR && + c != CONTIN_CHAR) + return COST_INFINITY; /* Illegal */ + } + } + if (++pn->ps.pos > PREFIX_LENGTH + LINE_LENGTH) + return COST_INFINITY; + break; + } + return retval; +} + +/* + * Run the parser over the string in a ParseNode (using repeated calls + * to ParseAdvance). Return the penalty cost, or COST_INFINITY if + * it's impossible + */ +static HeapCost +ParseAdvanceString(Heap *heap, ParseNode *pn) +{ + HeapCost cost, total = 0; + char const *string = pn->subst->output; + + while (*string) { + cost = ParseAdvance(heap, pn, string++); + if (cost == COST_INFINITY) + return cost; + total += cost; + } + return total; +} + +static unsigned int *globalStats = NULL; +static unsigned globalSize = 0; +static unsigned globalEdits = 0; + +/* + * This walks the list of substitutions, performing two tasks with + * the statistics gathered. + * + * First, although not essential, it prints any interesting changes + * (non-identity substitutions) made, and a count of the total number + * of substitutions (including identity) as an approximate character count. + * + * Second, it does maintenance on dynamic (learned during program + * execution) substitutions. It discards any substitutions that end + * up unused, and computes nice costs for the others, based on the + * global (per-file) statistics. + * + * (This function is also called at the end to print the per-file stats, + * which does redundant weight adjustment, but it's harmless.) 
+ */ +static void +UseStats(unsigned int *stats, FILE *log) +{ + unsigned int i, j, n, changes = 0; + unsigned long grand = 0; + Substitution *s, **sp; + + if (!stats) + return; + + /* Yes, this loop is permuted on purpose */ + for (j = 0; j < elemsof(*substitutions); j++) { + for (i = 0; i < elemsof(substitutions); i++) { + sp = &substitutions[i][j]; + while ((s = *sp) != 0) { + grand += n = stats[s->index]; + /* Retain or purge dynamic substitutions, depending. */ + if (SubstIsDynamic(s)) { + if (n) { + SubstAdjust(s, n); + } else if (!globalStats[s->index]) { + /* Forget unused dynamic substitutions */ + *sp = s->next; + if (SubstIsNasty(s)) + free((char *)s->input); /* Dynamically allocated */ + SubstFree(s); + continue; + } + } + sp = &s->next; + /* + * Print interesting substitutions. Some boring substitutions, + * flagged with an index value of zero, are not printed. + */ + if (!s->index || !n) + continue; + changes += n; + fprintf(log, "\t%2ux \"%.*s\"%*s-> \"%.*s\"%*s(cost ", + stats[s->index], (int)s->inlen, s->input, + s->inlen>3 ? 0 : 3-(int)s->inlen, "", + (int)s->outlen, s->output, + s->outlen>3 ? 0 : 3-(int)s->outlen, ""); + fprintf(log, s->cost == COST_INFINITY ? "-" : "%d", s->cost); + if (s->filter) + fprintf(log, s->cost2 == COST_INFINITY ? "/-" : "/%d", + s->cost2); + fputs(SubstIsDynamic(s) ? 
") ** LEARNED **\n" : ")\n", log); + } + } + } + fprintf(log, "\tTotal: %u changes (out of %lu)\n", changes, grand); +} + +static void +DoStats(ParseNode const *node, unsigned int page, FILE *log) +{ + unsigned int *stats; + unsigned int n; + + /* Enlarge global stats if needed */ + if (globalSize < substCount) { + stats = realloc(globalStats, substCount * sizeof(*stats)); + if (!stats) { + fputs("Fatal error: out of memory for stats!\n", stderr); + exit(1); + } + for (n = globalSize; n < substCount; n++) + stats[n] = 0; + globalStats = stats; + globalSize = substCount; + } + + /* Allocate per-page stats */ + stats = calloc(substCount, sizeof(*stats)); + if (!stats) { + fputs("Fatal error: out of memory for stats!\n", stderr); + exit(1); + } + /* Cheat and assume that calloc() initializes unsigned ints to zero */ + while (node) { + stats[node->subst->index]++; + node = node->parent; + } + + /* Keep the global counts accurate */ + for (n = 0; n < substCount; n++) + globalStats[n] += stats[n]; + + fprintf(log, "Page %u substitutions:\n", page); + UseStats(stats, log); + + free(stats); +} + +/* Spit out a page of data (needs work). 
Returns number of lines */ +static unsigned +PrintPage(OutputHandle oh, FILE *out) +{ + char pagebuf[PAGE_BUFFER_SIZE]; + char *p1; /* Beginning of current line */ + char *p2; /* End of current line (WS stripped) */ + char *p3; /* End of current line (newline) */ + char *p4; /* End of all output */ + unsigned lines = 0; + + p4 = pagebuf + sizeof(pagebuf); + p1 = OutputGetUntil(oh, p4, -1); + + /* Output the lines without leading & trailing whitespace */ + while (p1 < p4) { + /* Identify the line */ + p3 = memchr(p1, '\n', p4-p1); + if (!p3) + p3 = p4; + /* Delete leading whitespace */ + while (isspace((unsigned char)*p1) && p1 < p3) + p1++; + /* Delete trailing whitespace */ + p2 = p3; + while (isspace((unsigned char)p2[-1]) && p1 < p2) + p2--; + /* Spit out this line */ + fwrite(p1, 1, (size_t)(p2-p1), out); + putc('\n', out); + /* Advance p1 past the newline */ + p1 = p3 + 1; + lines++; + } + return lines; +} + +static volatile int interrupt = 0; +static void (* volatile oldhandler)(int) = SIG_DFL; + +static void inthandler(int sig) +{ + if (++interrupt > 2) + (void)signal(sig, oldhandler); +} + +/* + * Given a buffer, process a page from it and try to write a corrected page to + * the out file. Return the number of bytes accessed. (0 if it was unable + * to make any corrections.) + */ +static size_t +DoPage(char const *buf, size_t len, FILE *out, unsigned int page, FILE *log) +{ + ParseNode *node; + Heap heap; + HeapCost cost; + OutputHandle oh; + void (*sighandler)(int); + + HeapInit(&heap, 1000); + NodePoolInit(); + PerPageInit(buf); + + /* Initialize signal handling */ + interrupt = 0; + sighandler = signal(SIGINT, inthandler); + if (sighandler != inthandler) + oldhandler = sighandler; + + /* Make a root node */ + node = NodeAlloc(); + node->cost = 0; + node->refcnt = 1; + node->input = buf; + node->subst = &substNull; + ParseStateInit(&node->ps); + node->parent = NULL; + + HeapInsert(&heap, &node->cost); + + /* The main loop: try to extend the current parse. 
*/ + while ((node = (ParseNode *)HeapGetMin(&heap)) != NULL) { + cost = ParseAdvanceString(&heap, node); + if (cost != COST_INFINITY) { + /* End of the file, or hit a second header line? */ + if (node->input == buf+len || InSecondHeader(&node->ps)) { + /* Try to wrap up page, if page CRC works */ + if (node->ps.page_crc == perpage.page_check) { + /* Success! */ + HeapDestroy(&heap); + OutputInit(&oh, node, NULL); + OverstrikeLine("", 0); + + if (InSecondHeader(&node->ps)) { + /* Back up to last newline */ + OutputInit(&oh, node, NULL); + while (OutputGetPrev(&oh) != '\n') + ; + OutputUnget(&oh); + } + /* oh points to node that emitted last char on page */ + len = oh.node->input - buf; /* Chars eaten this page */ + perpage.lines = PrintPage(oh, out); + DoStats(oh.node, page, log); + NodePoolCleanup(); + return len; + } + } else { + /* Keep working on the page */ + node->cost = cost += node->cost; + if (node->input > perpage.maxpos) { + perpage.maxpos = perpage.minpos = node->input; + if (perpage.max_retries < perpage.retries) + perpage.max_retries = perpage.retries; + perpage.retries = 0; /* Made progress */ + } else if (node->input < perpage.minpos) { + perpage.minpos = node->input; /* Furthest backtrack */ + } + ++perpage.retries; + if (heap.numElems > MAX_HEAP || interrupt) + HeapDestroy(&heap); + else + AddChildren(node, &heap, buf+len); + } + } + NodeRelease(node); + } + /* Failed! */ + OverstrikeLine(NULL, 0); + puts("Stopping for manual edit."); + + NodePoolCleanup(); + /* Get rid of the dynamic substitutions */ + DoStats(NULL, page, log); + + return 0; +} + +/* The magic file-shuffling routine. 
*/ +static int +RepairFile(char const *name, char const *editor, char const *nastylines) +{ + char buf[PAGE_BUFFER_SIZE]; + char *filename; + char const *p; + size_t namelen; + FILE *in = 0, *out = 0, *dump = 0, *log = 0; + size_t inbytes; /* Bytes in input buffer */ + size_t outbytes; /* Bytes taken from input buffer */ + unsigned int pages = 0; /* # of pages processed */ + unsigned int lines = 0; /* # of lines processed (until trouble) */ + unsigned int minline, maxline; /* Where is the error? */ + int giveup; /* Have we had to abort corrections? */ + int err; /* Copy of errno for returns */ + + globalSize = 0; /* Reset global stats */ + + namelen = strlen(name); + if (!(filename = malloc(namelen+10))) { + p = "Unable to allocate memory\n"; + goto error; + } + + memcpy(filename, name, namelen); + strcpy(filename+namelen, ".log"); + puts(filename); + if (!(log = fopen(filename, "at"))) { + p = "Unable to open log file \"%s\"\n"; + goto error; + } + + strcpy(filename+namelen, ".out"); + puts(filename); + if (!(out = fopen(filename, "at"))) { + p = "Unable to open output file \"%s\"\n"; + goto error; + } + +retry: + /* Read in any new nasty lines */ + if (!(in = fopen(nastylines, "rt"))) { + fprintf(stderr, "Unable to open nasty lines file \"%s\"\n", nastylines); + } else { + ReadNasties(in); + fclose(in); + } + /* Try to open input file - .in or original */ + p = filename; + strcpy(filename+namelen, ".in"); + if (!(in = fopen(filename, "rt"))) { + if (!(in = fopen(name, "rt"))) { + filename[namelen] = 0; + p = "Unable to open input file \"%s\"\n"; + goto error; + } + p = name; + } + printf("Repairing from %s\n", p); + strcpy(filename+namelen, ".dmp"); + if (!(dump = fopen(filename, "wt"))) { + p = "Unable to open output file \"%s\"\n"; + goto error; + } + + giveup = 0; + inbytes = 0; /* Bytes already at the front of the buffer */ + /* Append more data from the file */ + while ((inbytes += fread(buf+inbytes, 1, sizeof(buf)-inbytes, in)) != 0) { + if (giveup) { + /* 
Giving up mode - just copy through */ + outbytes = fwrite(buf, 1, inbytes, dump); + if (!outbytes) { + p = "Error writing dump file!\n"; + goto error; + } + } else { + outbytes = DoPage(buf, inbytes, out, pages+1, log); + NodePoolCleanup(); + if (outbytes) { + pages++; + lines += perpage.lines; + } else { /* Failed */ + /* Find range of backtracking for error location */ + minline = 1; + for (p = buf; p < perpage.minpos; p++) + minline += (*p == '\n'); + for (maxline = minline; p < perpage.maxpos; p++) + maxline += (*p == '\n'); + giveup = 1; + } + } + /* Fewer bytes now in the buffer */ + inbytes -= outbytes; + /* Move those bytes to the front again */ + memmove(buf, buf+outbytes, inbytes); + } + + fclose(in); + in = 0; + fclose(dump); + dump = 0; + + /* Okay, let's get tricky */ + memcpy(buf, name, namelen); + strcpy(buf+namelen, ".dmp"); + strcpy(filename+namelen, ".in"); + + /* teun: MS Visual C doesn't rename on top of existing file; remove it */ + if (remove(filename) != 0) { + err = errno; + fprintf(stderr, "Warning deleting %s\n", filename); + } + + if (rename(buf, filename) != 0) { + err = errno; + fclose(out); + fclose(log); + /* teun: corrected buf, filename order. This cost me an hour */ + fprintf(stderr, "Error renaming %s -> %s\n", buf, filename); + return err; + } + + /* This code is spaghetti - is there a cleaner way? 
*/ + if (giveup) { + printf("Error in %s, lines %u-%u\n", filename, minline, maxline); + fprintf(log, "Error in %s, lines %u-%u\n", filename, minline, maxline); + if (interrupt > 1) + goto manual; + if (editor) { + if (strcmp(editor, "-") == 0) + goto manual; + sprintf(buf, editor, maxline, filename); + } else { + p = getenv("VISUAL"); + if (!p) + p = getenv("EDITOR"); + if (!p) + goto manual; + sprintf(buf, "%s +%u %s\n", p, maxline, filename); + } + printf("Executing %s\n", buf); + globalEdits++; + if (system(buf) == 0) + goto retry; + fputs("Edit failed - aborting\n", stderr); +manual: + puts("Please fix the error by hand and re-run repair."); + } + + fclose(out); + free(filename); + + fprintf(log, "\n%u lines successfully processed.\n", lines); + fprintf(log, "Overall substitutions (%u pages):\n", pages); + UseStats(globalStats, log); + printf("%u manual edits required\n", globalEdits); + fclose(log); + + return 0; + +error: + err = errno; + if (log) fclose(log); + if (dump) fclose(dump); + if (out) fclose(out); + if (in) fclose(in); + fprintf(stderr, p, filename); + free(filename); + return err; +} + +/* Process the command line, calling RepairFile as needed. */ +int +main(int argc, char *argv[]) +{ + int result = 0; + int i; + char const *editor = NULL; + char const *nastylines = "nastylines"; + + InitUtil(); + SubstBuild(); + memPoolInit(&nastyStructs); + memPoolInit(&nastyStrings); + + /* Process leading flags */ + for (i = 1; i < argc && argv[i][0] == '-'; i++) { + if (argv[i][1] == '-' && argv[i][2] == 0) { + i++; + break; + } else if (argv[i][1] == 'e') { + editor = argv[i][2] ? argv[i]+2 : argv[++i]; + } else if (argv[i][1] == 'l') { + nastylines = argv[i][2] ? argv[i]+2 : argv[++i]; + } else { + editor = argv[i][2] ? 
argv[i]+2 : argv[++i]; + fprintf(stderr, "ERROR: Unrecognized option %s\n", argv[i]); + return 1; + } + } + + /* Process files */ + for (; i < argc; i++) { + result = RepairFile(argv[i], editor, nastylines); + if (result != 0) { + fprintf(stderr, "Fatal error: %s\n", strerror(result)); + return 1; + } + } + + return 0; +} + +/* + * Local Variables: + * tab-width: 4 + * End: + * vi: ts=4 sw=4 + * vim: si + */ diff --git a/tools/sortpages b/tools/sortpages new file mode 100644 index 0000000..29689fd --- /dev/null +++ b/tools/sortpages @@ -0,0 +1,185 @@ +#!/usr/bin/perl +# +# $Id: sortpages,v 1.8 1997/12/11 19:20:58 mhw Exp $ +# + +@fileNameFromNumber = (); +@pagesFound = (); +$theProductNumber = 0; + +for $fileIndex (0..$#ARGV) +{ + $fileName = $ARGV[$fileIndex]; + open(FILE, "<$fileName") || die; + while (!eof(FILE)) + { + $filePos = tell(FILE); + $_ = <FILE>; + if (/^\f?-\S/) + { + my ($versionHex, $flagsHex, $pageCRCHex, $tabWidthHex, + $productNumberHex, $fileNumberHex, $pageNumber, $name) + = (/^\f?-\S\S{4}\ # CRC followed by a space + ([0-9a-f]) # Format version + ([0-9a-f]{2}) # Flags + ([0-9a-f]{8}) # Running CRC32 + ([0-9a-f]) # Tab width (0 means radix64) + ([0-9a-f]{3}) # Product number + ([0-9a-f]{4}) # File number + \ Page\ (\d+)\ of\ (.*)/x); + my $version = hex($versionHex); + my $flags = hex($flagsHex); + my $productNumber = hex($productNumberHex); + my $fileNumber = hex($fileNumberHex); + + unless ($version == 0 && $productNumber > 0 + && $fileNumber > 0 && $pageNumber > 0 + && $name ne "") + { + print STDERR "ERROR: Invalid header info ", + "at $fileName line $.\n"; + exit(1); + } + + if (!defined($fileNameFromNumber[$fileNumber])) + { + $fileNameFromNumber[$fileNumber] = $name; + } + elsif ($fileNameFromNumber[$fileNumber] ne $name) + { + print STDERR "ERROR: Mismatched filename ", + "at $fileName line $.\n"; + exit(1); + } + + if (!$theProductNumber) + { + $theProductNumber = $productNumber; + } + elsif ($theProductNumber != $productNumber) + { + print 
STDERR "ERROR: Different product number ", + "at $fileName line $.\n"; + exit(1); + } + + push @pagesFound, (sprintf "%5d:%4d:%d:%d:%d", + $fileNumber, $pageNumber, $flags, $fileIndex, $filePos); + } + } + close(FILE) || die; +} + +@pagesFound = sort @pagesFound; + +$result = 0; +$lastFileNumber = 0; +$lastPageNumber = 0; +$nextFileNumber = 1; +$nextPageNumber = 1; +$fileIndexOpen = -1; +foreach (@pagesFound) +{ + my ($fileNumber, $pageNumber, $flags, $fileIndex, $filePos) = split /:/; + + $fileNumber = int($fileNumber); + $pageNumber = int($pageNumber); + + if ($fileNumber == $lastFileNumber && $pageNumber == $lastPageNumber) + { + print STDERR "DUPLICATE: File $fileNumber, page $pageNumber, skipped\n"; + next; + } + + if ($nextFileNumber < $fileNumber && $nextPageNumber != 1) + { + print STDERR "MISSING: File $nextFileNumber, ", + "pages $nextPageNumber - END\n"; + $nextPageNumber = 1; + $nextFileNumber++; + $result = 1; + } + if ($nextFileNumber < $fileNumber) + { + print STDERR "MISSING: Files $nextFileNumber - ", + $fileNumber-1, "\n"; + $nextFileNumber = $fileNumber; + $nextPageNumber = 1; + $result = 1; + } + if ($nextFileNumber != $fileNumber) + { + print STDERR "ERROR: Internal error, unexpected fileNumber\n"; + exit(1); + } + + if ($nextPageNumber < $pageNumber) + { + print STDERR "MISSING: File $fileNumber, pages $nextPageNumber - ", + $pageNumber-1, "\n"; + $nextPageNumber = $pageNumber; + $result = 1; + } + if ($nextPageNumber != $pageNumber) + { + print STDERR "ERROR: Internal error, unexpected pageNumber\n"; + exit(1); + } + + if ($fileIndexOpen != $fileIndex) + { + if ($fileIndexOpen >= 0) + { + close(FILE) || die; + $fileIndexOpen = -1; + } + $fileName = $ARGV[$fileIndex]; + open(FILE, "<$fileName") || die; + $fileIndexOpen = $fileIndex; + } + seek(FILE, $filePos, 0) || die($!); + + $_ = <FILE>; + print; + while (<FILE>) + { + last if /^\f?-\S/; + print; + } + $lastFileNumber = $fileNumber; + $lastPageNumber = $pageNumber; + + if ($flags & 1) # Bit 0 of flags 
indicates last page of file + { + $nextFileNumber++; + $nextPageNumber = 1; + } + else + { + $nextPageNumber++; + } +} + +if ($nextPageNumber != 1) +{ + print STDERR "MISSING: File $nextFileNumber, ", + "pages $nextPageNumber - END\n"; + $nextPageNumber = 1; + $nextFileNumber++; + $result = 1; +} + +print STDERR "Highest file number encountered: ", $nextFileNumber - 1, "\n"; + +if ($fileIndexOpen >= 0) +{ + close(FILE) || die; + $fileIndexOpen = -1; +} + +exit($result); + +# +# vi: ai ts=4 +# vim: si +# diff --git a/tools/subst.c b/tools/subst.c new file mode 100644 index 0000000..76dfe13 --- /dev/null +++ b/tools/subst.c @@ -0,0 +1,222 @@ +/* + * subst.c -- Repair substitution tables + * + * Copyright (C) 1997 Pretty Good Privacy, Inc. + * + * Written by Colin Plumb + * + * $Id: subst.c,v 1.14 1997/11/03 22:12:00 colin Exp $ + * + * IT IS EXPECTED that users of this program will play with these tables + * and the cost values in the subst.h header. (Some day, they'll all + * get moved to an external config file.) + * + * NOTE: Other costs are hiding in the Filter functions in repair.c. + * Remember to keep them all on the same scale. + */ + +/* + * The repair program copies its input to its output, making various + * substitutions, until it manages to produce a version that satisfies + * the parser. This includes having a correct CRC for each line. + * Each substitution has a cost, and the combinations are tried in order + * of increasing cost. NOTE that even translating "A"->"A" counts as + * a substitution, although it may have zero cost. + * + * The intention is to correct transcription errors, where the + * errors have a distinctly non-uniform distribution. Slight + * differences in cost produce a preference in trying some errors + * first. If an error costs half as much as another, combinations + * of two of that error will be compared to one of the more expensive. 
+ * Too many cheap substitutions will result in repair spending + * a very long time searching before considering the more expensive + * substitutions. + * + * The following parameters and the raw substitution tables are expected + * to be edited by the user based on experience. Eventually, this + * will be moved into an external config file, but for now it's a matter + * of recompiling. + */ + +#include "subst.h" +#include "util.h" + +/* what the OCR software reports for "unrecognizable" */ +#define UNRECOG_STRING "~\274" + +/* + * The input substitutions to make (one-to-one). These are listed in + * the order of correction, i.e. uncorrected input first, then corrected + * output. Substitutions are one-way; to get two-way, list it twice. + */ + +struct RawSubst const substSingles[] = { + /* Identity substitutions - note that period (.) is excluded */ + { "!\"#$%&'()*+,-./0123456789:;<=>?" SPACE_STRING, + "!\"#$%&'()*+,-./0123456789:;<=>?" SPACE_STRING, 0, 0, NULL }, + { "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\t" TAB_STRING, + "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\t" TAB_STRING, 0, 0, NULL }, + { "`abcdefghijklmnopqrstuvwxyz{|}~\f" FORMFEED_STRING, + "`abcdefghijklmnopqrstuvwxyz{|}~\f" FORMFEED_STRING, 0, 0, NULL }, +#if (TAB_PAD_CHAR & 128) /* Not already included? */ + { TAB_PAD_STRING, TAB_PAD_STRING, 0, 0, NULL }, +#endif + { "\r\n" CONTIN_STRING, "\n\n" CONTIN_STRING, 0, 0, NULL }, + + /* Occasionally these just get inserted as glitches */ + { ".,'`", NULL, 5, 10, FilterNearBlanks }, + /* This is now pretty infrequent */ + { "-_", "_-", 0, 10, FilterAfterRepeat }, + + /* + * Capitalization errors are common in some cases + * c/C, s/S, u/U are fucked up all the time. + * Also o/O, v/V and w/W. x, y and z also give some problems. 
+ */ + { "cilmopsuvwxyz", "CILMOPSUVWXYZ", 7, 13, FilterNearLower }, + { "CILMOPSUVWXYZ", "cilmopsuvwxyz", 7, 13, FilterNearUpper }, + /* Other errors */ + { "g9aaiji;xX00Si", "9gg2ji;i%%oO3f", 10, 0, NULL }, + /* This seems to happen a lot */ + { "c", "r", 9, 0, NULL }, + + { "j", ";", 9, 0, NULL }, + { "' ", "``", 10, 0, NULL }, + + /* Uncommon errors */ + + /* Wierd stuff that's happened in the checksum part */ + /* A highish weight is okay here */ + { "sSEdJl", "554437", 15, 0, NULL }, + { "LESsPZ", "bb8a22", 15, 0, NULL }, + + /* Wierd stuff that has happened */ + { "BasAeaeRoooo", "3334a@QQpqbd", 5, 15, FilterIsBinary }, + { "oooo", "pqbd", 0, 15, FilterIsBinary }, + { "ttTCCflO", "iff{[lfG", 12, 0, NULL }, +#if 0 + /* If the line-breaks get screwed up, use these */ + { " ", "\n", 10, COST_INFINITY, FilterChecksumFollows }, + { "\n", " ", COST_INFINITY, 10, FilterChecksumFollows }, + { "\n", NULL, COST_INFINITY , 11, FilterChecksumFollows }, +#endif + +{ NULL, NULL, 0, 0, NULL } +}; + +/* The many-to-many substitutions */ +struct RawSubst const substMultiples[] = { + { "''", "\"", 2, 0, NULL }, + { "``", "\"", 2, 0, NULL }, + { ",'", "\"", 2, 0, NULL }, + { "',", "\"", 2, 0, NULL }, + { ",,", "\"", 2, 0, NULL }, + /* Extra inserted spaces are common */ + { " ", " ", COST_INFINITY, 0, FilterFollowsSpace }, + { " ", "", 0, 15, FilterFollowsSpace }, + { "\t", " ", COST_INFINITY, 0, FilterFollowsSpace }, + { "\t", "", 0, 10, FilterFollowsSpace }, + /* Convert between SPACE_CHAR dots and periods */ + { ".", SPACE_STRING, 1, COST_INFINITY, FilterFollowsSpace }, + { ".", " "SPACE_STRING, COST_INFINITY, 10, FilterFollowsSpace }, + { SPACE_STRING, ".", 15, 5, FilterFollowsSpace }, + { SPACE_STRING, " "SPACE_STRING, COST_INFINITY, 5, FilterFollowsSpace }, + + /* Replace "unknown" by zero - it often is */ + { UNRECOG_STRING, "0", 1, 0, NULL }, + { UNRECOG_STRING, "_", 2, 0, NULL }, + { UNRECOG_STRING, ")", 3, 0, NULL }, + { UNRECOG_STRING, "^", 4, 0, NULL }, + /* Except 
that these glitches are common */ + { UNRECOG_STRING"'", "\\\"", 0, 0, NULL }, + { UNRECOG_STRING"'", "\"", 1, 0, NULL }, + { "'"UNRECOG_STRING, "\"", 0, 0, NULL }, + { UNRECOG_STRING UNRECOG_STRING , "\"", 0, 0, NULL }, + /* Something else that has been seen */ + { "V'", "\\\"", 5, 0, NULL }, + + /* A common transposition */ + { "\"'", "'\"", 5, 0, NULL }, + { "'\"", "\"'", 5, 0, NULL }, + /* These also happen fairly often */ + { " \"", "''", 5, 0, NULL }, + { "\" ", "''", 5, 0, NULL }, + + /* Common glitches */ + { "\t.\n", "\n", 5, 0, NULL }, + { "\t,\n", "\n", 5, 0, NULL }, + { "\t-\n", "\n", 5, 0, NULL }, + { "\t_\n", "\n", 5, 0, NULL }, + { "\t'\n", "\n", 5, 0, NULL }, + { "\t`\n", "\n", 5, 0, NULL }, + { "\t~\n", "\n", 5, 0, NULL }, + { "\t:\n", "\n", 5, 0, NULL }, + { "\t"SPACE_STRING"\n", "\n", 5, 0, NULL }, + + /* Less common */ + { " .\n", "\n", 10, 0, NULL }, + { " ,\n", "\n", 10, 0, NULL }, + { " -\n", "\n", 10, 0, NULL }, + { " _\n", "\n", 10, 0, NULL }, + { " '\n", "\n", 10, 0, NULL }, + { " `\n", "\n", 10, 0, NULL }, + { " ~\n", "\n", 10, 0, NULL }, + { " :\n", "\n", 10, 0, NULL }, + { " "SPACE_STRING"\n", "\n", 10, 0, NULL }, + + /* Even less common */ + { ".\n", "\n", 15, 0, NULL }, + { ",\n", "\n", 15, 0, NULL }, + { "-\n", "\n", 15, 0, NULL }, + { "_\n", "\n", 15, 0, NULL }, + { "'\n", "\n", 15, 0, NULL }, + { "`\n", "\n", 15, 0, NULL }, + { "~\n", "\n", 15, 0, NULL }, + { ":\n", "\n", 15, 0, NULL }, + { SPACE_STRING"\n", "\n", 15, 0, NULL }, + + /* Wierd stuff that has happened */ + { "lJ", "U", 10, 0, NULL }, + { "ll", "U", 10, 0, NULL }, + { "l1", "U", 10, 0, NULL }, + { "il", "U", 10, 0, NULL }, /* Fairly common, actually */ + { "li", "U", 10, 0, NULL }, + { "l)", "U", 10, 0, NULL }, + { "Ll", "U", 10, 0, NULL }, + { "LI", "U", 10, 0, NULL }, + { "L1", "U", 10, 0, NULL }, + + { "lo", "b", 10, 0, NULL }, + { "cl", "d", 10, 0, NULL }, + { "cliff", "diff", 2, 0, NULL }, + { "*\n", "*/\n", 10, 0, NULL }, + + /* That big black block has odd 
things happen to it */ + { "d", CONTIN_STRING, 10, 0, NULL }, + { "d\n", CONTIN_STRING"\n", 3, 0, NULL }, + { "S", CONTIN_STRING, 10, 0, NULL }, + { "S\n", CONTIN_STRING"\n", 3, 0, NULL }, + + /* Tab-stop wonders */ + { TAB_STRING, TAB_STRING"", 0, 0, TabFilter }, + { TAB_STRING, TAB_STRING" ", 0, 0, TabFilter }, + { TAB_STRING, TAB_STRING" ", 0, 0, TabFilter }, + { TAB_STRING, TAB_STRING" ", 0, 0, TabFilter }, + { TAB_STRING, TAB_STRING" ", 0, 0, TabFilter }, + { TAB_STRING, TAB_STRING" ", 0, 0, TabFilter }, + { TAB_STRING, TAB_STRING" ", 0, 0, TabFilter }, + { TAB_STRING, TAB_STRING" ", 0, 0, TabFilter }, + /* Some scan errors */ + { "D ", TAB_STRING"", 1, 5, TabFilter }, + { "D ", TAB_STRING" ", 1, 5, TabFilter }, + { "D ", TAB_STRING" ", 1, 5, TabFilter }, + { "D ", TAB_STRING" ", 1, 5, TabFilter }, + { "D ", TAB_STRING" ", 1, 5, TabFilter }, + { "D ", TAB_STRING" ", 1, 5, TabFilter }, + { "D ", TAB_STRING" ", 1, 5, TabFilter }, + { "D ", TAB_STRING" ", 1, 5, TabFilter }, +#if TAB_PAD_CHAR != ' ' +#error Fix those tab patterns! +#endif +{ NULL, NULL, 0, 0, NULL } +}; diff --git a/tools/subst.h b/tools/subst.h new file mode 100644 index 0000000..79005c3 --- /dev/null +++ b/tools/subst.h @@ -0,0 +1,66 @@ +/* + * subst.h -- Header for repair substitutions + * + * Copyright (C) 1997 Pretty Good Privacy, Inc. + * + * Written by Colin Plumb + * + * $Id: subst.h,v 1.9 1997/11/03 22:12:00 colin Exp $ + */ + +/* + * Give up if the list of pending changes to attempt grows to this many + * elements. Each element is 32 bytes, so 128K is 8 MB of memory. + * (Other than this, repair's memory usage is fairly modest.) + */ +#define MAX_HEAP (1<<17) + +/* + * There is a hack in the code to find a single substitution that will fix a + * line, even if it's not in the tables. 
It gets added to the tables "on + * probation", with an infinite cost, and if it leads to a successful + * correction of the entire page, is "learned" for future use and its + * cost reduced to something finite. + * (This is not remembered across runs of the program, though. + * Edit the tables in the source to fix it.) + */ +#define DYNAMIC_COST_LEARNED 15 + +/* + * This negative-cost bonus for passing the end of a line with the right + * CRC makes the search engine reluctant to backtrack past a correct CRC, + * greatly improving efficiency. It's rather a hack, though. Think of + * this in terms of "how many errors should be considered in the current + * line before considering the possibility of errors in the previous line?" + * + * This bonus is halved for lines that are the result of a correction + * that was computed from the checksum, since a correct checksum is + * much less significant in such a case. + */ +#define COST_LINE -30 + +/* The cost of a full-line nastyline substitution. 
*/ +#define NASTY_COST 5 + +/* Type describing filter functions used in substitutions */ +struct ParseNode; +struct Substitution; +#include "heap.h" +typedef HeapCost FilterFunc(struct ParseNode *parent, char const *limit, + struct Substitution const *subst); +FilterFunc TabFilter, FilterFollowsSpace, FilterNearBlanks; +FilterFunc FilterNearUpper, FilterNearLower, FilterNearXDigit; +FilterFunc FilterAfterRepeat, FilterCharConst, FilterChecksumFollows; +FilterFunc FilterLikelyUnderscore, FilterIsDynamic, FilterIsBinary; + +/* The external substitution format */ +typedef struct RawSubst { + char const *input; + char const *output; + HeapCost cost, cost2; + FilterFunc *filter; +} RawSubst; + +/* The substitutions to make */ +extern struct RawSubst const substSingles[]; +extern struct RawSubst const substMultiples[]; diff --git a/tools/unmunge.c b/tools/unmunge.c new file mode 100644 index 0000000..831297e --- /dev/null +++ b/tools/unmunge.c @@ -0,0 +1,666 @@ +/* + * unmunge.c -- Program to convert a munged file to original form + * + * Copyright (C) 1997 Pretty Good Privacy, Inc. + * + * Designed by Colin Plumb, Mark H. Weaver, and Philip R. Zimmermann + * Written by Mark H. 
Weaver + * + * $Id: unmunge.c,v 1.13 1997/11/13 23:27:08 mhw Exp $ + */ + +#include +#include +#include +#include + +/*#include <direct.h> teun: MS VC wants direct.h for mkdir */ + +#include +#include +#include +#include +#include +#include + +#include "util.h" + +typedef struct UnMungeState +{ + char const * mungedFileName; + char dirName[128]; + char fileName[128]; + char * fileNameTail; + int binaryMode, tabWidth; + long productNumber, fileNumber, pageNumber, lineNumber; + long manifestLineNumber; + word16 hdrFlags; + CRC pageCRC, seenPageCRC; + FILE * manifest; + FILE * file; + FILE * out; +} UnMungeState; + + +/* Returns number of characters decoded, or -1 on error */ +static int +Decode4(char const src[4], byte dest[3]) +{ + int i, length; + byte srcVal[4]; + + for (i = 0; i < 4 && src[i] != RADIX64_END_CHAR; i++) + if ((srcVal[i] = Radix64DigitValue(src[i])) == (byte) -1) + return -1; /* Invalid radix-64 digit */ + + length = i - 1; + if (length < 1) + return -1; + + /* Zero-fill the digits that follow the end marker */ + for (; i < 4; i++) + srcVal[i] = 0; + + dest[0] = (srcVal[0] << 2) | (srcVal[1] >> 4); + dest[1] = (srcVal[1] << 4) | (srcVal[2] >> 2); + dest[2] = (srcVal[2] << 6) | (srcVal[3]); + + return length; +} + +/* + * Return number of characters decoded, or -1 on error + */ +static int +DecodeLine(char const *src, char *dest, int srclength) +{ + int destlength = 0; + int result; + + if (srclength % 4 || !srclength) + return -1; /* Must be a multiple of 4 */ + + while (srclength -= 4) { + if (Decode4(src, dest + destlength) != 3) + return -1; + src += 4; + destlength += 3; + } + result = Decode4(src, dest + destlength); + if (result < 1) + return -1; + return destlength + result; +} + +int PrintFileError(UnMungeState *state, char const *message) +{ + fprintf(stderr, "%s, %s line %ld\n", message, + state->mungedFileName, state->lineNumber); + return 1; +} + +int ReadManifest(UnMungeState *state, long fileNumberWanted, + char const *fileTailPrefix, long prefixLen) +{ + long fileNumber = 0; + long firstMissingFileNum = 0, lastMissingFileNum = 0; 
+ char buffer[512]; + char * p; + + if (state->manifest == NULL) + { + if (fileNumberWanted != 0) + { + assert(fileTailPrefix != NULL); + strncpy(state->fileName, fileTailPrefix, sizeof(state->fileName)); + state->fileName[sizeof(state->fileName) - 1] = '\0'; + state->fileNameTail = state->fileName; + } + return 0; + } + while (fgets(buffer, sizeof(buffer), state->manifest)) + { + if ((p = strchr(buffer, '\n')) != NULL) + *p = '\0'; + state->manifestLineNumber++; + if (buffer[0] == 'D') + { + if (buffer[1] != ' ') + goto invalidManifest; + strncpy(state->dirName, buffer + 2, sizeof(state->dirName)); + if (state->dirName[sizeof(state->dirName) - 1] != '\0') + goto invalidManifest; + } + else + { + fileNumber = strtol(buffer, &p, 10); + if (p == buffer || *p != ' ') + goto invalidManifest; + p++; + + if (fileNumberWanted == 0 || fileNumber < fileNumberWanted) + { + if (firstMissingFileNum == 0) + firstMissingFileNum = fileNumber; + lastMissingFileNum = fileNumber; + continue; + } + else if (fileNumber > fileNumberWanted) + break; + else + { + size_t len; + + len = strlen(state->dirName); + assert(sizeof(state->fileName) >= sizeof(state->dirName)); + memcpy(state->fileName, state->dirName, len); + strncpy(state->fileName + len, p, + sizeof(state->fileName) - len); + if (strncmp(p, fileTailPrefix, prefixLen) != 0) + { + fprintf(stderr, "Mismatched filename, headers say '%s',\n" + " manifest says '%s'\n", + fileTailPrefix, p); + return 1; + } + p = state->dirName; + while ((p = strchr(p, '/')) != NULL) + { + *p = '\0'; + mkdir(state->dirName, 0777); + *p++ = '/'; + } + state->fileNameTail = state->fileName + len; + break; + } + } + } + if (firstMissingFileNum != 0) + { + fprintf(stderr, "Missing files %ld-%ld\n", + firstMissingFileNum, lastMissingFileNum); + } + if (fileNumberWanted != 0 && fileNumber != fileNumberWanted) + { + fprintf(stderr, "Can't find file %ld in manifest file\n", + fileNumberWanted); + return 1; + } + return 0; + +invalidManifest: + fprintf(stderr, 
"Error parsing manifest file, line %ld\n", + state->manifestLineNumber); + return 1; +} + +int UnMungeFile(char const *mungedFileName, char const *manifestFileName, + int forceOverwrite, int forcePartialFiles) +{ + UnMungeState * state; + EncodeFormat const * fmt = NULL; + char buffer[512]; + char outbuf[BYTES_PER_LINE+1]; + char * line; + char * lineData; + char * p; + int length; + int result = 0; + int skipPage = 0; + CRC lineCRC; + word32 num; + + state = (UnMungeState *)calloc(1, sizeof(*state)); + state->mungedFileName = mungedFileName; + + if (manifestFileName != NULL) + { + if ((state->manifest = fopen(manifestFileName, "r")) == NULL) + goto errnoError; + } + + if ((state->file = fopen(state->mungedFileName, "r")) == NULL) + goto errnoError; + + while (!feof(state->file)) + { + if (fgets(buffer, sizeof(buffer), state->file) == NULL) + { + if (feof(state->file)) + break; + goto fileError; + } + + state->lineNumber++; + + line = buffer; + /* Strip leading whitespace */ + while (isspace(*line)) + line++; + if (*line == '\0') + continue; + + /* Strip trailing whitespace */ + p = line + strlen(line); + while (p > line && (byte)p[-1] < 128 && isspace(p[-1])) + p--; + + lineData = line + PREFIX_LENGTH; + + /* Pad up to at least PREFIX_LENGTH */ + while (p < lineData) + *p++ = ' '; + *p++ = '\n'; + *p = '\0'; + length = p - lineData; + + if (line[0] == HDR_PREFIX_CHAR) + { + fmt = FindFormat(line[1]); + if (!fmt) + { + result = PrintFileError(state, "ERROR: Invalid header type"); + goto error; + } + } + + lineCRC = CalculateCRC(fmt->lineCRC, 0, (byte const *)lineData, length); + + p = line + EncodedLength(fmt, fmt->runningCRCBits); + if (DecodeCheckDigits(fmt, p, NULL, fmt->lineCRC->bits, &num) + || lineCRC != num) + { + result = PrintFileError(state, "ERROR: Line CRC failed"); + goto error; + } + + if (line[0] == HDR_PREFIX_CHAR) + { + int formatVersion; + int flags; + CRC seenPageCRC; + int tabWidth; + long productNumber; + long fileNumber; + long pageNumber; + 
char * fileNameTail; + int skipNextPage = 0; + char * p; + EncodeFormat const * hFmt = &hexFormat; + + /* Parse header line */ + p = lineData; + + if (DecodeCheckDigits(hFmt, p, &p, HDR_VERSION_BITS, &num)) + { + invalidHeader: + result = PrintFileError(state, "ERROR: Invalid header"); + goto error; + } + formatVersion = num; + + if (DecodeCheckDigits(hFmt, p, &p, HDR_FLAG_BITS, &num)) + goto invalidHeader; + flags = num; + + if (DecodeCheckDigits(hFmt, p, &p, fmt->pageCRC->bits, &num)) + goto invalidHeader; + seenPageCRC = num; + + if (DecodeCheckDigits(hFmt, p, &p, HDR_TABWIDTH_BITS, &num)) + goto invalidHeader; + tabWidth = num; + + if (DecodeCheckDigits(hFmt, p, &p, HDR_PRODNUM_BITS, &num)) + goto invalidHeader; + productNumber = num; + + if (DecodeCheckDigits(hFmt, p, &p, HDR_FILENUM_BITS, &num)) + goto invalidHeader; + fileNumber = num; + + if (sscanf(p, " Page %ld of ", &pageNumber) < 1) + goto invalidHeader; + + if (formatVersion > 0) + { + result = PrintFileError(state, + "ERROR: Format too new for " + "this version of unmunge"); + goto error; + } + + p = strstr(p, " of "); + if (p == NULL) + goto invalidHeader; + + fileNameTail = p + 4; + p = fileNameTail + strlen(fileNameTail); + if (p < fileNameTail + 3 || p[-1] != '\n') + goto invalidHeader; + else + p[-1] = '\0'; + + if (state->out != NULL && state->pageCRC != state->seenPageCRC) + { + result = PrintFileError(state, + "ERROR: Page CRC mismatch on page before"); + goto error; + } + + if ((state->hdrFlags & HDR_FLAG_LASTPAGE) && state->out != NULL) + { + fclose(state->out); + state->out = NULL; + } + + if (state->out != NULL) + { + if (pageNumber != state->pageNumber + 1 || + fileNumber != state->fileNumber || + productNumber != state->productNumber || + tabWidth != state->tabWidth || + strcmp(fileNameTail, state->fileNameTail) != 0) + { + if (fileNumber == state->fileNumber && + pageNumber > state->pageNumber + 1) + { + (void)PrintFileError(state, + "ERROR: Missing pages of this file"); + if 
(forcePartialFiles && !state->binaryMode) + { + fputs("\n\n@@@@@@ Missing pages here! @@@@@@\n\n", + state->out); + } + else + { + skipNextPage = 1; + fclose(state->out); + state->out = NULL; + remove(state->fileName); + } + } + else + { + (void)PrintFileError(state, + "ERROR: Missing pages of previous file"); + if (forcePartialFiles && !state->binaryMode) + { + fputs("\n\n@@@@@@ Missing pages here! @@@@@@\n\n", + state->out); + /* Make it non-fatal, though... */ + fclose(state->out); + state->out = NULL; + } + else + { + fclose(state->out); + state->out = NULL; + remove(state->fileName); + } + } + } + } + if (state->out == NULL) + { + if (pageNumber != 1 && !skipPage) + (void)PrintFileError(state, + "ERROR: File doesn't begin with page 1"); + + state->binaryMode = (tabWidth == 0); + + if (pageNumber != 1 && (state->binaryMode + || !forcePartialFiles)) + { + skipNextPage = 1; + } + else + { + /* TODO: Use global filelist to get pathname */ + result = ReadManifest(state, fileNumber, fileNameTail, + strlen(fileNameTail)); + if (result != 0) + goto error; + + if (!forceOverwrite) + { + FILE * file; + + /* Make sure file doesn't already exist */ + file = fopen(state->fileName, "r"); + if (file != NULL) + { + fclose(file); + fprintf(stderr, "ERROR: %s already exists\n", + state->fileName); + result = 1; + goto error; + } + } + + state->out = fopen(state->fileName, + state->binaryMode ? "wb" : "w"); + if (state->out == NULL) + goto errnoError; + + if (pageNumber != 1) + fputs("\n\n@@@@@@ Missing pages here! 
@@@@@@\n\n", + state->out); + } + } + + state->pageCRC = 0; + state->seenPageCRC = seenPageCRC; + state->hdrFlags = (word16)flags; + state->pageNumber = pageNumber; + state->fileNumber = fileNumber; + state->productNumber = productNumber; + state->tabWidth = tabWidth; + skipPage = skipNextPage; + } + else if (!skipPage) + { + if (state->out == NULL) + { + result = PrintFileError(state, "ERROR: Missing header line"); + goto error; + } + + /* Normal data line */ + state->pageCRC = CalculateCRC(fmt->pageCRC, state->pageCRC, + (byte const *)lineData, + length); + line[2] = '\0'; + if (DecodeCheckDigits(fmt, line, NULL, fmt->runningCRCBits, &num) + || RunningCRCFromPageCRC(fmt, state->pageCRC) != num) + { + result = PrintFileError(state, "ERROR: Running CRC failed"); + goto error; + } + + if (state->binaryMode) + { + length = DecodeLine(lineData, outbuf, length-1); + if (length < 0 || length > BYTES_PER_LINE) { + result = PrintFileError(state, + "ERROR: Corrupt radix-64 data"); + goto error; + } + fwrite(outbuf, 1, length, state->out); + } + else + { + p = lineData; + while (*p != '\0') + { + if (*p == TAB_CHAR) + { + p++; + putc('\t', state->out); + while ((p - lineData) % state->tabWidth) + { + if (*p == '\n') + break; + else if (*p == ' ') + p++; + else + { + result = PrintFileError(state, + "ERROR: Not enough spaces " + "after a tab character"); + goto error; + } + } + } + else if (*p == FORMFEED_CHAR) + { + p++; + if (*p != '\n') + { + result = PrintFileError(state, + "ERROR: Formfeed character " + "not at end of line"); + goto error; + } + p++; /* Skip newline */ + putc('\f', state->out); + } + else if (*p == CONTIN_CHAR) + { + p++; + if (*p != '\n') + { + result = PrintFileError(state, + "ERROR: Continuation character " + "not at end of line"); + goto error; + } + p++; /* Skip newline */ + } + else if (*p == SPACE_CHAR) + { + putc(' ', state->out); + p++; + } + else + { + putc(*p, state->out); + p++; + } + } + } + } + } + if (state->out != NULL) + { + if 
(!(state->hdrFlags & HDR_FLAG_LASTPAGE)) + { + result = PrintFileError(state, "ERROR: Missing pages"); + goto error; + } + if (state->pageCRC != state->seenPageCRC) + { + result = PrintFileError(state, + "ERROR: Page CRC failed on previous page"); + goto error; + } + } + + /* Check for missing files at the end */ + result = ReadManifest(state, 0, NULL, 0); + goto done; + +errnoError: + result = errno; + goto printError; + +fileError: + result = ferror(state->file); + +printError: + fprintf(stderr, "ERROR: %s\n", strerror(result)); + +error: +done: + if (state != NULL) + { + if (state->out != NULL) + fclose(state->out); + if (state->file != NULL) + fclose(state->file); + if (state->manifest != NULL) + fclose(state->manifest); + free(state); + } + return result; +} + +void UsageAndExit(int result) +{ + fprintf(stderr, + "Usage: unmunge [-fp] []\n" + " -f Force overwrites of existing files\n" + " -p Force unmunge of partial files\n"); + exit(result); +} + +int main(int argc, char *argv[]) +{ + int result = 0; + int forceOverwrite = 0; + int forcePartialFiles = 0; + char * fileName = NULL; + char * manifestFileName = NULL; + int i, j; + + InitUtil(); + + for (i = 1; i < argc && argv[i][0] == '-'; i++) + { + if (0 == strcmp(argv[i], "--")) + { + i++; + break; + } + for (j = 1; argv[i][j] != '\0'; j++) + { + if (argv[i][j] == 'h') + UsageAndExit(0); + else if (argv[i][j] == 'f') + forceOverwrite = 1; + else if (argv[i][j] == 'p') + forcePartialFiles = 1; + else + { + fprintf(stderr, "ERROR: Unrecognized option -%c\n", argv[i][j]); + UsageAndExit(1); + } + } + } + + if (i < argc) + fileName = argv[i++]; + if (i < argc) + manifestFileName = argv[i++]; + if (fileName == NULL || i < argc) + UsageAndExit(1); + + if ((result = UnMungeFile(fileName, manifestFileName, + forceOverwrite, forcePartialFiles)) != 0) + { + /* If result > 0, message should have already been printed */ + if (result < 0) + fprintf(stderr, "ERROR: %s\n", strerror(result)); + exit(1); + } + + return 0; +} 
+ +/* + * Local Variables: + * tab-width: 4 + * End: + * vi: ts=4 sw=4 + * vim: si + */ + diff --git a/tools/util.c b/tools/util.c new file mode 100644 index 0000000..f487436 --- /dev/null +++ b/tools/util.c @@ -0,0 +1,198 @@ +/* + * util.c -- Miscellaneous shared code/data + * + * Copyright (C) 1997 Pretty Good Privacy, Inc. + * + * Written by Mark H. Weaver + * + * $Id: util.c,v 1.11 1997/11/07 00:44:10 mhw Exp $ + */ + +#include +#include "util.h" + +char const hexDigits[] = "0123456789abcdef"; +char const radix64Digits[] = +#if 0 /* Standard */ + "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"; +#else /* Modified form that avoids hard-to-OCR characters */ + "ABCDEFGHIJKLMNPQRSTVWXYZabcdehijklmnpqtuwy145689\\^!#$%&*+=/:<>?@"; +#endif + +signed char hexDigitsInv[256]; +signed char radix64DigitsInv[256]; + +/* teun: moved initialisation of all three CRCPoly's to InitUtil() */ + +/* CRC-CCITT: x^16 + x^12 + x^5 + 1 */ +CRCPoly crcCCITTPoly; +/* + * PRZ's magic 24-bit polynomial - (x+1) * (irreducible of degree 23) + * x^24 +x^23 +x^18 +x^17 +x^14 +x^11 +x^10 +x^7 +x^6 +x^5 +x^4 +x^3 +x +1 + * (Developed by Neal Glover). Note: this is bit-reversed from the form + * used in PGP, 0x1864cfb. 
+ */ +CRCPoly crc24Poly; +/* CRC-32: x^32+x^26+x^23+x^22+x^16+x^12+x^11+x^10+x^8+x^7+x^5+x^4+x^2+x+1 */ +CRCPoly crc32Poly; + +EncodeFormat const hexFormat = +{ + NULL, /* nextFormat */ + '-', /* headerTypeChar */ + hexDigits, /* digits */ + hexDigitsInv, /* digitsInv */ + 4, /* bitsPerDigit */ + 16, /* radix */ + &crcCCITTPoly, /* lineCRC */ + &crc32Poly, /* pageCRC */ + 8, /* runningCRCBits */ + 24, /* runningCRCShift */ + 0xFF /* runningCRCMask */ +}; + +EncodeFormat const radix64Format = +{ + &hexFormat, /* nextFormat */ + 'A', /* headerTypeChar */ + radix64Digits, /* digits */ + radix64DigitsInv, /* digitsInv */ + 6, /* bitsPerDigit */ + 64, /* radix */ + &crc24Poly, /* lineCRC */ + &crc32Poly, /* pageCRC */ + 12, /* runningCRCBits */ + 20, /* runningCRCShift */ + 0xFFF /* runningCRCMask */ +}; + +EncodeFormat const * firstFormat = &radix64Format; + + +static void InitCRCPoly(CRCPoly *poly) +{ + int i, oneBit; + CRC crc = 1; + + poly->table[0] = 0; + for (oneBit = 0x80; oneBit > 0; oneBit >>= 1) { + crc = (crc >> 1) ^ ((crc & 1) ? 
poly->poly : 0); + for (i = 0; i < 0x100; i += 2 * oneBit) + poly->table[i + oneBit] = poly->table[i] ^ crc; + } +} + +CRC CalculateCRC(CRCPoly const *poly, CRC crc, + byte const *buffer, size_t length) +{ + while (length--) + crc = (crc >> 8) ^ poly->table[(crc & 0xFF) ^ (*buffer++)]; + return crc; +} + +CRC ReverseCRC(CRCPoly const *poly, CRC crc, byte b) +{ + int i, highBit = poly->highBit; + + for (i = 0; i < 8; i++) { + if (crc & highBit) /* highBit is 2^(poly->bits-1) */ + crc = ((crc ^ poly->poly) << 1) ^ 1; + else + crc <<= 1; + } + return crc ^ b; +} + +static void InitDigitsInv(char const *digits, signed char *digitsInv) +{ + int i; + + for (i = 0; i < 256; i++) + digitsInv[i] = -1; + for (i = 0; digits[i]; i++) + digitsInv[(byte)digits[i]] = i; +} + +/* Returns the number of chars encoded */ +int EncodeCheckDigits(EncodeFormat const *fmt, word32 num, + int numBits, char *dest) +{ + int destLen = EncodedLength(fmt, numBits); + word32 digitMask = fmt->radix - 1; + int i; + + for (i = destLen - 1; i >= 0; i--) + { + dest[i] = EncodeDigit(fmt, num & digitMask); + num >>= fmt->bitsPerDigit; + } + return destLen; +} + +/* Returns 1 if there's an error */ +int DecodeCheckDigits(EncodeFormat const *fmt, char const *src, char **endPtr, + int numBits, word32 *valuePtr) +{ + word32 value = 0; + int digitValue; + int i = EncodedLength(fmt, numBits); + + while (i--) + { + digitValue = DecodeDigit(fmt, *src++); + if (digitValue < 0) + { + /* Invalid digit found */ + *valuePtr = 0; + if (endPtr) + *endPtr = NULL; + return 1; + } + value = (value << fmt->bitsPerDigit) | digitValue; + } + *valuePtr = value; + if (endPtr) + *endPtr = (char *)src; + return 0; +} + +EncodeFormat const *FindFormat(char headerTypeChar) +{ + EncodeFormat const * fmt = firstFormat; + + while (fmt && fmt->headerTypeChar != headerTypeChar) + fmt = fmt->nextFormat; + return fmt; +} + +void InitUtil() +{ + /* teun: removed "{ }" for MS VC compile */ + + crcCCITTPoly.bits = 16; + crcCCITTPoly.poly = 
0x8408; + crcCCITTPoly.highBit = 0x8000; + + crc24Poly.bits = 24; + crc24Poly.poly = 0xdf3261; + crc24Poly.highBit = 0x800000; + + crc32Poly.bits = 32; + crc32Poly.poly = 0xedb88320; + crc32Poly.highBit = 0x80000000; + + InitCRCPoly(&crcCCITTPoly); + InitCRCPoly(&crc24Poly); + InitCRCPoly(&crc32Poly); + InitDigitsInv(hexDigits, hexDigitsInv); + InitDigitsInv(radix64Digits, radix64DigitsInv); +} + + +/* + * Local Variables: + * tab-width: 4 + * End: + * vi: ts=4 sw=4 + * vim: si + */ diff --git a/tools/util.h b/tools/util.h new file mode 100644 index 0000000..b2e06bd --- /dev/null +++ b/tools/util.h @@ -0,0 +1,149 @@ +/* + * util.h -- Miscellaneous defines + * + * Copyright (C) 1997 Pretty Good Privacy, Inc. + * + * Written by Mark H. Weaver + * + * $Id: util.h,v 1.23 1997/11/12 23:28:56 mhw Exp $ + */ + +#ifndef UTIL_H +#define UTIL_H 1 + +typedef unsigned long word32; +typedef unsigned short word16; +typedef unsigned char byte; + +#define FMT32 "%08lx" +#define FMT16 "%04x" +#define FMT8 "%02x" + +#define TAB_CHAR '\244' /* Currency symbol, like o in top of x */ +#define TAB_STRING "\244" +#define TAB_PAD_CHAR ' ' /* The fact that this is space has leaked. */ +#define TAB_PAD_STRING " " /* It may not be freely changed. 
*/ +#define FORMFEED_CHAR '\245' /* Yen symbol, like = on top of Y */ +#define FORMFEED_STRING "\245" +#define SPACE_CHAR '\267' /* Middle dot, or bullet */ +#define SPACE_STRING "\267" +#define CONTIN_CHAR '\266' /* Pilcrow (paragraph symbol) */ +#define CONTIN_STRING "\266" + +#define BYTES_PER_LINE 60 /* When using radix 64 */ + +#define LINES_PER_PAGE 72 /* Exclusive of 2 header lines */ +#define LINE_LENGTH 80 +#define PREFIX_LENGTH 7 /* Length of prefix, including the space */ + +#define HDR_PREFIX_CHAR '-' +#define RADIX64_END_CHAR '-' + +typedef struct EncodeFormat EncodeFormat; +typedef word32 CRC; +typedef word16 CRCFragment; + +typedef struct +{ + CRC table[256]; + int bits; + CRC poly; + CRC highBit; +} CRCPoly; + +struct EncodeFormat +{ + EncodeFormat const *nextFormat; + char headerTypeChar; + char const * digits; + signed char const * digitsInv; + int bitsPerDigit; + int radix; + CRCPoly const * lineCRC; + CRCPoly const * pageCRC; + int runningCRCBits; + int runningCRCShift; + int runningCRCMask; +}; + + +#define HDR_ENC_LENGTH 19 /* Length of encoded prefix on header */ + +#define HDR_VERSION_BITS 4 +#define HDR_FLAG_BITS 8 +/* Page CRC bits omitted, since it's not constant */ +#define HDR_TABWIDTH_BITS 4 +#define HDR_PRODNUM_BITS 12 +#define HDR_FILENUM_BITS 16 + + +/* Enough to hold one whole page of munged data */ +/* There is no point making this excessively too large */ +#define PAGE_BUFFER_SIZE 8192 + +#if PAGE_BUFFER_SIZE < (LINES_PER_PAGE + 2) * (LINE_LENGTH + PREFIX_LENGTH + 2) +#error PAGE_BUFFER_SIZE is too small +#endif + + +/* Header flags */ +#define HDR_FLAG_LASTPAGE 0x01 /* Indicates last page of file */ + + +#define elemsof(array) (sizeof(array)/sizeof(*(array))) + + +extern char const hexDigits[]; +extern char const radix64Digits[]; + +extern signed char hexDigitsInv[256]; +extern signed char radix64DigitsInv[256]; + +extern CRCPoly crcCCITTPoly, crc24Poly, crc32Poly; + +extern EncodeFormat const hexFormat, radix64Format; +extern 
EncodeFormat const * firstFormat; + + +#define HexDigitValue(ch) hexDigitsInv[(byte)(ch)] +#define Radix64DigitValue(ch) radix64DigitsInv[(byte)(ch)] + +/* Returns the number of chars needed to encode the given number of bits */ +#define EncodedLength(fmt, numBits) \ + (((numBits) + (fmt)->bitsPerDigit - 1) / (fmt)->bitsPerDigit) +#define EncodeDigit(fmt, value) ((fmt)->digits[value]) +#define DecodeDigit(fmt, digit) ((fmt)->digitsInv[(byte)digit]) + +#define AdvanceCRC(poly, crc, b) \ + ((crc) >> 8) ^ (poly)->table[((crc) ^ (b)) & 0xFF] + +#define RunningCRCFromPageCRC(fmt, pageCRC) \ + (((pageCRC) >> (fmt)->runningCRCShift) & (fmt)->runningCRCMask) + + +CRC CalculateCRC(CRCPoly const *poly, CRC crc, + byte const *buffer, size_t length); +CRC ReverseCRC(CRCPoly const *poly, CRC crc, byte b); + +/* Returns the number of chars encoded */ +int EncodeCheckDigits(EncodeFormat const *fmt, word32 num, + int numBits, char *dest); + +/* Returns 1 if there's an error */ +int DecodeCheckDigits(EncodeFormat const *fmt, char const *src, char **endPtr, + int numBits, word32 *valuePtr); + +EncodeFormat const *FindFormat(char headerTypeChar); + +void InitUtil(); + + +#endif /* !UTIL_H */ + +/* + * Local Variables: + * tab-width: 4 + * End: + * vi: ts=4 sw=4 + * vim: si + */ diff --git a/tools/yapp b/tools/yapp new file mode 100644 index 0000000..ac78227 --- /dev/null +++ b/tools/yapp @@ -0,0 +1,286 @@ +#!/usr/bin/perl +# +# Yet another preprocessor +# +# $Id: yapp,v 1.5 1997/10/24 07:51:05 mhw Exp $ +# + +%vars = ('' => '$'); +@incPath = ("."); + +sub Error +{ + print STDERR $_[0], "\n"; + exit(1); +} + +sub VarSubst +{ + my ($varName, $undefOkay) = @_; + + if (defined($vars{$varName})) + { + return $vars{$varName}; + } + elsif (!$undefOkay) + { + &Error("Undefined variable '$varName' in $fileName line $."); + } +} + +sub NullFilter +{ + 0; +} + +sub IfFilter +{ + local $_ = $_[0]; + + if (/^##else(\s+.*)?/) + { + return 1; + } + elsif (/^##endif(\s+.*)?/) + { + return 2; + } + 
else + { + return 0; + } +} + +sub DoFile +{ + local $fileName = $_[0]; + my $path; + local *FILE; + + if ($fileName =~ m|^/|) + { + $path = $fileName; + } + else + { + for $dir (@incPath) + { + if (-e "$dir/$fileName") + { + $path = "$dir/$fileName"; + last; + } + } + } + if ($path eq "") + { + &Error("Can't find '$fileName', from $fileName line $."); + } + + open(FILE, "<$path") || &Error("Can't open $path: $!"); + &DoOpenFile(*FILE, *NullFilter, 0); + close(FILE) || die; + 0; +} + +sub DoPrepass +{ + local ($_, $skipFlag) = @_; + + return "" if /^###/; + s/\s*###.*//; # Strip comments + s/\${(\w+)}/&VarSubst($1, $skipFlag)/eg; # Do variable substitutions + $_; +} + +sub DoOpenFile +{ + local *FILE = $_[0]; + local *filter = $_[1]; + my $skipFlag = $_[2]; + my $result; + local $_; + + while (<FILE>) + { + $_ = &DoPrepass($_, $skipFlag); + if ($result = &filter($_)) + { + return $result; + } + elsif (/^##(\w*)(\s+(.*))?/) + { + my ($cmd, $params) = ($1, $3); + + if ($cmd =~ /^if/) + { + my $condition; + my $ifStartLine = $.; + + if ($cmd eq "if") + { + if ($params =~ /^(\d+)\s*$/) + { + $condition = int($1); + } + elsif ($params =~ /^(\d+)\s*([=!]=|[<>]=?)\s*(\d+)\s*$/) + { + my ($left, $op, $right) = ($1, $2, $3); + + $condition = eval($left . $op . $right); + } + elsif ($params =~ /^(\S+)\s*(eq|ne)\s*(\S+)\s*$/) + { + my ($left, $op, $right) = ($1, $2, $3); + + $left =~ s/([\\'])/\\$1/g; + $right =~ s/([\\'])/\\$1/g; + $condition = eval("'$left' $op '$right'"); + } + else + { + &Error("Invalid ##if params: '$params' " . + "in $fileName line $."); + } + } + elsif ($cmd =~ /^ifn?def$/) + { + if ($params =~ /^(\w+)\s*$/) + { + $condition = defined($vars{$1}); + $condition = !$condition if ($cmd eq "ifndef"); + } + else + { + &Error("Invalid ##$cmd param: '$params' " . 
+ "in $fileName line $."); + } + } + + # Do main body of if + $result = &DoOpenFile(*FILE, *IfFilter, + $skipFlag || !$condition); + + if ($result == 1) # an '##else' was found + { + # Handle else + $result = &DoOpenFile(*FILE, *IfFilter, + $skipFlag || $condition); + } + + if ($result == 1) # a second '##else' was found + { + &Error("Two ##else's in a row in $fileName line $."); + } + elsif ($result == 0) # EOF was encountered + { + &Error("Unterminated ##if " . + "in $fileName line $ifStartLine"); + } + } + elsif ($cmd eq "include") + { + if ($skipFlag) + { + } + elsif ($params =~ /^"(.*)"\s*$/) + { + my $incFile = $1; + + &DoFile($incFile); + } + else + { + &Error("Invalid ##include params: '$params'"); + } + } + elsif ($cmd eq "set") + { + if ($params =~ /^(\w+)=<<(")(.*)"\s*$/ or + $params =~ /^(\w+)=<<(')(.*)'\s*$/) + { + my $varName = $1; + my $quoteChar = $2; + my $endTag = $3 . "\n"; + my $value; + + while (<FILE>) + { + if ($_ eq $endTag) + { + chop $value; + last; + } + else + { + if ($quoteChar eq '"') + { + $_ = &DoPrepass($_, $skipFlag); + } + $value .= $_; + } + } + if (!$skipFlag) + { + $vars{$varName} = $value; + } + } + elsif ($params =~ /^(\w+)="(.*)"\s*$/ or + $params =~ /^(\w+)=(\S*)\s*$/) + { + if (!$skipFlag) + { + $vars{$1} = $2; + } + } + else + { + &Error("Invalid ##set command: '$params'"); + } + } + else + { + &Error("Unrecognized command: '$_'"); + } + } + elsif (!$skipFlag) + { + print; + } + } + return 0; +} + +$optEnable = 1; + +foreach (@ARGV) +{ + if ($optEnable and /^-/) + { + if (/^--$/) + { + $optEnable = 0; + } + elsif (/^-D(\w+)=(.*)$/) + { + $vars{$1} = $2; + } + elsif (/^-I(.*)$/) + { + unshift @incPath, $1; + } + else + { + &Error("Unrecognized option: '$_'"); + } + } + else + { + &DoFile($_); + } +} + +# +# vi: ai ts=4 +# vim: si +# diff --git a/tools/yapp.doc b/tools/yapp.doc new file mode 100644 index 0000000..94dfe4a --- /dev/null +++ b/tools/yapp.doc @@ -0,0 +1,48 @@ +YAPP is a simple macro preprocessor designed to do minor 
tweaking to +another program's inputs. + +In its input, anything of the form ${foo} is replaced with the value of the +variable named foo. It is an error if ${foo} is not defined. +If you need to escape a dollar sign for some reason, the variable +with the empty string name, ${}, has the value "$". + +The result of macro expansion is *not* re-expanded. Macros in a value are +expanded when the definition is made. + +After variable expansion, lines are checked to see if they are control lines. +Control lines begin with ## at the start of the line. All such lines are +deleted and do not appear in the output. ### is a comment. Other options +are: + +##set variable=value + +value may have one of the following forms: +token: Trailing whitespace is stripped. The token may not contain +any whitespace. Use quotes if it's complicated. +"string": The string may have embedded quotes, and whitespace after + the closing quote. +<<"DELIM": This is a here-document, and the value is all of the following +lines up until, but not including, the newline that precedes a line +that consists solely of DELIM, for any DELIM string. +DELIM must be in quotes. You have two options: +"DELIM": Expand macros in the body of the here-document. +'DELIM': Do not expand macros in the here-document. + +##include "filename": Insert the named file in place of the current line. + +##if num == num +##if num != num +##if num < num +##if num > num +##if num <= num +##if num >= num +##if token eq token +##if token ne token +##ifdef symbol +##ifndef symbol +##else +##endif +You can figure this one out. Macros in between are expanded as usual +(so the ##else or ##endif may be in a macro expansion), but the result +is ignored. String comparison is allowed only between simple words. +##ifdef symbol is true if ${symbol} is defined.
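As a rough illustration of the directives that yapp.doc describes, here is a small hypothetical input file. The variable names (name, count, body), the symbol "missing", and all of the text are invented for this sketch and do not come from the PGP sources. It is not part of the commit.

```text
### A comment line: it produces no output.
##set name=World
##set count=3
Hello, ${name}!
##if ${count} > 2
count is greater than 2
##else
count is 2 or less
##endif
##ifdef missing
this line is skipped because "missing" was never set
##endif
##set body=<<"END"
Dear ${name},
see you soon.
END
${body}
```

If yapp behaves as documented, the Hello line and the first ##if branch should reach the output with their variables substituted, the ##ifdef block should be dropped, and the final line should expand to the two-line here-document. Because the here-document delimiter is in double quotes, ${name} inside it would be expanded when body is defined. Going by the option loop in tools/yapp, a variable can also be supplied on the command line as -Dname=value, and -Idir adds a directory to the ##include search path.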