use UTF-8 encoding
parent 7fe8e1ffe0
commit 1c3623d719
54
README
@@ -25,7 +25,7 @@ cryptographic software is subject to U.S. export control laws and
 regulations. The new 1997 Commerce Department Export Administration
 Regulations (EAR) explicitly provide that "A printed book or other printed
 material setting forth encryption source code is not itself subject to the
-EAR." (see 15 C.F.R. §734.3(b)(2)). PGP, in an overabundance of caution,
+EAR." (see 15 C.F.R. §734.3(b)(2)). PGP, in an overabundance of caution,
 has only made available its source code in a form that is not subject to
 those regulations. So, books containing cryptographic source code may be
 published, and after they are published they may be exported, but only
@@ -167,24 +167,24 @@ The first step to getting OmniPage 7 to work well is to set it up with
 options to disable all of its more advanced features for preserving font
 changes and formatting. Look in the Settings menu.
 
-· Create a Zone Contents File with all of ASCII in it, plus the extra
+· Create a Zone Contents File with all of ASCII in it, plus the extra
   bullet, currency, yen and pilcrow symbols. Name it "Source Code".
-· Create a Source Code style set. Within it, create a Source Code zone style
+· Create a Source Code style set. Within it, create a Source Code zone style
   and make it the default.
-  · Set the font to something fixed-width, like Courier.
-  · Set a fixed font size (10 point) and plain text, left-aligned.
-  · Set the tab character to a space.
-  · Set the text flow to hard line returns.
-  · Set the margins to their widest.
-  · The font mapping options are irrelevant.
+  · Set the font to something fixed-width, like Courier.
+  · Set a fixed font size (10 point) and plain text, left-aligned.
+  · Set the tab character to a space.
+  · Set the text flow to hard line returns.
+  · Set the margins to their widest.
+  · The font mapping options are irrelevant.
 
 Go to the settings panel and:
 
-· Under Scanner, set the brightness to manual. With careful setting of the
+· Under Scanner, set the brightness to manual. With careful setting of the
   threshold, this generates much better results than either the automatic
   threshold or the 3D OCR. Around 144 has been a good setting for us; you
   may want to start there.
-· Under OCR, you'll build a training file to use later, but turn off
+· Under OCR, you'll build a training file to use later, but turn off
   automatic page orientation and select your Source Code style set in the
   Output Options. Also set a reasonable reject character. (For test, we
   used the pi symbol, which came across from the Macintosh as a weird
@@ -228,26 +228,26 @@ specific Latin-1 characters to be processed.
 
 The characters most in need of training are as follows:
 
-· Zero is printed 'slashed.'
-· Lowercase L has a curled tail to distinguish it clearly from other
+· Zero is printed 'slashed.'
+· Lowercase L has a curled tail to distinguish it clearly from other
   vertical characters like 1 and I.
-· The or-bar or pipe symbol '|' is printed "broken" with a gap in the
+· The or-bar or pipe symbol '|' is printed "broken" with a gap in the
   middle to distinguish it similarly.
-· The underscore character has little "serifs" on the end to distinguish
+· The underscore character has little "serifs" on the end to distinguish
   it from a minus sign. We also raised it just a tad higher than the
   normal underscore character, which was too low in the character cell to
   be reliably seen by OmniPage.
-· Tabs are printed as a hollow right-pointing triangle, followed by blanks
+· Tabs are printed as a hollow right-pointing triangle, followed by blanks
   to the correct alignment position. If not trained enough, OmniPage
   guesses this is a capital D. You should train OmniPage to recognize this
   symbol as a currency symbol (Latin-1 244).
-· Any spaces in the original that follow a space, or a blank on the printed
+· Any spaces in the original that follow a space, or a blank on the printed
   page, are printed as a tiny black triangle. You should train OmniPage to
   recognize this as a center dot or bullet (Latin-1 267). We didn't use a
   standard center dot because OmniPage confused it with a period.
-· Any form feeds in the original are printed as a yen currency symbol
+· Any form feeds in the original are printed as a yen currency symbol
   (Latin-1 245).
-· Lines over 80 columns long are broken after 79 columns by appending a big
+· Lines over 80 columns long are broken after 79 columns by appending a big
   ugly black block. You should train OmniPage to recognize this as a
   pilcrow (paragraph symbol, Latin-1 266). We did this because after
   deciding something black and visible was suitable, we found out the font
@@ -264,16 +264,16 @@ to train on, use that.
 
 Other things that need training:
 
-· ~ (tilde), ^ (caret), ` (backquote) and ' (quote). These get dropped
+· ~ (tilde), ^ (caret), ` (backquote) and ' (quote). These get dropped
   frequently unless you train them.
-· i, j and ; (semicolon). These get mixed up.
-· 3 and S. These also get mixed up.
-· Q can fail to be recognized.
-· C and [ can be confused.
-· c/C, o/O, p/P, s/S, u/U, v/V, w/W, y/Y and z/Z are often confused. This
+· i, j and ; (semicolon). These get mixed up.
+· 3 and S. These also get mixed up.
+· Q can fail to be recognized.
+· C and [ can be confused.
+· c/C, o/O, p/P, s/S, u/U, v/V, w/W, y/Y and z/Z are often confused. This
   can be helped by some training.
-· r gets confused with c and n. I don't understand c, but it happens.
-· f gets confused with i.
+· r gets confused with c and n. I don't understand c, but it happens.
+· f gets confused with i.
 
 The OCR training pages have lots of useful examples of troublesome
 characters. Scan a few pages of material, training each page, then scan a
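
The README text in the diff above describes four printed markers: tabs shown as a symbol trained as the currency sign, repeated spaces as a center dot, form feeds as a yen sign, and over-long lines ending in a pilcrow. Once a page has been OCRed, those markers have to be mapped back to the original characters. The following is only a rough Python sketch of that reverse step. The function names are hypothetical, the code points assume the README's Latin-1 numbers (244, 267, 245, 266) are octal, and this is not the tooling that accompanies the book; it does no error checking of any kind.

```python
import re
from typing import Iterable, List, Tuple

# Assumed code points: the README's Latin-1 numbers read as octal.
TAB_MARK = "\u00a4"       # currency sign (octal 244), printed in place of a tab
SPACE_MARK = "\u00b7"     # middle dot (octal 267), a space that follows a blank
FORMFEED_MARK = "\u00a5"  # yen sign (octal 245), printed in place of a form feed
WRAP_MARK = "\u00b6"      # pilcrow (octal 266), appended where a long line was broken


def decode_line(line: str) -> Tuple[str, bool]:
    """Undo the printed markers on one OCRed line.

    Returns the decoded text and a flag saying whether the line
    continues on the next printed line.
    """
    continues = line.endswith(WRAP_MARK)
    if continues:
        line = line[: -len(WRAP_MARK)]
    # A tab marker is followed by plain blanks up to the alignment
    # position, so the marker and its padding collapse into one tab.
    line = re.sub(re.escape(TAB_MARK) + " *", "\t", line)
    line = line.replace(SPACE_MARK, " ")
    line = line.replace(FORMFEED_MARK, "\f")
    return line, continues


def decode_page(lines: Iterable[str]) -> List[str]:
    """Decode a sequence of OCRed lines, rejoining broken long lines."""
    out: List[str] = []
    pending = ""
    for raw in lines:
        text, continues = decode_line(raw)
        pending += text
        if not continues:
            out.append(pending)
            pending = ""
    if pending:  # a trailing continuation with nothing after it
        out.append(pending)
    return out
```

Because a space that follows printed padding is itself shown as a center dot, removing the plain blanks after a tab marker should not swallow real spaces under these conventions.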