How to get OCR of digital data with 100% accuracy?

by Martin Monperrus

I want to print digital data, that is, to store digital data on paper. There are many use cases for this, such as backing up cryptographic keys (GPG keys on paper) or sending encrypted messages over regular mail (see Paper encryption technology).

By working on paper storage, I discovered that OCR of digital data is a hard problem.

Problems with OCR of digital data

When you OCR digital data, you need perfect recognition, with no errors. Otherwise, the resulting scanned data is corrupted and in many cases unusable (an encrypted message cannot be read, a compressed file cannot be decompressed). For OCR of digital data, we need 100% of the characters to be correctly recognized (i.e. a character error rate – CER – of 0).

This is challenging: OCR engines confuse visually similar characters (e.g. 1/l, O/0, x/X), and random digital data lacks the natural-language regularities that OCR engines implicitly rely on.

This means that most usual data encodings cannot be recognized with 100% accuracy.

To sum up, OCR of digital data is hard, and the problem space has several dimensions: the encoding, the font, the OCR engine, and the scanning resolution (DPI).

OCR of hexadecimal with 100% accuracy

After weeks of trial and error, I was finally able to get 100% accuracy for automatically recognizing hexadecimal data (i.e. base16 with the alphabet “0123456789abcdef”).

The solution is summarized in the figure below.

Figure: Printing and OCRing an encrypted GPG file encoded in base16

With this setup, we can get approximately 2.5 kilobytes per A4 page.
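
For reference, here is a minimal sketch of preparing a file for printing in base16, using only the Python standard library (the file name and line width are illustrative):

import base64
import textwrap

# Read the binary payload (illustrative file name).
data = open("message.gpg", "rb").read()
# base16-encode with the alphabet "0123456789abcdef".
text = base64.b16encode(data).decode().lower()
# Wrap into 80-character lines for printing.
print("\n".join(textwrap.wrap(text, 80)))

At 80 hexadecimal characters per line, each printed line carries 40 bytes, so 2.5 kilobytes correspond to roughly 64 lines on the page.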

OCR of BIP39 with 100% accuracy

BIP39 is a novel encoding scheme coming from the Bitcoin world. It has the following characteristics: the data is encoded in 2048 dimensions (as opposed to 16 dimensions for hexadecimal), and each dimension is represented by a word from a dictionary (see the dictionary of 2048 English words).

For example, the hexadecimal data “abcdabcdabcdabcdabcdabcdabcdabcd” is encoded in BIP39 as “profit hood vibrant fiscal survey traffic quality rely soap fury helmet once”.
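
This round trip can be reproduced in a few lines of Python; here is a minimal sketch assuming the third-party python-mnemonic package (pip install mnemonic):

from mnemonic import Mnemonic

# BIP39 maps every 11 bits (data plus a few checksum bits) to one of 2048 words.
m = Mnemonic("english")
data = bytes.fromhex("abcdabcdabcdabcdabcdabcdabcdabcd")  # 16 bytes = 128 bits
words = m.to_mnemonic(data)
print(words)  # the 12-word mnemonic quoted above
assert m.to_entropy(words) == data  # decodes back to the original bytes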

This encoding has a few other nice characteristics: it can be used in several languages (not only English), and the first four letters of each of the 2048 words are unique.

BIP39 is great for OCR of digital data! First, it contains intrinsic redundancy that enables error correction: if some characters are not well recognized, one can recover them by looking at the neighboring characters in order to identify a word from the BIP39 dictionary. Second, most modern OCR systems are based on machine learning and implicitly embed, after training, some knowledge about likely character sequences (e.g. “er” is likely but “xz” is unlikely) and about known words. This means that most off-the-shelf OCR engines natively handle BIP39 well.
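
For instance, here is a minimal word-snapping sketch of this post-correction, assuming the 2048-word dictionary is available locally as english.txt (one word per line):

import difflib

# The BIP39 English wordlist (file name is an assumption of this sketch).
with open("english.txt") as f:
    WORDLIST = f.read().split()

def snap(token):
    # Exact match first.
    if token in WORDLIST:
        return token
    # The first four letters of each word are unique: try a prefix match.
    hits = [w for w in WORDLIST if w.startswith(token[:4])]
    if len(hits) == 1:
        return hits[0]
    # Otherwise fall back to the closest word by string similarity.
    return difflib.get_close_matches(token, WORDLIST, n=1, cutoff=0.0)[0]

print(snap("profjt"))  # -> "profit"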

Indeed, I can get 100% accuracy for OCRing digital data as follows.

Figure: Printing binary data encoded in BIP39

With this setup, we can get 1 kilobyte per A4 page: this encoding is not space-efficient, but this is the price to pay for having redundancy and zero errors.

OCR of base64 with 100% post-correction

OCR of base64 is a nightmare because it suffers from all possible pitfalls (1/l, O/0, upper-case/lower-case confusions such as x/X, etc.). Unless extreme effort is put into fine-tuning OCR for base64, one cannot get 100% accuracy.

One alternative solution for OCRing base64 is to repair the most common OCR errors (given a font, DPI and OCR engine), see ref (Plumb97). The idea is to prefix every line with a checksum and, as long as the base64-decoded line does not match the checksum, to replace potential errors and try again. A line then looks like this:

36913731747 JLFzgczRKrL803MYDJqyexatN25s4E4SUL/cDTj1ANoTfD97dwjy

where the first part is the checksum and the second part the base64 data. The overall repair algorithm is the following.

for each line:
  expected_checksum, payload = split(line)
  trial = payload
  while checksum(base64decode(trial)) != expected_checksum:
    trial = try_repair(payload) # try the next combination of potential error fixes

The checksum is key in this architecture: it is what tells the repair loop that a line has been fully recovered.
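
Here is a minimal sketch of this repair loop. The checksum function and the confusion table are illustrative assumptions (I use a decimal CRC32 and a handful of classic confusion pairs; the original checksum function is not specified here):

import base64
import binascii
import itertools

# Classic OCR confusion pairs (illustrative; a real table is engine-specific).
CONFUSIONS = {"1": "l", "l": "1", "O": "0", "0": "O", "x": "X", "X": "x"}

def candidates(payload):
    # Yield repair candidates, starting with the fewest substitutions.
    positions = [i for i, c in enumerate(payload) if c in CONFUSIONS]
    for r in range(len(positions) + 1):
        for combo in itertools.combinations(positions, r):
            chars = list(payload)
            for i in combo:
                chars[i] = CONFUSIONS[chars[i]]
            yield "".join(chars)

def repair_line(line):
    expected, payload = line.split()
    for trial in candidates(payload):
        try:
            data = base64.b64decode(trial, validate=True)
        except binascii.Error:
            continue  # not even valid base64, try the next candidate
        if str(binascii.crc32(data)) == expected:
            return data
    raise ValueError("line could not be repaired")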

Figure: Printing binary data encoded in repairable base64

With this setup, we can get 4.2 kilobytes per A4 page with Inconsolata 11pt.

Discussion

OCR of base32. The lowest error rate I’ve achieved so far is 27/6400 characters (0.4%) using GOCR/FreeMono/base32 in lower case.

OCR with error-correction. It is possible to use a traditional error-correcting code (ECC) to compensate for character errors. Yet, this has two main problems. First, it imposes a significant size increase due to redundancy: using base32 and the Reed-Solomon implementation of fridex, the data is 74% larger and one can accept up to 1.5% of character-level errors. Second, the error model of ECC does not necessarily match the error model of OCR. In particular, OCR may produce extraneous or missing characters, and such insertions and deletions break ECC, which is designed for substitutions.
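
A minimal sketch of this substitution-only behavior, assuming the third-party reedsolo package (version 1.x) rather than the fridex implementation mentioned above:

from reedsolo import RSCodec

# 10 parity bytes per block: corrects up to 5 substituted bytes.
rsc = RSCodec(10)
encoded = rsc.encode(b"secret key material")
corrupted = bytearray(encoded)
corrupted[0] ^= 0xFF  # simulate an OCR substitution error
decoded, _, _ = rsc.decode(bytes(corrupted))  # decode returns a 3-tuple in reedsolo 1.x
assert decoded == b"secret key material"
# An inserted or deleted character, by contrast, desynchronizes the whole block.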

References

Appendix

Finding an optimal font for a given OCR engine. One can simply do a search over all possible fonts to find the least confusing font for a given OCR engine and DPI:

for all available fonts:
  generate a random piece of text 
  OCR it
  compare the original text and the recognized one
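
A runnable version of this loop, assuming Pillow and pytesseract; the font directory, the point size and the crude positional character error rate are illustrative:

import glob
import random
import string
from PIL import Image, ImageDraw, ImageFont
import pytesseract

def cer(reference, hypothesis):
    # Crude CER: positional mismatches plus the length difference (no alignment).
    errors = sum(a != b for a, b in zip(reference, hypothesis))
    return (errors + abs(len(reference) - len(hypothesis))) / len(reference)

text = "".join(random.choices(string.ascii_lowercase + string.digits, k=200))
lines = "\n".join(text[i:i + 40] for i in range(0, len(text), 40))
for path in glob.glob("/usr/share/fonts/**/*.ttf", recursive=True):
    font = ImageFont.truetype(path, 24)
    image = Image.new("L", (800, 300), color=255)  # white grayscale canvas
    ImageDraw.Draw(image).text((10, 10), lines, font=font, fill=0)
    recognized = pytesseract.image_to_string(image).replace("\n", "")
    print(path, cer(text, recognized))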

The optimal font varies a lot with the considered engine. Monospaced fonts (aka fixed-width), such as Inconsolata, are more appropriate in general. OCR-A and OCR-B give really poor results. AnyOCR.ttf is OK for base32.

GOCR: I found that the best fonts for OCR with GOCR are 1) Free Mono 2) Inconsolata.

Tesseract: results for Tesseract are reported at https://superuser.com/questions/1096425/optimal-font-for-tesseract-specifically-the-net-wrapper

Finding an optimal base32 alphabet for OCR. Instead of using existing encodings, one can synthesize a tailored encoding for a specific engine. The idea is to find an alphabet that works well by removing all characters that confuse a given OCR engine. In other words, we create a new alphabet by selecting the symbols that are optimal for automated recognition.

I’ve experimented with the following algorithm to identify the least confusing alphabet:

create an initial alphabet with all characters
while the alphabet has more than the target number of symbols (e.g. 32):
  generate a random piece of text with the alphabet
  OCR it
  compare the original text and the recognized one
  remove the symbol with the most errors from the alphabet

Typically, the initial alphabet would be composed of all the printable characters (approx. 100 characters in the Latin world).
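
A sketch of this greedy pruning, wrapping the rasterization and recognition steps from the font-search sketch above into hypothetical render() and ocr() helpers:

import collections
import random

alphabet = [chr(c) for c in range(33, 127)]  # printable ASCII, without space
while len(alphabet) > 32:
    text = "".join(random.choices(alphabet, k=400))
    recognized = ocr(render(text))  # rasterize the text and run the OCR engine
    # Count per-symbol mismatches (crude positional alignment).
    errors = collections.Counter(a for a, b in zip(text, recognized) if a != b)
    if errors:  # if a round is error-free, retry with fresh random text
        alphabet.remove(errors.most_common(1)[0][0])
print("".join(alphabet))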

I’ve done experiments with a few off-the-shelf OCR systems and a few fonts. So far, the best alphabet for OCR in 32 dimensions, which I call bocr32, is !(+./367;?acdefghjkmnoprstwxz}§â (GOCR/Inconsolata).

Figure: Example of bocr32-encoded data