How to get OCR of digital data with 100% accuracy?

by Martin Monperrus

I want to print digital data, that is, to store digital data on paper. There are many use cases for this, such as backing up cryptographic keys (GPG keys on paper) or sending encrypted messages over regular mail (see Paper encryption technology).

By working on paper storage, I discovered that OCR of digital data is a hard problem.

Problems with OCR of digital data

When you OCR digital data, you need perfect recognition, with no errors. Otherwise, the resulting scanned data is corrupted and in many cases unusable (an encrypted message cannot be read, a compressed file cannot be decompressed). For OCR of digital data, we need 100% of the characters to be correctly recognized (i.e. a character error rate – CER – of 0).

This is challenging: OCR engines confuse visually similar characters (e.g. 1/l, O/0, x/X), and random digital data lacks the natural-language regularities that OCR engines implicitly rely on.

This means that most usual data encodings cannot be recognized with 100% accuracy.

To sum up, OCR of digital data is hard, and the problem space has several dimensions: the encoding, the font, the OCR engine, and the scanning resolution (DPI).

OCR of hexadecimal with 100% accuracy

After weeks of trial and error, I was finally able to get 100% accuracy for automatically recognizing hexadecimal data (i.e. base16 with the alphabet “0123456789abcdef”).

The solution is summarized in the figure below.

Figure: Printing and OCRing an encrypted GPG file encoded in base16

With this setup, we can get approximately 2.5 kilobytes per A4 page.
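
For reference, here is a minimal sketch of preparing a file for printing in base16, using only the Python standard library (the file name and line width are illustrative):

import base64
import textwrap

# Read the binary payload (illustrative file name).
data = open("message.gpg", "rb").read()
# base16-encode with the alphabet "0123456789abcdef".
text = base64.b16encode(data).decode().lower()
# Wrap into 80-character lines for printing.
print("\n".join(textwrap.wrap(text, 80)))

At 80 hexadecimal characters per line, each printed line carries 40 bytes, so 2.5 kilobytes correspond to roughly 64 lines on the page.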

OCR of BIP39 with 100% accuracy

BIP39 is a novel encoding scheme coming from the Bitcoin world. It has the following characteristics: the data is encoded in 2048 dimensions (as opposed to 16 dimensions for hexadecimal), and each dimension is represented by a word from a dictionary (see the dictionary of 2048 English words).

For example, the hexadecimal data “abcdabcdabcdabcdabcdabcdabcdabcd” is encoded in BIP39 as “profit hood vibrant fiscal survey traffic quality rely soap fury helmet once”.
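
This round trip can be reproduced in a few lines of Python; here is a minimal sketch assuming the third-party python-mnemonic package (pip install mnemonic):

from mnemonic import Mnemonic

# BIP39 maps every 11 bits (data plus a few checksum bits) to one of 2048 words.
m = Mnemonic("english")
data = bytes.fromhex("abcdabcdabcdabcdabcdabcdabcdabcd")  # 16 bytes = 128 bits
words = m.to_mnemonic(data)
print(words)  # the 12-word mnemonic quoted above
assert m.to_entropy(words) == data  # decodes back to the original bytes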

This encoding has a few other nice characteristics: it can be used in several languages (not only English), and the first four letters of each of the 2048 words are unique.

BIP39 is great for OCR of digital data! First, it contains intrinsic redundancy that enables error correction: if some characters are not well recognized, one can recover them by looking at the neighboring characters in order to identify a word from the BIP39 dictionary. Second, most modern OCR systems are based on machine learning and implicitly embed, after training, some knowledge about likely character sequences (e.g. “er” is likely but “xz” is unlikely) and about known words. This means that most off-the-shelf OCR engines natively handle BIP39 well.
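
For instance, here is a minimal word-snapping sketch of this post-correction, assuming the 2048-word dictionary is available locally as english.txt (one word per line):

import difflib

# The BIP39 English wordlist (file name is an assumption of this sketch).
with open("english.txt") as f:
    WORDLIST = f.read().split()

def snap(token):
    # Exact match first.
    if token in WORDLIST:
        return token
    # The first four letters of each word are unique: try a prefix match.
    hits = [w for w in WORDLIST if w.startswith(token[:4])]
    if len(hits) == 1:
        return hits[0]
    # Otherwise fall back to the closest word by string similarity.
    return difflib.get_close_matches(token, WORDLIST, n=1, cutoff=0.0)[0]

print(snap("profjt"))  # -> "profit"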

Indeed, I can get 100% accuracy for OCRing digital data as follows.

Figure: Printing binary data encoded in BIP39

With this setup, we can get 1 kilobyte per A4 page: this encoding is not space-efficient, but this is the price to pay for having redundancy and zero errors.

OCR of base64 with 100% post-correction

OCR of base64 is a nightmare because it suffers from all possible pitfalls (1/l, O/0, upper-case/lower-case confusions such as x/X, etc.). Unless extreme effort is put into fine-tuning OCR for base64, one cannot get 100% accuracy.

One alternative solution for OCRing base64 is to repair the most common OCR errors (given a font, DPI and OCR engine), see ref (Plumb97). The idea is to prefix every line with a checksum and, as long as the base64-decoded line does not match the checksum, to replace potential errors and try again. A line then looks like this:

36913731747 JLFzgczRKrL803MYDJqyexatN25s4E4SUL/cDTj1ANoTfD97dwjy

where the first part is the checksum and the second part the base64 data. The overall repair algorithm is the following.

for each line:
  expected_checksum, payload = split(line)
  trial = payload
  while checksum(base64decode(trial)) != expected_checksum:
    trial = try_repair(payload) # try the next combination of potential error fixes

The checksum is key in this architecture: it is what tells the repair loop that a line has been fully recovered.
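
Here is a minimal sketch of this repair loop. The checksum function and the confusion table are illustrative assumptions (I use a decimal CRC32 and a handful of classic confusion pairs; the original checksum function is not specified here):

import base64
import binascii
import itertools

# Classic OCR confusion pairs (illustrative; a real table is engine-specific).
CONFUSIONS = {"1": "l", "l": "1", "O": "0", "0": "O", "x": "X", "X": "x"}

def candidates(payload):
    # Yield repair candidates, starting with the fewest substitutions.
    positions = [i for i, c in enumerate(payload) if c in CONFUSIONS]
    for r in range(len(positions) + 1):
        for combo in itertools.combinations(positions, r):
            chars = list(payload)
            for i in combo:
                chars[i] = CONFUSIONS[chars[i]]
            yield "".join(chars)

def repair_line(line):
    expected, payload = line.split()
    for trial in candidates(payload):
        try:
            data = base64.b64decode(trial, validate=True)
        except binascii.Error:
            continue  # not even valid base64, try the next candidate
        if str(binascii.crc32(data)) == expected:
            return data
    raise ValueError("line could not be repaired")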

Figure: Printing binary data encoded in repairable base64

With this setup, we can get 4.2 kilobytes per A4 page with Inconsolata 11pt.

Discussion

OCR of base32. The lowest error rate I’ve achieved so far is 27/6400 characters (0.4%) using GOCR/FreeMono/base32 in lower case.

OCR with error-correction. It is possible to use a traditional error-correcting code (ECC) to compensate for character errors. Yet, this has two main problems. First, it imposes a significant size increase due to redundancy: using base32 and the Reed-Solomon implementation of fridex, the data is 74% larger and one can accept up to 1.5% of character-level errors. Second, the error model of ECC does not necessarily match the error model of OCR. In particular, OCR may produce extraneous or missing characters, and such insertions and deletions break ECC, which is designed for substitutions.
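
A minimal sketch of this substitution-only behavior, assuming the third-party reedsolo package (version 1.x) rather than the fridex implementation mentioned above:

from reedsolo import RSCodec

# 10 parity bytes per block: corrects up to 5 substituted bytes.
rsc = RSCodec(10)
encoded = rsc.encode(b"secret key material")
corrupted = bytearray(encoded)
corrupted[0] ^= 0xFF  # simulate an OCR substitution error
decoded, _, _ = rsc.decode(bytes(corrupted))  # decode returns a 3-tuple in reedsolo 1.x
assert decoded == b"secret key material"
# An inserted or deleted character, by contrast, desynchronizes the whole block.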

References

Appendix

Finding an optimal font for a given OCR engine. One can simply do a search over all possible fonts to find the least confusing font for a given OCR engine and DPI:

for all available fonts:
  generate a random piece of text 
  OCR it
  compare the original text and the recognized one
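
A runnable version of this loop, assuming Pillow and pytesseract; the font directory, the point size and the crude positional character error rate are illustrative:

import glob
import random
import string
from PIL import Image, ImageDraw, ImageFont
import pytesseract

def cer(reference, hypothesis):
    # Crude CER: positional mismatches plus the length difference (no alignment).
    errors = sum(a != b for a, b in zip(reference, hypothesis))
    return (errors + abs(len(reference) - len(hypothesis))) / len(reference)

text = "".join(random.choices(string.ascii_lowercase + string.digits, k=200))
lines = "\n".join(text[i:i + 40] for i in range(0, len(text), 40))
for path in glob.glob("/usr/share/fonts/**/*.ttf", recursive=True):
    font = ImageFont.truetype(path, 24)
    image = Image.new("L", (800, 300), color=255)  # white grayscale canvas
    ImageDraw.Draw(image).text((10, 10), lines, font=font, fill=0)
    recognized = pytesseract.image_to_string(image).replace("\n", "")
    print(path, cer(text, recognized))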

The optimal font varies a lot with the considered engine. Monospaced fonts (aka fixed-width), such as Inconsolata, are more appropriate in general. OCR-A and OCR-B give really poor results. AnyOCR.ttf is OK for base32.

GOCR: I found that the best fonts for OCR with GOCR are 1) Free Mono 2) Inconsolata.

Tesseract: results for Tesseract are reported at https://superuser.com/questions/1096425/optimal-font-for-tesseract-specifically-the-net-wrapper

Finding an optimal base32 alphabet for OCR. Instead of using existing encodings, one can synthesize a tailored encoding for a specific engine. The idea is to find an alphabet that works well by removing all characters that confuse a given OCR engine. In other words, we create a new alphabet by selecting the symbols that are optimal for automated recognition.

I’ve experimented with the following algorithm to identify the least confusing alphabet:

create an initial alphabet with all characters
while the alphabet has more than the target number of symbols (e.g. 32):
  generate a random piece of text with the alphabet
  OCR it
  compare the original text and the recognized one
  remove the symbol with the most errors from the alphabet

Typically, the initial alphabet would be composed of all the printable characters (approx. 100 characters in the Latin world).
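
A sketch of this greedy pruning, wrapping the rasterization and recognition steps from the font-search sketch above into hypothetical render() and ocr() helpers:

import collections
import random

alphabet = [chr(c) for c in range(33, 127)]  # printable ASCII, without space
while len(alphabet) > 32:
    text = "".join(random.choices(alphabet, k=400))
    recognized = ocr(render(text))  # rasterize the text and run the OCR engine
    # Count per-symbol mismatches (crude positional alignment).
    errors = collections.Counter(a for a, b in zip(text, recognized) if a != b)
    if errors:  # if a round is error-free, retry with fresh random text
        alphabet.remove(errors.most_common(1)[0][0])
print("".join(alphabet))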

I’ve done experiments with a few off-the-shelf OCR systems and a few fonts. So far, the best alphabet for OCR in 32 dimensions, which I call bocr32, is !(+./367;?acdefghjkmnoprstwxz}§â (GOCR/Inconsolata).

Figure: Example of bocr32-encoded data