Summary: By themselves, neither Tesseract nor GPT4-vision-preview produces acceptable transcriptions of scanned German documents. However, including Tesseract's transcription output in the GPT4 prompt substantially improves the final transcription quality.

“NARA Rolls” is a colloquial name for a collection of captured German World War 2 documents that were microfilmed in the 1950s. Here I’m working with T77 roll 619, which I obtained as scanned JPGs after paying the National Archives around $130. This one roll alone contains over 1,000 pages, and the entire dataset is approximately 70,000 rolls. The collection is a very detailed record of the operations of Nazi Germany, and while technically available to the public, it has been difficult to use in research because of how the data must be obtained and the format in which it arrives.

My journey with this dataset started when I was trying to find some data on Radom pistols, quite a trifling matter. After seeing the enormity of this collection, I realized it offers unique insight into the operations of a state. While not complete, the scale and level of detail of this documentation make it unprecedented. Beyond historical knowledge of WW2, one may also be able to glean answers to fundamental questions about statehood: what holds a state together, how information and decisions flow through its various branches, and what seeds of demise may exist in its institutions. To gain these insights, however, researchers will need to be able to query the entire dataset efficiently. I thought about this for a while, and I don’t think there’s any fundamental barrier to making that happen. Given off-the-shelf technologies, it is feasible to build NARA Rolls into a multilingual, searchable dataset that can be added to over time.

Unsure where to begin, I decided to start by comparing two image-to-text technologies (otherwise known as OCR, optical character recognition): Tesseract and GPT4, which recently acquired “vision” capabilities, meaning you can submit an image to it and ask questions about the image. While I try to be methodical, I realize this is just a snapshot in time of each technology’s capabilities, so the main benefit of this step is to gain insight into the data and the challenges of OCR. With these insights, I hope to better approach the rest of this project.

For the comparison, I selected 25 random files from the dataset of over 1,000. When the random number pointed to a file that was not a good candidate for OCR, such as a separator title page, I advanced to the next one that might contain useful text; I may have done this twice or so. The total cost for the initial GPT4 run was 0.61 USD. It is difficult to come up with comparison criteria, since neither transcription is perfect, and I don’t have the patience to hand-transcribe even a single document in the comparison set. Even if such a gold standard existed, it would be unclear how to count missing, misidentified, or extra items. So I settled on a subjective count of the errors I could identify while reading the documents side by side. Generally, a missing or misidentified word counted as 1 error, and I ignored minor punctuation issues.
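For the curious, the draw itself is trivial; a minimal sketch is below. The filenames are assumptions based on the file list at the end of this post, and the manual rejection step for non-text pages is omitted.

```python
import random

# Hypothetical sketch of the sampling step: page scans on a roll are
# numbered JPGs (0001.jpg .. 1042.jpg for this roll).
pages = [f"{n:04d}.jpg" for n in range(1, 1043)]

random.seed(0)  # fixed seed makes the draw reproducible
sample = sorted(random.sample(pages, 25))  # 25 distinct pages, in order
```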

Error counts

Page number   GPT4-vision-preview   Tesseract 5.3.3
0028          6                     20
0053          24                    17
0129          99                    20

Select qualitative errors

Page 0028
  GPT4-vision-preview: misspells the location Sadkow
  Tesseract 5.3.3: fails to identify document dates; fails to identify subsection 3.); phantom empty lines

Page 0053
  GPT4-vision-preview: completely mistranscribes “zweier Kocher beendet” as “wieder beschädigt worden”; hallucinates “kurzfristigen Termin” into “fortlaufenden Vorräte”; misidentifies the location Glowno as Glomun
  Tesseract 5.3.3: Haute as Hute multiple times; multiple misidentifications of “GG” (Generalgouvernement) as “66”; phantom empty lines

Page 0129
  GPT4-vision-preview: refused to transcribe, citing personally identifiable information
  Tesseract 5.3.3: struggles when stamps and non-linear text are present

Seeing how both GPT4 and Tesseract were struggling, and that neither would be acceptable for this project by itself, I decided to include the Tesseract-extracted text in the GPT4 prompt, and that resulted in a transcription with only 4 errors! The hallucinations were also gone. The total cost for the 25 pages increased to 0.79 USD.

I did error counts on the first 7 pages and, finding them acceptable, decided to only spot-check the remaining ones. GPT4’s comments and disclaimers were the only issue noted. I modified the prompt to try to keep it from posting the disclaimers and re-ran all the data. The modified prompt reduced the number of disclaimers from 3 to 1; since disclaimers would still have to be dealt with, I chose not to include these extra instructions in the final prompt.

Note that even though I am not including the data here for brevity, the second run produced slightly different output: some pages were transcribed better, some worse. This can be explained by GPT4’s non-deterministic behavior. It is possible that transcription quality could be further improved by comparing, say, 3 different transcription attempts, but at this point I’m not interested in pursuing such an approach due to cost and diminishing returns.
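If one did want to combine multiple attempts, the simplest version is a per-line majority vote. The sketch below naively aligns lines by index, which only works when every attempt emits the same number of lines; given the phantom empty lines noted above, a real implementation would need proper sequence alignment.

```python
from collections import Counter

def consensus(attempts: list[str]) -> str:
    """Naive per-line majority vote across transcription attempts.

    Assumes every attempt produced the same number of lines; phantom or
    missing lines would require sequence alignment to handle correctly.
    """
    per_attempt = [a.splitlines() for a in attempts]
    voted = []
    for lines in zip(*per_attempt):
        # keep the most common version of this line across attempts
        winner, _count = Counter(lines).most_common(1)[0]
        voted.append(winner)
    return "\n".join(voted)
```

For example, `consensus(["Abschrift\nBerlin", "Abschrift\nBertin", "Abschrift\nBerlin"])` keeps “Berlin”, outvoting the one attempt that misread it.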

Error counts

Page number   GPT4-vision-preview with Tesseract   Comments
0028          4                                    English disclaimer surrounds transcribed text, triple-tick delimited
0053          4
0129          2
0197          1
0200          1
0235          0
0268          3
0612                                               English disclaimer at beginning and end, no delimiter
0820                                               English disclaimer surrounds transcribed text, triple-tick delimited

Prompt used in this experiment:

User: Transcribe the German text in this image exactly. Output a line of text per line of text in the document. To assist you in the transcription, below is Tesseract’s attempt at extracting text from this image. Note, Tesseract can be incorrect, but you can use it to help in your transcription. Tesseract text (enclosed in ```): [ triple quoted Tesseract text follows ]
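For reference, a sketch of how that prompt can be packed into a GPT4-vision-preview request is below. The message shape follows OpenAI’s vision chat format (text plus a base64 data-URL image part); obtaining the Tesseract text (e.g. via pytesseract) and the actual API call are left out, so treat this as an assumption-laden illustration rather than my exact harness.

```python
import base64

def build_messages(image_bytes: bytes, tesseract_text: str) -> list:
    """Assemble a chat payload pairing the page scan with Tesseract's draft."""
    prompt = (
        "Transcribe the German text in this image exactly. "
        "Output a line of text per line of text in the document. "
        "To assist you in the transcription, below is Tesseract's attempt "
        "at extracting text from this image. Note, Tesseract can be "
        "incorrect, but you can use it to help in your transcription. "
        "Tesseract text (enclosed in ```):\n"
        f"```\n{tesseract_text}\n```"
    )
    b64 = base64.b64encode(image_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # image travels inline as a base64 data URL
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]
```

The returned list would then be passed as `messages` to a chat-completions call against the gpt-4-vision-preview model.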

In case anyone can offer more insights, feel free to peruse the data used in this experiment:

NARA page scan GPT4-vision-preview Tesseract 5.3.3 GPT4 with Tesseract
0028.jpg 0028-gpt4.txt 0028-tess.txt 0028-gpt4-with-tess.txt
0053.jpg 0053-gpt4.txt 0053-tess.txt 0053-gpt4-with-tess.txt
0129.jpg 0129-gpt4.txt 0129-tess.txt 0129-gpt4-with-tess.txt
0197.jpg 0197-gpt4.txt 0197-tess.txt 0197-gpt4-with-tess.txt
0200.jpg 0200-gpt4.txt 0200-tess.txt 0200-gpt4-with-tess.txt
0235.jpg 0235-gpt4.txt 0235-tess.txt 0235-gpt4-with-tess.txt
0268.jpg 0268-gpt4.txt 0268-tess.txt 0268-gpt4-with-tess.txt
0305.jpg 0305-gpt4.txt 0305-tess.txt 0305-gpt4-with-tess.txt
0360.jpg 0360-gpt4.txt 0360-tess.txt 0360-gpt4-with-tess.txt
0408.jpg 0408-gpt4.txt 0408-tess.txt 0408-gpt4-with-tess.txt
0413.jpg 0413-gpt4.txt 0413-tess.txt 0413-gpt4-with-tess.txt
0425.jpg 0425-gpt4.txt 0425-tess.txt 0425-gpt4-with-tess.txt
0499.jpg 0499-gpt4.txt 0499-tess.txt 0499-gpt4-with-tess.txt
0598.jpg 0598-gpt4.txt 0598-tess.txt 0598-gpt4-with-tess.txt
0606.jpg 0606-gpt4.txt 0606-tess.txt 0606-gpt4-with-tess.txt
0608.jpg 0608-gpt4.txt 0608-tess.txt 0608-gpt4-with-tess.txt
0612.jpg 0612-gpt4.txt 0612-tess.txt 0612-gpt4-with-tess.txt
0667.jpg 0667-gpt4.txt 0667-tess.txt 0667-gpt4-with-tess.txt
0730.jpg 0730-gpt4.txt 0730-tess.txt 0730-gpt4-with-tess.txt
0820.jpg 0820-gpt4.txt 0820-tess.txt 0820-gpt4-with-tess.txt
0836.jpg 0836-gpt4.txt 0836-tess.txt 0836-gpt4-with-tess.txt
0839.jpg 0839-gpt4.txt 0839-tess.txt 0839-gpt4-with-tess.txt
0894.jpg 0894-gpt4.txt 0894-tess.txt 0894-gpt4-with-tess.txt
0931.jpg 0931-gpt4.txt 0931-tess.txt 0931-gpt4-with-tess.txt
1042.jpg 1042-gpt4.txt 1042-tess.txt 1042-gpt4-with-tess.txt