Case Study: Historical Khmer OCR

Vision-First vs. Language-First OCR on 1950s Khmer Texts

A head-to-head experiment on a pre-standardization Khmer patriotic song shows a 20-fold difference in error count. The gap comes not from visual recognition but from how each system handles historical orthography.

Published: 12 June 2026

  • 1 error: NextOCR on the 8-line passage
  • 20 errors: traditional language-first OCR
  • 20× fewer errors with the vision-first approach

Experimental Context

To qualitatively demonstrate the advantage of the vision-first architecture, we conducted an experiment on the song «ចំរៀងយោធាយាត្រា» (Military March Song) — a Khmer patriotic song composed in the 1950s during Cambodia's independence era. The text consists of eight lines of verse across three stanzas.

Why this text matters for OCR evaluation:

  • Uses pre-standardization orthographic conventions that differ from modern Khmer spelling on several words.
  • Represents the typical situation for any Khmer document predating 1975 — the most common archival scenario.
  • Exposes a fundamental weakness of language-first systems: they apply modern lexicons to historical text, corrupting the original.

Source document used in the experiment:

[Figure: original manuscript scan of ចំរៀងយោធាយាត្រា, 1950s Khmer, pre-standardization orthography]

Results at a Glance

  • NextOCR (vision-first): 1 error in 8 lines
  • Traditional OCR (language-first): 20 errors in 8 lines

This 20-fold difference does not reflect a gap in visual recognition capability. It reflects the consequences of a language correction mechanism that imposes modern orthography onto historical text.

[Figure: NextOCR output on the 1950s manuscript. 1 error, historical spelling preserved.]
[Figure: Traditional OCR output on the 1950s manuscript. 20 errors, modern lexicon imposed.]

Historical Spelling Variants — Side-by-Side

Each row shows a word from the 1950s manuscript and how each system handles it. The language-first system "corrects" historical spellings into modern forms — or worse, into unrelated words entirely.

Original spelling (1950s) | Modern spelling | Language-first reading | NextOCR reading
រមណិយស្ឋាន | រមណីយស្ថាន | រមណីយ ស្ពាន | រមណិយស្ឋាន
ប្រទុសរ៉ាយ | ប្រទុសរាយ | ជ្រុះខុសអើយ | ប្រទុសរ៉ាយ
ស្មគ្រ | ស្ម័គ្រ | ស្មគូ | ស្មគ្រ
ភូមីរណ | ភូមិរណ | ផ្សភូមិវណ | ភូមីរណ

Note: Language-first errors include not just incorrect spellings but complete misreadings — e.g. ប្រទុសរ៉ាយ read as ជ្រុះខុសអើយ, a phonetically unrelated phrase.
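
The comparison in the table can be made mechanical. The following is a minimal sketch, not part of the original experiment: it labels each reading as "preserved" (identical to the manuscript spelling), "modernized" (identical to the modern spelling), or "misread" (neither). The function names `classify` and `tally` and the three labels are illustrative assumptions; the word forms are transcribed from the table.

```python
# Rows transcribed from the table above:
# (original 1950s spelling, modern spelling, language-first reading, NextOCR reading)
ROWS = [
    ("រមណិយស្ឋាន", "រមណីយស្ថាន", "រមណីយ ស្ពាន", "រមណិយស្ឋាន"),
    ("ប្រទុសរ៉ាយ", "ប្រទុសរាយ", "ជ្រុះខុសអើយ", "ប្រទុសរ៉ាយ"),
    ("ស្មគ្រ", "ស្ម័គ្រ", "ស្មគូ", "ស្មគ្រ"),
    ("ភូមីរណ", "ភូមិរណ", "ផ្សភូមិវណ", "ភូមីរណ"),
]


def classify(original: str, modern: str, reading: str) -> str:
    """Label an OCR reading relative to the manuscript and the modern form."""
    if reading == original:
        return "preserved"   # matches what is written on the page
    if reading == modern:
        return "modernized"  # replaced by the modern spelling
    return "misread"         # matches neither form


def tally(rows, column: int) -> dict:
    """Count labels for one system; column 2 = language-first, 3 = NextOCR."""
    counts = {"preserved": 0, "modernized": 0, "misread": 0}
    for row in rows:
        counts[classify(row[0], row[1], row[column])] += 1
    return counts


if __name__ == "__main__":
    print("language-first:", tally(ROWS, 2))
    print("NextOCR:       ", tally(ROWS, 3))
```

Exact string equality is a deliberately strict criterion here: a reading that is only partly modernized, such as the first row's language-first output, counts as a misread rather than as a modernization.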

Why Language-First Systems Fail on Historical Text

Modern Lexicon Bias

Language-first systems are trained on modern spelling norms. When they encounter pre-1975 orthography, the correction engine treats valid historical forms as errors and "fixes" them.
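
To illustrate the mechanism, here is a toy sketch of a lexicon-based correction step. It is an assumption made for illustration, not the implementation of any particular OCR product: the function `snap_to_lexicon`, the four-word lexicon, and the similarity cutoff are all hypothetical. It uses Python's standard `difflib.get_close_matches` to replace any out-of-lexicon word with its nearest lexicon entry.

```python
from difflib import get_close_matches

# Modern spellings taken from the comparison table; a real lexicon would be far larger.
MODERN_LEXICON = ["រមណីយស្ថាន", "ប្រទុសរាយ", "ស្ម័គ្រ", "ភូមិរណ"]

# A 1950s spelling from the same table, assumed to be what the recognizer read.
HISTORICAL_WORD = "រមណិយស្ឋាន"


def snap_to_lexicon(word: str, lexicon: list[str], cutoff: float = 0.6) -> str:
    """Hypothetical correction step: swap out-of-lexicon words for the closest entry."""
    if word in lexicon:
        return word
    matches = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    # If nothing is similar enough, keep the raw reading unchanged.
    return matches[0] if matches else word


if __name__ == "__main__":
    corrected = snap_to_lexicon(HISTORICAL_WORD, MODERN_LEXICON)
    print(HISTORICAL_WORD, "->", corrected)
    print("original spelling kept:", corrected == HISTORICAL_WORD)
```

Because the historical form differs from its modern counterpart by only two characters, a similarity-based lookup like this one treats it as a misspelling and returns the modern form. A lower `cutoff` makes the substitution more aggressive, so words with no close modern equivalent are also replaced.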

Hallucinated Words

When a historical word has no close modern equivalent in the lexicon, the system substitutes a phonetically or visually similar modern word — sometimes completely unrelated in meaning.

Irreversible Alteration

Once a historical document is "corrected" by a language-first system, the original orthographic evidence is destroyed. For archival and scholarly work, this is unacceptable.

Conclusion

On this eight-line historical passage, NextOCR produced 1 error versus 20 errors from the traditional language-first system. The difference is not about visual recognition — both systems see the same pixels.

The gap is caused entirely by the language correction layer. NextOCR's vision-first design preserves what is actually written in the document — the correct approach for historical texts, archival digitization, and any domain where original spelling must be faithfully captured.

Try NextOCR on Your Documents

Interested in seeing how NextOCR performs on your historical or domain-specific texts? Get in touch for a demo or integration discussion.

  • Email: danhhong@gmail.com
  • Phone: (+855) 95 333 409
  • Telegram: t.me/hout18