Vision-First vs. Language-First OCR
on 1950s Khmer Texts
A head-to-head experiment on a pre-standardization Khmer patriotic song reveals a 20-fold accuracy gap — not from visual recognition, but from how each system handles historical orthography.
Experimental Context
To qualitatively demonstrate the advantage of the vision-first architecture, we conducted an experiment on the song «ចំរៀងយោធាយាត្រា» (Military March Song) — a Khmer patriotic song composed in the 1950s during Cambodia's independence era. The text consists of eight lines of verse across three stanzas.
Why this text matters for OCR evaluation:
- Uses pre-standardization orthographic conventions that differ from modern Khmer spelling on several words.
- Represents the typical situation for any Khmer document predating 1975 — the most common archival scenario.
- Exposes a fundamental weakness of language-first systems: they apply modern lexicons to historical text, corrupting the original.
Source document used in the experiment:
Results at a Glance
This 20-fold difference does not reflect a gap in visual recognition capability. It reflects the consequences of a language correction mechanism that imposes modern orthography onto historical text.
Historical Spelling Variants — Side-by-Side
Each row shows a word from the 1950s manuscript and how each system handles it. The language-first system "corrects" historical spellings into modern forms — or worse, into unrelated words entirely.
| Original spelling (1950s) | Modern spelling | Language-first reading | NextOCR reading |
|---|---|---|---|
| រមណិយស្ឋាន | រមណីយស្ថាន | រមណីយ ស្ពាន ✗ | រមណិយស្ឋាន ✓ |
| ប្រទុសរ៉ាយ | ប្រទុសរាយ | ជ្រុះខុសអើយ ✗ | ប្រទុសរ៉ាយ ✓ |
| ស្មគ្រ | ស្ម័គ្រ | ស្មគូ ✗ | ស្មគ្រ ✓ |
| ភូមីរណ | ភូមិរណ | ផ្សភូមិវណ ✗ | ភូមីរណ ✓ |
Note: Language-first errors include not just incorrect spellings but complete misreadings — e.g. ប្រទុសរ៉ាយ read as ជ្រុះខុសអើយ, a phonetically unrelated phrase.
Why Language-First Systems Fail on Historical Text
Language-first systems are trained on modern spelling norms. When they encounter pre-1975 orthography, the correction engine treats valid historical forms as errors and "fixes" them.
When a historical word has no close modern equivalent in the lexicon, the system substitutes a phonetically or visually similar modern word — sometimes completely unrelated in meaning.
Once a historical document is "corrected" by a language-first system, the original orthographic evidence is destroyed. For archival and scholarly work, this is unacceptable.
Conclusion
On this eight-line historical passage, NextOCR produced 1 error versus 20 errors from the traditional language-first system. The difference is not about visual recognition — both systems see the same pixels.
The gap is caused entirely by the language correction layer. NextOCR's vision-first design preserves what is actually written in the document — the correct approach for historical texts, archival digitization, and any domain where original spelling must be faithfully captured.
Try NextOCR on Your Documents
Interested in seeing how NextOCR performs on your historical or domain-specific texts? Get in touch for a demo or integration discussion.
- Email: danhhong@gmail.com
- Phone: (+855) 95 333 409
- Telegram: t.me/hout18