The People Behind NextOCR's Data

Vision-first OCR is only as good as the ground truth it learns from. Our AI trainers combine linguistic scholarship, religious and historical expertise, and careful annotation to build a data foundation no generic crowd-labeling pipeline can match.

Domain experts in Khmer, Lao, Thai, Lao Tham, Chinese, and Vietnamese scripts — including specialists in post-Angkor manuscripts and Buddhist texts.

  • Expert-curated: Linguists & scholars, not anonymous crowdworkers
  • Manuscript-grade: Post-Angkor, palm-leaf, and historical text expertise
  • Multilingual: Khmer, Lao, Thai, Chinese, Vietnamese

Why Expert-Led Data Matters

Most OCR datasets are labeled by general-purpose crowdworkers with no background in the scripts or texts they're transcribing. This works for simple printed Latin text, but it fails for historical Khmer orthography, religious manuscripts, and low-resource scripts.

Generic crowd labeling (common)
  • No linguistic or historical background
  • Misreads rare characters & archaic spelling
  • Cannot judge manuscript or religious context
NextOCR AI Trainers
  • Ph.D. linguists in Khmer, Lao, Thai & Lao Tham scripts
  • Buddhist scholars for post-Angkor manuscripts
  • Native-level labelers across Khmer, Chinese & Vietnamese

This is the same "clean, curated data" philosophy behind NextOCR's vision-first architecture — applied to the people who build the training data itself.

Our AI Trainers

Lek Chumnor
Ph.D., Education

Holds a BA in Lao Language and Linguistics, an MA in Thai, and a Ph.D. in Education, and is an experienced calligrapher and type designer. Brings deep philological expertise to ground-truth quality control for Southeast Asian scripts with complex orthographic systems.

Khmer, Lao, Thai, Lao Tham

Vann Chansaren
Religious & Buddhist Text Expert

In charge of Khmer manuscripts and old texts from the post-Angkor period. Provides religious and historical context essential for accurately labeling palm-leaf and archival sources.

Khmer, Post-Angkor Manuscripts

E Tula
Data Labeler

Handles data labeling across Khmer, Chinese, and Vietnamese, supporting NextOCR's multilingual training roadmap with consistent, high-quality annotations.

Khmer, Chinese, Vietnamese

How Labeling Works

  • Source documents are reviewed for script, period, and content sensitivity
  • Expert trainers transcribe text exactly as it appears in the image
  • Religious, historical, and linguistic specialists validate ambiguous or archaic cases
  • Approved labels become part of NextOCR's curated training data
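
For readers who think in code, the four steps above can be sketched as a small staged workflow. This is only an illustration, not NextOCR's actual tooling: the names `Stage`, `LabelTask`, `next_stage`, and `run` are hypothetical, and the rule that only cases flagged as ambiguous or archaic pass through specialist validation is an assumption drawn from the wording of the third step.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Hypothetical stage names mirroring the four steps listed above."""
    INTAKE_REVIEW = 1          # script, period, and content sensitivity are reviewed
    TRANSCRIPTION = 2          # text is transcribed exactly as it appears in the image
    SPECIALIST_VALIDATION = 3  # specialists check ambiguous or archaic cases
    APPROVED = 4               # label joins the curated training data


@dataclass
class LabelTask:
    """One source image moving through the workflow (illustrative fields only)."""
    image_id: str
    script: str               # e.g. "Khmer"
    period: str               # e.g. "post-Angkor"
    ambiguous: bool = False   # set by the transcriber for ambiguous or archaic text
    stage: Stage = Stage.INTAKE_REVIEW


def next_stage(task: LabelTask) -> Stage:
    """Return the stage that follows the task's current stage."""
    if task.stage is Stage.INTAKE_REVIEW:
        return Stage.TRANSCRIPTION
    if task.stage is Stage.TRANSCRIPTION:
        # Assumption: only flagged cases need specialist validation.
        return Stage.SPECIALIST_VALIDATION if task.ambiguous else Stage.APPROVED
    if task.stage is Stage.SPECIALIST_VALIDATION:
        return Stage.APPROVED
    raise ValueError("task is already approved")


def run(task: LabelTask) -> list[Stage]:
    """Advance a task until it is approved and return the path it took."""
    path = [task.stage]
    while task.stage is not Stage.APPROVED:
        task.stage = next_stage(task)
        path.append(task.stage)
    return path


if __name__ == "__main__":
    archaic = LabelTask("img-001", "Khmer", "post-Angkor", ambiguous=True)
    print([s.name for s in run(archaic)])
```

In this sketch an unflagged task skips the validation stage entirely; a stricter pipeline could route every task through it at the cost of more specialist time.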

Contact

Interested in contributing as an AI trainer or labeler? Get in touch.

  • Email: danhhong@gmail.com
  • Phone: (+855) 95 333 409
  • Telegram: t.me/hout18