The People Behind NextOCR's Data

Vision-first OCR is only as good as the ground truth it learns from. Our AI trainers combine linguistic scholarship, religious and historical expertise, and careful annotation to build a data foundation that generic crowd-labeling pipelines are not set up to match.

Our domain experts cover Khmer, Lao, Thai, Lao Tham, Chinese, and Vietnamese scripts, and include specialists in post-Angkor manuscripts and Buddhist texts.

Expert-curated

Linguists & scholars, not anonymous crowdworkers

Manuscript-grade

Post-Angkor, palm-leaf, and historical text expertise

Multilingual

Khmer, Lao, Thai, Chinese, Vietnamese

Why Expert-Led Data Matters

Most OCR datasets are labeled by general-purpose crowdworkers with no background in the scripts or texts they are transcribing. That approach works for simple printed Latin text, but it fails for historical Khmer orthography, religious manuscripts, and low-resource scripts.

Generic crowd labeling (common)

No linguistic or historical background
Misreads rare characters & archaic spelling
Cannot judge manuscript or religious context

NextOCR AI Trainers

Ph.D. linguists in Khmer, Lao, Thai & Lao Tham scripts
Buddhist scholars for post-Angkor manuscripts
Native-level labelers across Khmer, Chinese & Vietnamese

This is the same "clean, curated data" philosophy behind NextOCR's vision-first architecture, applied to the people who build the training data itself.

Our AI Trainers

Lek Chumnor

Ph.D., Education

Holds a BA in Lao Language and Linguistics, an MA in Thai, and a Ph.D. in Education, and is an experienced calligrapher and type designer. Brings deep philological expertise to ground-truth quality control for Southeast Asian scripts with complex orthographic systems.

Khmer, Lao, Thai, Lao Tham

Vann Chansaren

Religious & Buddhist Text Expert

Responsible for Khmer manuscripts and old texts from the post-Angkor period. Provides the religious and historical context essential for accurately labeling palm-leaf and archival sources.

Khmer, Post-Angkor, Manuscripts

E Tula

Data Labeler

Handles data labeling across Khmer, Chinese, and Vietnamese, supporting NextOCR's multilingual training roadmap with consistent, high-quality annotations.

Khmer, Chinese, Vietnamese

How Labeling Works

Source documents are reviewed for script, period, and content sensitivity
Expert trainers transcribe text exactly as it appears in the image
Religious, historical, and linguistic specialists validate ambiguous or archaic cases
Approved labels become part of NextOCR's curated training data
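
The four steps above can be pictured as a small state machine. The sketch below is only an illustration under stated assumptions, not NextOCR's actual tooling: the names `Stage`, `LabelTask`, `next_stage`, and `advance` are hypothetical, and so is the rule that only tasks flagged as ambiguous or archaic are routed to specialist validation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """Hypothetical stage names mirroring the four steps described above."""
    SOURCE_REVIEW = "source_review"
    TRANSCRIPTION = "transcription"
    SPECIALIST_VALIDATION = "specialist_validation"
    APPROVED = "approved"


@dataclass
class LabelTask:
    """One source image moving through the labeling workflow (illustrative)."""
    image_id: str
    script: str
    period: str
    transcription: str = ""
    needs_specialist: bool = False  # assumed flag for ambiguous or archaic cases
    stage: Stage = Stage.SOURCE_REVIEW
    history: list = field(default_factory=list)


def next_stage(task: LabelTask) -> Stage:
    """Return the stage that follows the task's current stage."""
    if task.stage is Stage.SOURCE_REVIEW:
        return Stage.TRANSCRIPTION
    if task.stage is Stage.TRANSCRIPTION:
        if not task.transcription:
            raise ValueError("a transcription is required before moving on")
        # Assumption: only flagged cases go to a specialist; others are approved.
        if task.needs_specialist:
            return Stage.SPECIALIST_VALIDATION
        return Stage.APPROVED
    if task.stage is Stage.SPECIALIST_VALIDATION:
        return Stage.APPROVED
    raise ValueError("task is already approved")


def advance(task: LabelTask) -> LabelTask:
    """Move the task forward one stage and record the stage it left."""
    new_stage = next_stage(task)  # may raise; task is left unchanged if so
    task.history.append(task.stage)
    task.stage = new_stage
    return task


if __name__ == "__main__":
    task = LabelTask(image_id="leaf-001", script="Khmer", period="post-Angkor",
                     needs_specialist=True)
    advance(task)                       # source review is done
    task.transcription = "placeholder"  # text exactly as it appears in the image
    advance(task)                       # routed to specialist validation
    advance(task)                       # approved for the training set
    print(task.stage.name, [s.name for s in task.history])
```

Computing the next stage before touching the task means a rejected transition, such as a missing transcription, leaves the task and its history unchanged.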

Contact

Interested in contributing as an AI trainer or labeler? Get in touch.

Email: danhhong@gmail.com
Phone: (+855) 95 333 409
Telegram: t.me/hout18