A multilingual OCR benchmark covering a wide range of writing scripts, designed to evaluate OCR models across hundreds of languages.
The benchmark dataset is available at: cis-lmu/GlotOCR-bench
Per-model evaluation results are available at: cis-lmu/GlotOCR-bench-v1.0-results
All dataset-related code is in the dataset/ folder.
Download and organize Google Fonts by script by running:
```bash
python dataset/fonts/get_fonts.py
```

Alternatively, you can download the version we used directly from Hugging Face: kargaranamir/google_fonts
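For a rough idea of what "organize by script" means here, a minimal standard-library sketch of copying font files into per-script folders follows; the script-to-font mapping and folder names below are hypothetical illustrations, not the ones `get_fonts.py` actually uses:

```python
import shutil
from pathlib import Path

# Hypothetical script-to-font mapping; the real one is derived from
# Google Fonts metadata inside get_fonts.py.
SCRIPT_FONTS = {
    "Latin": ["Roboto-Regular.ttf"],
    "Arabic": ["Amiri-Regular.ttf"],
}

def organize_fonts(src_dir: Path, out_dir: Path) -> list[Path]:
    """Copy each downloaded font into a folder named after its script."""
    copied = []
    for script, fonts in SCRIPT_FONTS.items():
        script_dir = out_dir / script
        script_dir.mkdir(parents=True, exist_ok=True)
        for font in fonts:
            src = src_dir / font
            if src.exists():  # skip fonts that were not downloaded
                dest = script_dir / font
                shutil.copy(src, dest)
                copied.append(dest)
    return copied
```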
Place per-script sentence CSVs in dataset/seed/seed_data/. You can generate them from the GlotLID corpus by running:
```bash
python dataset/seed/get_seed.py
```

The GlotLID corpus is available at: cis-lmu/glotlid-corpus
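If you want to supply your own seed data instead, a standard-library sketch of writing a per-script sentence CSV is shown below; the single `sentence` column is an assumption on our part, so check `get_seed.py` for the actual schema:

```python
import csv
from pathlib import Path

def write_seed_csv(path: Path, sentences: list[str]) -> None:
    """Write one sentence per row under a single 'sentence' header.

    The column name is illustrative; the real layout is defined by
    dataset/seed/get_seed.py.
    """
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence"])
        for s in sentences:
            writer.writerow([s])
```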
We provide two rendering profiles (PLAIN and OLD_DOCUMENT). You can adjust parameters in dataset/ocr_generator/config.py and the rendering logic in dataset/ocr_generator/engine.py, then generate images by running:

```bash
python dataset/ocr_generator/main.py
```

To export the generated images to Parquet format:

```bash
python dataset/ocr_generator/export.py
```

Run the OCR models on the dataset using the scripts provided at: uv-scripts/ocr
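To illustrate how two rendering profiles can be parameterized, here is a dataclass-based sketch; the field names and values are hypothetical stand-ins, not the actual contents of dataset/ocr_generator/config.py:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RenderProfile:
    """Illustrative rendering parameters; the real fields live in
    dataset/ocr_generator/config.py."""
    name: str
    font_size: int
    background: str     # e.g. solid white vs. aged-paper texture
    noise_level: float  # 0.0 = clean render, higher = more degradation

# Two hypothetical presets mirroring the PLAIN / OLD_DOCUMENT split.
PLAIN = RenderProfile("PLAIN", font_size=24, background="white", noise_level=0.0)
OLD_DOCUMENT = RenderProfile("OLD_DOCUMENT", font_size=24, background="paper", noise_level=0.4)
```

Keeping each profile immutable (frozen=True) makes it safe to share one profile object across many rendering workers.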
Once model outputs are ready, compute CER, Acc@k, and ScriptAcc metrics by running:
```bash
cd evaluation/metrics
python main.py
```

Results are saved per model under evaluation/res_v1.0/, including per-script, per-language, and tier-level (high-, mid-, and low-resource) breakdowns.
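As background on the metrics, CER is conventionally the character-level edit distance between hypothesis and reference, normalized by reference length. A self-contained sketch of that convention (not the repo's exact implementation in evaluation/metrics):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    if not reference:
        return float(len(hypothesis) > 0)
    return levenshtein(reference, hypothesis) / len(reference)
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference.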
```bibtex
@misc{kargaran2026glotocrbench,
  title={GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts},
  author={Amir Hossein Kargaran and Nafiseh Nikeghbal and Jana Diesner and François Yvon and Hinrich Schütze},
  year={2026},
  eprint={2604.12978},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.12978},
}
```