cisnlp/GlotOCR-bench


GlotOCR-bench


A multilingual OCR benchmark covering a wide range of writing scripts, designed to evaluate OCR models across hundreds of languages.


Benchmark & Results

The benchmark dataset is available at: cis-lmu/GlotOCR-bench

Per-model evaluation results are available at: cis-lmu/GlotOCR-bench-v1.0-results


Dataset

All dataset-related code is in the dataset/ folder.

1. Fonts

Download and organize Google Fonts by script by running:

python dataset/fonts/get_fonts.py

Alternatively, you can download the version we used directly from Hugging Face: kargaranamir/google_fonts

2. Seed Text

Place per-script sentence CSVs in dataset/seed/seed_data/. You can generate them from the GlotLID corpus by running:

python dataset/seed/get_seed.py

The GlotLID corpus is available at: cis-lmu/glotlid-corpus
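If you prepare the seed files yourself rather than running get_seed.py, the grouping step might look roughly like the sketch below. It is only an illustration: the file naming (one CSV per script code), the `language` and `sentence` column names, and the helper `write_per_script_csvs` are assumptions, not the layout the repository's scripts necessarily expect, so check dataset/seed/get_seed.py for the actual schema.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_per_script_csvs(records, out_dir):
    """Group (script, language, sentence) records and write one CSV per script.

    Hypothetical helper: the column names and file naming are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Collect rows under their script code.
    by_script = defaultdict(list)
    for script, language, sentence in records:
        by_script[script].append((language, sentence))

    paths = []
    for script, rows in sorted(by_script.items()):
        path = out_dir / f"{script}.csv"
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["language", "sentence"])  # assumed header
            writer.writerows(rows)
        paths.append(path)
    return paths


# Toy records; real seed text would come from the GlotLID corpus.
records = [
    ("Latn", "eng", "The quick brown fox."),
    ("Latn", "deu", "Der schnelle braune Fuchs."),
    ("Cyrl", "rus", "Быстрая коричневая лиса."),
]

with tempfile.TemporaryDirectory() as tmp:
    written = write_per_script_csvs(records, Path(tmp) / "seed_data")
    names = [p.name for p in written]
    print(names)
```

The example writes to a temporary directory; pointing `out_dir` at dataset/seed/seed_data/ would place the files where the generator looks for them, provided the schema matches.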

3. Image Generation

We provide two rendering profiles (PLAIN and OLD_DOCUMENT). You can adjust parameters in dataset/ocr_generator/config.py and the rendering logic in dataset/ocr_generator/engine.py, then generate images by running:

python dataset/ocr_generator/main.py

To export the generated images to Parquet format:

python dataset/ocr_generator/export.py

Evaluation

1. Run OCR Models

Run the OCR models on the dataset using the scripts provided at: uv-scripts/ocr

2. Compute Metrics

Once model outputs are ready, compute the character error rate (CER), Acc@k, and ScriptAcc metrics by running:

cd evaluation/metrics
python main.py

Results are saved per model under evaluation/res_v1.0/, including per-script, per-language, and tier-level (high / mid / low resource) breakdowns.
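For orientation, the sketch below shows the textbook definition of CER (Levenshtein distance divided by reference length) and a simple per-group average of the kind a tier-level breakdown implies. It is not the code in evaluation/metrics: text normalization, the handling of empty references, and the averaging scheme there may differ, and the function names and toy rows are made up for illustration. Acc@k and ScriptAcc are not covered here because their exact definitions are not stated in this README.

```python
from collections import defaultdict


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference char
                            curr[j - 1] + 1,      # insert a hypothesis char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        # Assumed convention for empty references.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def mean_cer_by_group(rows):
    """Average CER per group; rows are (group, reference, hypothesis) tuples."""
    scores = defaultdict(list)
    for group, ref, hyp in rows:
        scores[group].append(cer(ref, hyp))
    return {g: sum(v) / len(v) for g, v in scores.items()}


# Toy rows grouped by resource tier (illustrative data only).
rows = [
    ("high", "hello", "hello"),
    ("high", "world", "w0rld"),
    ("low", "abcd", "abd"),
]
print(mean_cer_by_group(rows))
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference, so averaged scores are sensitive to a few runaway outputs.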


Citation

@misc{kargaran2026glotocrbench,
      title={GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts}, 
      author={Amir Hossein Kargaran and Nafiseh Nikeghbal and Jana Diesner and François Yvon and Hinrich Schütze},
      year={2026},
      eprint={2604.12978},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.12978}, 
}