A multilingual OCR benchmark covering a wide range of writing scripts, designed to evaluate OCR models across hundreds of languages.
The benchmark dataset is available at: cis-lmu/GlotOCR-bench
Per-model evaluation results are available at: cis-lmu/GlotOCR-bench-v1.0-results
All dataset-related code is in the dataset/ folder.
Download and organize Google Fonts by script by running:
```bash
python dataset/fonts/get_fonts.py
```

Alternatively, you can download the version we used directly from Hugging Face: kargaranamir/google_fonts
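For a rough idea of what "organize by script" means here, a minimal standard-library sketch of copying font files into per-script folders follows; the script-to-font mapping and folder names below are hypothetical illustrations, not the ones `get_fonts.py` actually uses:

```python
import shutil
from pathlib import Path

# Hypothetical script-to-font mapping; the real one is derived from
# Google Fonts metadata inside get_fonts.py.
SCRIPT_FONTS = {
    "Latin": ["Roboto-Regular.ttf"],
    "Arabic": ["Amiri-Regular.ttf"],
}

def organize_fonts(src_dir: Path, out_dir: Path) -> list[Path]:
    """Copy each downloaded font into a folder named after its script."""
    copied = []
    for script, fonts in SCRIPT_FONTS.items():
        script_dir = out_dir / script
        script_dir.mkdir(parents=True, exist_ok=True)
        for font in fonts:
            src = src_dir / font
            if src.exists():  # skip fonts that were not downloaded
                dest = script_dir / font
                shutil.copy(src, dest)
                copied.append(dest)
    return copied
```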
Place per-script sentence CSVs in dataset/seed/seed_data/. You can generate them from the GlotLID corpus by running:
```bash
python dataset/seed/get_seed.py
```

The GlotLID corpus is available at: cis-lmu/glotlid-corpus
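If you want to supply your own seed data instead, a standard-library sketch of writing a per-script sentence CSV is shown below; the single `sentence` column is an assumption on our part, so check `get_seed.py` for the actual schema:

```python
import csv
from pathlib import Path

def write_seed_csv(path: Path, sentences: list[str]) -> None:
    """Write one sentence per row under a single 'sentence' header.

    The column name is illustrative; the real layout is defined by
    dataset/seed/get_seed.py.
    """
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence"])
        for s in sentences:
            writer.writerow([s])
```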
We provide two rendering profiles (PLAIN and OLD_DOCUMENT). You can adjust parameters in dataset/ocr_generator/config.py and the rendering logic in dataset/ocr_generator/engine.py, then generate images by running:

```bash
python dataset/ocr_generator/main.py
```

To export the generated images to Parquet format:

```bash
python dataset/ocr_generator/export.py
```

Run the OCR models on the dataset using the scripts provided at: uv-scripts/ocr
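To illustrate how two rendering profiles can be parameterized, here is a dataclass-based sketch; the field names and values are hypothetical stand-ins, not the actual contents of dataset/ocr_generator/config.py:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RenderProfile:
    """Illustrative rendering parameters; the real fields live in
    dataset/ocr_generator/config.py."""
    name: str
    font_size: int
    background: str     # e.g. solid white vs. aged-paper texture
    noise_level: float  # 0.0 = clean render, higher = more degradation

# Two hypothetical presets mirroring the PLAIN / OLD_DOCUMENT split.
PLAIN = RenderProfile("PLAIN", font_size=24, background="white", noise_level=0.0)
OLD_DOCUMENT = RenderProfile("OLD_DOCUMENT", font_size=24, background="paper", noise_level=0.4)
```

Keeping each profile immutable (frozen=True) makes it safe to share one profile object across many rendering workers.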
Once model outputs are ready, compute CER, Acc@k, and ScriptAcc metrics by running:
```bash
cd evaluation/metrics
python main.py
```

Results are saved per model under evaluation/res_v1.0/, including per-script, per-language, and tier-level (high-, mid-, and low-resource) breakdowns.
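As background on the metrics, CER is conventionally the character-level edit distance between hypothesis and reference, normalized by reference length. A self-contained sketch of that convention (not the repo's exact implementation in evaluation/metrics):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    if not reference:
        return float(len(hypothesis) > 0)
    return levenshtein(reference, hypothesis) / len(reference)
```

Note that CER can exceed 1.0 when the hypothesis is much longer than the reference.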
```bibtex
@misc{kargaran2026glotocrbench,
  title={GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts},
  author={Amir Hossein Kargaran and Nafiseh Nikeghbal and Jana Diesner and François Yvon and Hinrich Schütze},
  year={2026},
  eprint={2604.12978},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.12978},
}
```