extract_text: Cyrillic text emitted as UTF-8 mojibake (pdfjs issue20232.pdf)

## Summary

On `pdfs_pdfjs/issue20232.pdf`, `extract_text` emits Cyrillic text as UTF-8 mojibake (`Ð½`, `Ðµ`, `Ð¼`, `Ð Ð°Ð·ÑÐ°Ð±`, `ÐÐ¸ÑÑ`) instead of proper Cyrillic characters. Both v0.3.23 and v0.3.25 produce the same output — this is a pre-existing bug, not a regression.

## Reproduction

```bash
./target/release/examples/extract_text_simple pdfs_pdfjs/issue20232.pdf
```

Current output (head of file):
```
ÐÐ¸Ñ. ÐÐ°ÑÑÐ° ÐÐ°ÑÑÑÐ°Ð±
ÐÐ·Ð¼. ÐÐ¸ÑÑ № Ð´Ð¾ÐºÑÐ¼. ÐÐ¾Ð´Ð¿. ÐÐ°ÑÐ°
Ð Ð°Ð·ÑÐ°Ð±.
ÐÑÐ¾Ð².
Ð¢. ÐºÐ¾Ð½ÑÑ.
ÐÐ¸ÑÑ ÐÐ¸ÑÑÐ¾Ð² 1
```

Expected: proper Cyrillic characters (this is a Russian engineering drawing title block — `Лист`, `Изм.`, `Разраб.`, `Пров.`, etc.).

## Root-cause hypothesis

The characters `Ð`, `Ñ`, `°`, `½`, `µ` are what the UTF-8 bytes of Cyrillic text look like when each byte is read as Latin-1 / WinAnsi. This typically happens in one of two ways:
1. The font has no `ToUnicode` CMap, and fallback decoding treats the raw byte stream as a single-byte encoding (WinAnsi) rather than recognizing the embedded Cyrillic glyphs.
2. The bytes are already UTF-8 but are being re-encoded as UTF-8 a second time.

Worth checking whether the font object has a `/Encoding` entry pointing at an Identity-H or Cyrillic encoding that's being ignored.
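
The byte-level pattern behind the hypothesis above can be checked outside the library with a short Python sketch. It assumes plain Latin-1 as the single-byte encoding (the actual fallback in `extract_text` may be WinAnsi or something else), and the helper names `to_mojibake`, `from_mojibake`, and `strip_c1` are hypothetical, introduced only for this illustration:

```python
def to_mojibake(text: str) -> str:
    """Encode text as UTF-8, then misread each byte as one Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")


def from_mojibake(garbled: str) -> str:
    """Reverse the misreading: map characters back to bytes, decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


def strip_c1(text: str) -> str:
    """Drop C1 control characters (U+0080..U+009F), which most viewers hide."""
    return "".join(ch for ch in text if not 0x80 <= ord(ch) <= 0x9F)


if __name__ == "__main__":
    for word in ("Лист", "Разраб."):
        garbled = to_mojibake(word)
        # ascii() escapes everything non-ASCII, so the hidden C1 controls show up.
        print(ascii(garbled), "->", ascii(strip_c1(garbled)))
        # The round trip only succeeds if the garbled text really is
        # UTF-8 bytes read as Latin-1.
        assert from_mojibake(garbled) == word
```

With the invisible control characters stripped, `Лист` comes out as `ÐÐ¸ÑÑ`, which matches the reported output. If `from_mojibake` round-trips the strings produced by `extract_text`, that would support the byte-misreading explanation over a glyph-mapping problem.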

## Reference corpus

- `pdfs_pdfjs/issue20232.pdf` (from the pdf.js reference test corpus)

## Tested versions

- 0.3.23: broken
- 0.3.25 (release/v0.3.25): broken (same output)
