Summary
On pdfs_pdfjs/issue20232.pdf, extract_text emits Cyrillic text as UTF-8 mojibake (Ð½, Ðµ, Ð¼, РазÑаб, ÐиÑÑ) instead of proper Cyrillic characters. v0.3.23 and v0.3.25 produce the same output, so this is a pre-existing bug, not a regression.
Reproduction
./target/release/examples/extract_text_simple pdfs_pdfjs/issue20232.pdf
Current output (head of file):
ÐиÑ. ÐаÑÑа ÐаÑÑÑаб
Ðзм. ÐиÑÑ № докÑм. Ðодп. ÐаÑа
РазÑаб.
ÐÑов.
Т. конÑÑ.
ÐиÑÑ ÐиÑÑов 1
Expected: proper Cyrillic characters (this is a Russian engineering drawing title block — Лист, Изм., Разраб., Пров., etc.).
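A regression test for this file could assert that the extracted text no longer contains byte-per-character UTF-8 sequences. The following is a heuristic sketch using only the Rust standard library; `looks_like_utf8_mojibake` is a hypothetical helper name, not part of the library's API, and it only covers two-byte UTF-8 sequences (the range Cyrillic falls in).

```rust
/// Heuristic: returns true if `text` contains a pair of characters whose
/// code points match a UTF-8 two-byte sequence (lead byte, continuation byte)
/// that was decoded one byte per character.
/// Assumption: legitimate text rarely contains such pairs, so false
/// positives are possible but uncommon.
fn looks_like_utf8_mojibake(text: &str) -> bool {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(2).any(|pair| {
        let lead = pair[0] as u32;
        let cont = pair[1] as u32;
        // 0xC2..=0xDF are the values of UTF-8 two-byte lead bytes;
        // 0x80..=0xBF are the values of UTF-8 continuation bytes.
        (0xC2..=0xDF).contains(&lead) && (0x80..=0xBF).contains(&cont)
    })
}
```

A test could run the extractor on pdfs_pdfjs/issue20232.pdf and assert that this function returns false for the result, in addition to checking for an expected word such as Лист.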
Root-cause hypothesis
The characters Ð, Ñ, °, ½, µ are what UTF-8-encoded Cyrillic bytes look like when decoded as Latin-1 / WinAnsi. For example, н is the bytes D0 BD in UTF-8, which read as Ð and ½ in Latin-1. This typically happens when either:
- The font has no ToUnicode CMap, and fallback decoding treats the raw byte stream as a single-byte encoding (WinAnsi) rather than recognizing the embedded Cyrillic glyphs, or
- The bytes are already UTF-8 but are being re-encoded as UTF-8 a second time.
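The mechanism described above can be reproduced in isolation. The following is a minimal sketch using only the Rust standard library, not code taken from the extractor: `decode_latin1` and `undo_mojibake` are hypothetical helper names, and Latin-1 stands in for WinAnsi here (the two differ for bytes 0x80 to 0x9F, which this sketch does not model).

```rust
/// Decode bytes as Latin-1 (ISO 8859-1): each byte becomes the code point
/// with the same numeric value. Applied to UTF-8 bytes, this yields mojibake.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Reverse the damage: map each character back to a single byte, then read
/// the bytes as UTF-8. Returns None if any character is above U+00FF or if
/// the resulting bytes are not valid UTF-8.
fn undo_mojibake(text: &str) -> Option<String> {
    let bytes: Option<Vec<u8>> = text
        .chars()
        .map(|c| {
            let cp = c as u32;
            if cp <= 0xFF { Some(cp as u8) } else { None }
        })
        .collect();
    String::from_utf8(bytes?).ok()
}
```

Both hypotheses end in the same string: decoding UTF-8 bytes as Latin-1 and then writing the result out as UTF-8 is exactly a second round of UTF-8 encoding. If `undo_mojibake` succeeds on the extractor's output and returns readable Cyrillic, that supports this family of causes, although it does not say which of the two code paths is responsible.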
Worth checking whether the font object has a /Encoding entry pointing at an Identity-H or Cyrillic encoding that's being ignored.
Reference corpus
pdfs_pdfjs/issue20232.pdf (from the pdf.js reference test corpus)
Tested versions
- 0.3.23: broken
- 0.3.25 (release/v0.3.25): broken (same output)