Summary
On multi-column academic textbooks with dense figures and tables, extract_text emits hundreds of tokens that look like the right letters in the wrong order — e.g. aaxons (axons), acclipmoshed (accomplished), accpamonies (accompanies), achrmaotic (achromatic), acNrouens (neurons), Actiatvion (Activation), actualyl (actually), aDostokysve (Dostoyevsky).
v0.3.25 (#316 rowspan reading-order fix) dramatically reduces the count — ~2,100 such fragments disappear from Principles of Neural Science alone. But residual transpositions remain, and the same class of bug surfaces on other dense textbooks in the corpus.
Reproduction
./target/release/examples/extract_text_simple \
"pdfs_slow4/Principles of Neural Science, Sixth Edition -- Eric R. Kandel (editor); Steven Siegelbaum (editor); Sarah -- ( WeLib.org ).pdf" \
| grep -oE '[A-Za-z]{6,}' | sort -u > /tmp/words.txt
# Real word counts are intact
grep -c '^axons$' /tmp/words.txt # 1
grep -c '^neurons$' /tmp/words.txt # 1
# But so are the garbled fragments
grep -cE '^(aaxons|acclipmoshed|accpamonies|achrmaotic)$' /tmp/words.txt # 4
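The grep above only finds fragments that are already known. To estimate the count more broadly, one option is to flag tokens that are not real words but are a letter-for-letter permutation of one. The following is a minimal sketch of that idea and is not part of the extractor. The function names and the in-memory word list are hypothetical, and a real run would need an external dictionary.

```rust
use std::collections::HashMap;

/// Case-insensitive letter signature: the word's letters, lowercased and sorted.
fn signature(word: &str) -> String {
    let mut letters: Vec<char> = word.to_lowercase().chars().collect();
    letters.sort_unstable();
    letters.into_iter().collect()
}

/// Returns (token, dictionary word) for every token that is not itself in the
/// dictionary but uses exactly the same letters as some dictionary word.
/// Hypothetical helper for triage; the dictionary is supplied by the caller.
fn transposition_candidates<'a>(
    tokens: &[&'a str],
    dictionary: &[&'a str],
) -> Vec<(&'a str, &'a str)> {
    let by_sig: HashMap<String, &str> =
        dictionary.iter().map(|w| (signature(w), *w)).collect();
    tokens
        .iter()
        .copied()
        .filter(|t| !dictionary.iter().any(|w| w.eq_ignore_ascii_case(t)))
        .filter_map(|t| by_sig.get(&signature(t)).map(|w| (t, *w)))
        .collect()
}

fn main() {
    let dictionary = ["axons", "neurons", "achromatic", "actually", "Activation"];
    let tokens = ["axons", "achrmaotic", "actualyl", "Actiatvion", "neurons"];
    for (token, word) in transposition_candidates(&tokens, &dictionary) {
        println!("{token} looks like a scrambled {word}");
    }
}
```

This only catches pure permutations. A fragment such as aaxons, which carries an extra letter, would need a separate check (for example, an edit-distance comparison).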
Root-cause hypothesis
These look like neighboring glyph objects being emitted in writing order (content-stream order) rather than visual order within a single Tj string or TJ array, probably caused by:
- Per-glyph positioning where the content stream walks glyphs in a non-linear order (e.g. kerning-optimized output from a typesetter), and the reading-order pass groups them by y-band before fixing left-to-right sort within the band.
- Overlapping figure callouts / annotations whose glyph positions are within the y-band tolerance of body text.
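The second bullet can be illustrated with a toy model. This is a sketch under stated assumptions, not the extractor's actual code: the `Glyph` struct, `collect_band`, and the coordinates are all hypothetical. It shows how a stray callout glyph sitting 1.5 units off the baseline is absorbed into the body line when the y-band tolerance is loose, and excluded when it is tight. Whether aaxons actually arises this way has not been confirmed.

```rust
/// A positioned glyph. Hypothetical type for illustration only.
#[derive(Clone, Copy)]
struct Glyph {
    ch: char,
    x: f32,
    y: f32,
}

/// Collects every glyph within `y_tolerance` of `baseline_y`, then orders
/// the band left to right by x.
fn collect_band(glyphs: &[Glyph], baseline_y: f32, y_tolerance: f32) -> String {
    let mut band: Vec<Glyph> = glyphs
        .iter()
        .copied()
        .filter(|g| (g.y - baseline_y).abs() <= y_tolerance)
        .collect();
    band.sort_by(|a, b| a.x.total_cmp(&b.x));
    band.iter().map(|g| g.ch).collect()
}

/// Body text "axons" on baseline y = 100, plus a figure callout label 'a'
/// just to the left and 1.5 units above the baseline (made-up coordinates).
fn sample_line() -> Vec<Glyph> {
    let mut glyphs: Vec<Glyph> = "axons"
        .chars()
        .enumerate()
        .map(|(i, ch)| Glyph { ch, x: 10.0 + 5.0 * i as f32, y: 100.0 })
        .collect();
    glyphs.push(Glyph { ch: 'a', x: 4.0, y: 101.5 });
    glyphs
}

fn main() {
    let glyphs = sample_line();
    println!("loose band: {}", collect_band(&glyphs, 100.0, 2.0));
    println!("tight band: {}", collect_band(&glyphs, 100.0, 0.5));
}
```

With the loose tolerance the callout letter lands in front of the word. With the tight tolerance the body text comes out clean.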
Observation: pairs of adjacent letters consistently end up swapped (aa / tv / yl) rather than long runs being scrambled, which suggests a sort comparator that is unstable or that compares x-coordinates with too loose a tolerance.
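The tolerance theory can be reproduced in a few lines. This is a minimal sketch, assuming the within-band sort treats nearby x values as equal. It models that by quantizing x into `tolerance`-wide buckets, which is a stand-in for however the real comparator works. `Glyph`, `order_band`, and the coordinates are hypothetical.

```rust
/// A positioned glyph. Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug)]
struct Glyph {
    ch: char,
    x: f32,
}

/// Orders one y-band left to right, treating x values that fall into the
/// same `tolerance`-wide bucket as equal. The sort is stable, so glyphs that
/// compare equal keep their content-stream order.
fn order_band(glyphs: &[Glyph], tolerance: f32) -> String {
    let mut sorted = glyphs.to_vec();
    sorted.sort_by_key(|g| (g.x / tolerance).floor() as i64);
    sorted.iter().map(|g| g.ch).collect()
}

/// "actually" set with a 5-unit advance, where the content stream emits the
/// 'y' (x = 35) before the final 'l' (x = 30). Made-up coordinates.
fn sample_band() -> Vec<Glyph> {
    [
        ('a', 0.0), ('c', 5.0), ('t', 10.0), ('u', 15.0),
        ('a', 20.0), ('l', 25.0), ('y', 35.0), ('l', 30.0),
    ]
    .iter()
    .map(|&(ch, x)| Glyph { ch, x })
    .collect()
}

fn main() {
    let band = sample_band();
    println!("tolerance 10: {}", order_band(&band, 10.0));
    println!("tolerance  1: {}", order_band(&band, 1.0));
}
```

When the tolerance is wider than one glyph advance, the out-of-order pair falls into the same bucket and the stream order leaks through as a two-letter swap. A tolerance narrower than the advance restores visual order. That matches the pattern of swapped pairs rather than scrambled runs, although it does not prove this is the mechanism in the extractor.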
Reference corpus
- pdfs_slow4/Principles of Neural Science (major reduction but residual in v0.3.25)
- pdfs_slow3/The Advantage (Lencioni): similar glyph-order fragments, not measurably improved by v0.3.25
- Kevin Murphy, Machine Learning: A Probabilistic Perspective; a different manifestation (concatenated unspaced words like groupregularization, equationOptimizing)
Tested versions
- 0.3.23: ~2,100 fragments on Principles of Neural Science
- 0.3.25 (release/v0.3.25): a few dozen residual fragments on Principles of Neural Science; pdfs_slow3 and the Murphy textbook largely unchanged
Priority
Medium — degrades downstream NLP / search / LLM context quality but does not lose content.