extract_text: glyph-order transposition fragments on multi-column textbooks #319

@yfedoseev

Summary

On multi-column academic textbooks with dense figures and tables, extract_text emits hundreds of tokens that look like the right letters in the wrong order — e.g. aaxons (axons), acclipmoshed (accomplished), accpamonies (accompanies), achrmaotic (achromatic), acNrouens (neurons), Actiatvion (Activation), actualyl (actually), aDostokysve (Dostoyevsky).

v0.3.25 (#316 rowspan reading-order fix) dramatically reduces the count — ~2,100 such fragments disappear from Principles of Neural Science alone. But residual transpositions remain, and the same class of bug surfaces on other dense textbooks in the corpus.

Reproduction

./target/release/examples/extract_text_simple \
  "pdfs_slow4/Principles of Neural Science, Sixth Edition -- Eric R. Kandel (editor); Steven Siegelbaum (editor); Sarah -- ( WeLib.org ).pdf" \
  | grep -oE '[A-Za-z]{6,}' | sort -u > /tmp/words.txt

# The correctly spelled words are still present
grep -c '^axons$' /tmp/words.txt          # 1
grep -c '^neurons$' /tmp/words.txt        # 1
# But the garbled fragments are present as well
grep -cE '^(aaxons|acclipmoshed|accpamonies|achrmaotic)$' /tmp/words.txt   # 4
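
To count fragments beyond the handful grepped above, one option is to flag extracted words that are not in a reference word list but share a sorted-letter signature with a word that is. The following is only a sketch: `find_transpositions` and the tiny `reference` list are hypothetical, and the check catches pure transpositions such as acclipmoshed but not fragments with an extra letter such as aaxons.

```python
from collections import defaultdict


def signature(word):
    """Sorted-letter signature: equal for any two anagrams."""
    return "".join(sorted(word.lower()))


def find_transpositions(extracted, reference):
    """Map each non-reference word to the reference words it is an anagram of."""
    ref = {w.lower() for w in reference}
    by_sig = defaultdict(set)
    for w in ref:
        by_sig[signature(w)].add(w)
    hits = {}
    for w in extracted:
        lw = w.lower()
        if lw in ref:
            continue  # correctly spelled, nothing to report
        candidates = by_sig.get(signature(lw))
        if candidates:
            hits[w] = sorted(candidates)
    return hits


# Hypothetical inputs; in practice `extracted` would be read from /tmp/words.txt
reference = ["accomplished", "actually", "axons", "neurons"]
extracted = ["axons", "acclipmoshed", "actualyl", "aaxons"]
print(find_transpositions(extracted, reference))
```

Running this against a real dictionary would give a rough per-document fragment count that can be compared across versions.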

Root-cause hypothesis

These look like neighboring glyphs being emitted in content-stream (writing) order rather than visual order within a single Tj string or TJ array, probably caused by:

  • Per-glyph positioning where the content stream walks glyphs in a non-linear order (e.g. kerning-optimized output from a typesetter), and the reading-order pass groups them by y-band before fixing left-to-right sort within the band.
  • Overlapping figure callouts / annotations whose glyph positions are within the y-band tolerance of body text.
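
The ordering described in the first bullet can be sketched as follows. This is an illustration of the general approach (band by y, then sort strictly by x), not the library's actual code; `Glyph`, `order_glyphs`, and the `y_tol` default are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    x: float
    y: float


def order_glyphs(glyphs, y_tol=2.0):
    """Group glyphs into y-bands (top of page first), then sort each band by x."""
    bands = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        # A glyph joins the current band if it is within y_tol of the band's first glyph
        if bands and abs(bands[-1][0].y - g.y) <= y_tol:
            bands[-1].append(g)
        else:
            bands.append([g])
    return ["".join(g.char for g in sorted(band, key=lambda g: g.x)) for band in bands]


# Content-stream order deliberately differs from visual order: "y" arrives before the final "l"
stream = [
    Glyph("a", 0, 100), Glyph("c", 5, 100), Glyph("t", 10, 100), Glyph("u", 15, 100),
    Glyph("a", 20, 100), Glyph("l", 25, 100), Glyph("y", 35, 100), Glyph("l", 30, 100),
]
print(order_glyphs(stream))  # ['actually']
```

If the x-sort inside each band is strict, stream order should not leak into the output. The second bullet corresponds to a `y_tol` wide enough that callout glyphs fall into a body-text band.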

Observation: pairs of adjacent letters consistently end up swapped (aa / tv / yl) rather than long runs being scrambled. This suggests a sort comparator that is unstable, or one that compares x-coordinates with too loose a tolerance.
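
A minimal sketch of how a too-loose x tolerance could produce exactly this kind of pairwise swap. The comparator `loose_cmp` and the coordinates are hypothetical, chosen only to illustrate the hypothesis; they are not taken from the library. When two glyphs are closer than the tolerance, the comparator reports them as equal, so a stable sort keeps them in content-stream order.

```python
from functools import cmp_to_key


def loose_cmp(tol):
    """Comparator that treats x-coordinates within `tol` of each other as equal."""
    def cmp(a, b):
        dx = a[1] - b[1]
        if abs(dx) <= tol:
            return 0
        return -1 if dx < 0 else 1
    return cmp


def join_sorted(glyphs, key):
    return "".join(ch for ch, _ in sorted(glyphs, key=key))


# (char, x) in content-stream order; the narrow final "l" sits 0.6 units left of "y"
glyphs = [("a", 0.0), ("c", 4.0), ("t", 8.0), ("u", 12.0),
          ("a", 16.0), ("l", 20.0), ("y", 22.6), ("l", 22.0)]

print(join_sorted(glyphs, cmp_to_key(loose_cmp(1.0))))  # actualyl
print(join_sorted(glyphs, lambda g: g[1]))              # actually

# The loose comparator is also not transitive: a == b and b == c, yet a < c
cmp = loose_cmp(1.0)
a, b, c = ("a", 0.0), ("b", 0.6), ("c", 1.2)
print(cmp(a, b), cmp(b, c), cmp(a, c))  # 0 0 -1
```

Because such a comparator is not a valid ordering, the result depends on input order, which would be consistent with isolated adjacent swaps rather than long scrambled runs.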

Reference corpus

  • pdfs_slow4/Principles of Neural Science (major reduction but residual in v0.3.25)
  • pdfs_slow3/The Advantage (Lencioni) — similar glyph-order fragments, not measurably improved by v0.3.25
  • Kevin Murphy Machine Learning: A Probabilistic Perspective — different manifestation (concatenated unspaced words like groupregularization, equationOptimizing)

Tested versions

  • 0.3.23: ~2,100 fragments on Principles of Neural Science
  • 0.3.25 (release/v0.3.25): a few dozen residual fragments on Principles of Neural Science; pdfs_slow3 and the Murphy textbook largely unchanged

Priority

Medium — degrades downstream NLP / search / LLM context quality but does not lose content.
