extract_text: glyph-order transposition fragments on multi-column textbooks

## Summary

On multi-column academic textbooks with dense figures and tables, `extract_text` emits hundreds of tokens that look like the right letters in the wrong order, in some cases with a stray or missing letter, e.g. `aaxons` (axons), `acclipmoshed` (accomplished), `accpamonies` (accompanies), `achrmaotic` (achromatic), `acNrouens` (neurons), `Actiatvion` (Activation), `actualyl` (actually), `aDostokysve` (Dostoyevsky).

v0.3.25 (#316 rowspan reading-order fix) dramatically reduces the count — ~2,100 such fragments disappear from *Principles of Neural Science* alone. But residual transpositions remain, and the same class of bug surfaces on other dense textbooks in the corpus.

## Reproduction

```bash
./target/release/examples/extract_text_simple \
  "pdfs_slow4/Principles of Neural Science, Sixth Edition -- Eric R. Kandel (editor); Steven Siegelbaum (editor); Sarah -- ( WeLib.org ).pdf" \
  | grep -oE '[A-Za-z]{5,}' | sort -u > /tmp/words.txt

# /tmp/words.txt holds unique runs of five or more letters, so each count is 0 or 1 per word.
# The real words are present
grep -c '^axons$' /tmp/words.txt          # 1
grep -c '^neurons$' /tmp/words.txt        # 1
# But the garbled fragments are present too
grep -cE '^(aaxons|acclipmoshed|accpamonies|achrmaotic)$' /tmp/words.txt   # 4
```
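
Grepping for known examples does not scale to a whole corpus. One possible way to count candidates automatically is to flag tokens that are not real words themselves but whose letters are a permutation of a real word. The sketch below assumes an external dictionary word list, which this issue does not provide. `signature` and `transposition_candidates` are hypothetical helper names, not part of the library, and this is not necessarily how the fragment counts reported here were obtained.

```rust
use std::collections::{HashMap, HashSet};

/// Order-insensitive signature: the word's lowercase letters, sorted.
fn signature(word: &str) -> String {
    let mut letters: Vec<char> = word.to_lowercase().chars().collect();
    letters.sort_unstable();
    letters.into_iter().collect()
}

/// Returns (fragment, real_word) pairs for every token in `words` that is
/// absent from `dictionary` but is a letter permutation of a dictionary word.
/// `dictionary` is an assumed external word list (hypothetical input).
fn transposition_candidates(words: &[&str], dictionary: &[&str]) -> Vec<(String, String)> {
    let known: HashSet<String> = dictionary.iter().map(|w| w.to_lowercase()).collect();

    // Map each signature to the first dictionary word that produces it.
    let mut by_signature: HashMap<String, String> = HashMap::new();
    for w in dictionary {
        let lower = w.to_lowercase();
        by_signature.entry(signature(&lower)).or_insert(lower);
    }

    let mut candidates = Vec::new();
    for w in words {
        let lower = w.to_lowercase();
        if known.contains(&lower) {
            continue; // a real word, not a fragment
        }
        if let Some(real) = by_signature.get(&signature(&lower)) {
            candidates.push((w.to_string(), real.clone()));
        }
    }
    candidates
}
```

Fragments that gained or lost a letter, such as `aaxons`, are not exact permutations and would be missed by this check. Conversely, a genuine word that is missing from the dictionary but happens to be an anagram of a listed word would be a false positive.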

## Root-cause hypothesis

These look like neighboring glyphs being emitted in content-stream order rather than visual order within a single `Tj` string or `TJ` array, probably caused by:

- Per-glyph positioning where the content stream walks glyphs in a non-linear order (e.g. kerning-optimized output from a typesetter), and the reading-order pass groups them by y-band before fixing left-to-right sort within the band.
- Overlapping figure callouts / annotations whose glyph positions are within the y-band tolerance of body text.

Observation: pairs of letters consistently end up swapped (`aa` / `tv` / `yl`) rather than long runs being scrambled, which suggests a sort comparator that is unstable or that compares x-coordinates with too loose a tolerance.
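
To illustrate the comparator hypothesis: a comparator that treats two glyphs as being on the same line whenever their baselines differ by less than a tolerance is not transitive, so the resulting order can depend on the sort algorithm and on the input order. The sketch below is not the library's code. `Glyph`, `fuzzy_cmp`, `reading_order`, and `text` are hypothetical names, and the coordinates are made up. It shows three glyphs for which the tolerance comparator forms a cycle, and one possible alternative that first assigns each glyph a line index and then sorts by `(line, x)`, which is a total order.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Glyph {
    ch: char,
    x: f64,
    y: f64, // baseline; PDF y grows upward, so larger y reads first
}

/// Tolerance-based comparator of the kind the hypothesis suspects:
/// baselines within `tol` count as the same line and are ordered by x.
fn fuzzy_cmp(a: &Glyph, b: &Glyph, tol: f64) -> Ordering {
    if (a.y - b.y).abs() > tol {
        b.y.total_cmp(&a.y) // different lines: higher glyph first
    } else {
        a.x.total_cmp(&b.x) // same line: left to right
    }
}

/// Three glyphs for which `fuzzy_cmp` is inconsistent:
/// A > B and B > C, yet A < C.
fn cycle_example() -> [Ordering; 3] {
    let a = Glyph { ch: 'A', x: 30.0, y: 10.0 };
    let b = Glyph { ch: 'B', x: 20.0, y: 8.5 };
    let c = Glyph { ch: 'C', x: 10.0, y: 7.0 };
    [fuzzy_cmp(&a, &b, 2.0), fuzzy_cmp(&b, &c, 2.0), fuzzy_cmp(&a, &c, 2.0)]
}

/// Alternative sketch: assign line indices first, then sort by (line, x).
/// A new line starts when the gap to the previous baseline exceeds `tol`.
fn reading_order(glyphs: &[Glyph], tol: f64) -> Vec<Glyph> {
    let mut by_height = glyphs.to_vec();
    by_height.sort_by(|a, b| b.y.total_cmp(&a.y)); // top to bottom

    let mut keyed: Vec<(usize, Glyph)> = Vec::with_capacity(by_height.len());
    let mut line = 0usize;
    let mut prev_y: Option<f64> = None;
    for g in by_height {
        if let Some(py) = prev_y {
            if (py - g.y).abs() > tol {
                line += 1;
            }
        }
        prev_y = Some(g.y);
        keyed.push((line, g));
    }

    // Integer line index plus total_cmp on x gives a consistent total order.
    keyed.sort_by(|a, b| a.0.cmp(&b.0).then(a.1.x.total_cmp(&b.1.x)));
    keyed.into_iter().map(|(_, g)| g).collect()
}

/// Concatenates glyph characters in slice order.
fn text(glyphs: &[Glyph]) -> String {
    glyphs.iter().map(|g| g.ch).collect()
}
```

Chaining on the previous baseline is a design choice in this sketch: glyphs that drift gradually are merged into one line, which may or may not suit figure callouts that sit close to body text. Rust's documentation for slice sorting also notes that the resulting order is unspecified, and the call may panic, when the comparator is not a total order, so a cycle like the one in `cycle_example` is worth ruling out regardless of the root cause.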

## Reference corpus

- `pdfs_slow4/Principles of Neural Science` (major reduction in v0.3.25, but residual fragments remain)
- `pdfs_slow3/The Advantage (Lencioni)` — similar glyph-order fragments, not measurably improved by v0.3.25
- Kevin Murphy *Machine Learning: A Probabilistic Perspective* — different manifestation (concatenated unspaced words like `groupregularization`, `equationOptimizing`)

## Tested versions

- 0.3.23: ~2,100 fragments on Principles of Neural Science
- 0.3.25 (release/v0.3.25): a few dozen residual fragments on Principles of Neural Science; `pdfs_slow3` and Murphy textbook largely unchanged

## Priority

Medium — degrades downstream NLP / search / LLM context quality but does not lose content.

extract_text: glyph-order transposition fragments on multi-column textbooks #319