fix: update CID warning hybrid guidance#406
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughUpdated a warning log emitted when a page's replacement-character ratio indicates missing ToUnicode mappings. The message now recommends enabling hybrid OCR fallback using Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Thanks — I checked this more closely. The CodeRabbit “Docstring Coverage” item appears to be a generic repository-level warning in the comment summary, not a failing GitHub status for this PR. The actual PR checks are currently green ( Since this change does not introduce new public APIs/functions, I did not add unrelated docstrings just to satisfy a repo-wide coverage warning that is outside the scope of issue #404. If maintainers want, I can still add a separate follow-up PR for broader docstring coverage, but I’d prefer to keep this bug-fix PR focused on the warning guidance change. |
4de0bc3 to
1284f73
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
bundolee
left a comment
There was a problem hiding this comment.
Good catch — the old warning pointed users to --hybrid-mode, which is just a mode selector and doesn't actually enable the hybrid backend. This fix gives them the correct, actionable flag.
Clean change, focused test update. Thanks!
One thought for a future issue: recommending hybrid OCR as the go-to fix for CID font mapping problems may not always be the best advice — OCR can actually be worse than incomplete text extraction on vector-text PDFs. But that's a separate discussion about the diagnostic message itself, not this PR's scope.
Summary
Update the CID-font replacement-character warning to recommend the current hybrid OCR command instead of the obsolete
--hybrid-modeflag.Reproduction
ContentFilterProcessorwarns when a page is mostly replacement characters (U+FFFD), but the warning text still pointed users to--hybrid-mode for OCR fallback.Patch Summary
--hybrid docling-fastCidFontDetectionTestso it asserts the new guidance and rejects the stale wordingRisks
Low. This is a message-only correction plus a focused test update.
Validation
cd java && mvn -pl opendataloader-pdf-core -Dtest=CidFontDetectionTest testResolves #404
Summary by CodeRabbit
Documentation
Tests