fix: update CID warning hybrid guidance #406

Merged
bundolee merged 1 commit into opendataloader-project:main from LocNguyenSGU:fix/issue-404-hybrid-warning-guidance
Apr 15, 2026

Conversation

Contributor

@LocNguyenSGU LocNguyenSGU commented Apr 10, 2026

Summary

Update the CID-font replacement-character warning to recommend the current hybrid OCR command instead of the obsolete --hybrid-mode flag.

Reproduction

ContentFilterProcessor warns when a page is mostly replacement characters (U+FFFD), but the warning text still pointed users to --hybrid-mode for OCR fallback.
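
The check described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name, the `THRESHOLD` value, and the message wording are all assumptions, and only the U+FFFD ratio idea and the `--hybrid docling-fast` recommendation come from this PR.

```java
import java.util.Locale;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ReplacementRatioSketch {
    private static final Logger LOGGER =
            Logger.getLogger(ReplacementRatioSketch.class.getName());

    // Hypothetical threshold; the value used by ContentFilterProcessor is not shown in this PR.
    public static final double THRESHOLD = 0.5;

    /** Fraction of code points in the text that are U+FFFD; 0.0 for null or empty text. */
    public static double replacementRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        long total = text.codePoints().count();
        long replaced = text.codePoints().filter(cp -> cp == 0xFFFD).count();
        return (double) replaced / total;
    }

    /** Returns an illustrative warning text, or null when the page is below the threshold. */
    public static String warningFor(int pageNumber, String pageText) {
        double ratio = replacementRatio(pageText);
        if (ratio < THRESHOLD) {
            return null;
        }
        // Wording is invented for this sketch; only the flag is taken from the PR.
        return String.format(Locale.ROOT,
                "Page %d: %.0f%% of characters are replacement characters (U+FFFD); "
                        + "consider --hybrid docling-fast for OCR fallback.",
                pageNumber, ratio * 100);
    }

    public static void main(String[] args) {
        // Three of the four characters are U+FFFD, so the page exceeds the threshold.
        String message = warningFor(1, "\uFFFD\uFFFD\uFFFDa");
        if (message != null) {
            LOGGER.log(Level.WARNING, message);
        }
    }
}
```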

Patch Summary

  • change the warning text to recommend --hybrid docling-fast
  • tighten CidFontDetectionTest so it asserts the new guidance and rejects the stale wording
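
The tightened assertion could look something like the sketch below. It assumes the warning goes through `java.util.logging`; the class, the logger name, and the sample messages are hypothetical, and the real CidFontDetectionTest may capture logs differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class WarningGuidanceCheckSketch {

    /** Collects log records in memory so a test can inspect them. */
    static final class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override public void publish(LogRecord record) { records.add(record); }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /** True if some WARNING record carries the new guidance and not the stale wording. */
    public static boolean hasCurrentGuidance(List<LogRecord> records) {
        return records.stream()
                .filter(r -> Level.WARNING.equals(r.getLevel()))
                .map(LogRecord::getMessage)
                .anyMatch(m -> m != null
                        && m.contains("--hybrid docling-fast")
                        && !m.contains("--hybrid-mode"));
    }

    /** Logs one WARNING through a throwaway logger and returns what was captured. */
    public static List<LogRecord> capture(String message) {
        Logger logger = Logger.getLogger("sketch.cid-warning"); // hypothetical logger name
        CapturingHandler handler = new CapturingHandler();
        logger.setUseParentHandlers(false); // keep the sketch's output off the console
        logger.addHandler(handler);
        try {
            logger.warning(message);
        } finally {
            logger.removeHandler(handler);
        }
        return handler.records;
    }

    public static void main(String[] args) {
        // Sample messages are invented; only the two flags come from the PR.
        System.out.println(hasCurrentGuidance(
                capture("Use --hybrid docling-fast for OCR fallback")));
        System.out.println(hasCurrentGuidance(
                capture("Use --hybrid-mode for OCR fallback")));
    }
}
```

Checking for the absence of the stale substring as well as the presence of the new one keeps the test from passing if both wordings ever end up in the same message.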

Risks

Low. This is a message-only correction plus a focused test update.

Validation

  • cd java && mvn -pl opendataloader-pdf-core -Dtest=CidFontDetectionTest test

Resolves #404

Summary by CodeRabbit

  • Documentation

    • Updated warning message to recommend the correct command-line flag for enabling hybrid OCR fallback when replacement character issues are detected.
  • Tests

    • Updated test to validate the improved warning message guidance.

Contributor

coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c8f88e72-0b63-4cbf-beb4-18e40657e092

📥 Commits

Reviewing files that changed from the base of the PR and between 4de0bc3 and 1284f73.

📒 Files selected for processing (2)
  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ContentFilterProcessor.java
  • java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/CidFontDetectionTest.java

Walkthrough

Updated a warning log emitted when a page's replacement-character ratio indicates missing ToUnicode mappings. The message now recommends enabling hybrid OCR fallback using --hybrid docling-fast (instead of referencing --hybrid-mode). The unit test assertion was adjusted to require the new guidance.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Warning Message Update: java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ContentFilterProcessor.java | Changed the WARNING log text for pages with high replacement-character ratios to recommend --hybrid docling-fast rather than --hybrid-mode for OCR fallback. |
| Test Assertion Update: java/opendataloader-pdf-core/src/test/java/org/opendataloader/pdf/processors/CidFontDetectionTest.java | Tightened testCidPdfWarningLogEmitted to assert that at least one WARNING log contains --hybrid docling-fast and does not include the old "--hybrid-mode for OCR fallback" substring; updated the failure message accordingly. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

  • MaximPlusov
  • hyunhee-jo
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: updating the CID warning guidance from --hybrid-mode to --hybrid docling-fast, which is the primary objective of the PR. |
| Linked Issues check | ✅ Passed | The PR directly addresses all coding requirements from issue #404: updating the warning text in ContentFilterProcessor to recommend --hybrid docling-fast instead of --hybrid-mode, and tightening the test to validate the new guidance. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to issue #404: the warning message update and focused test assertion changes are within scope, with no unrelated modifications introduced. |


CLAassistant commented Apr 10, 2026

CLA assistant check
All committers have signed the CLA.

Contributor Author

Thanks — I checked this more closely.

The CodeRabbit “Docstring Coverage” item appears to be a generic repository-level warning in the comment summary, not a failing GitHub status for this PR. The actual PR checks are currently green (CodeRabbit, license/cla), and this patch only updates an existing warning string plus its focused test assertion.

Since this change does not introduce new public APIs/functions, I did not add unrelated docstrings just to satisfy a repo-wide coverage warning that is outside the scope of issue #404.

If maintainers want, I can still add a separate follow-up PR for broader docstring coverage, but I’d prefer to keep this bug-fix PR focused on the warning guidance change.

@bundolee bundolee force-pushed the fix/issue-404-hybrid-warning-guidance branch from 4de0bc3 to 1284f73 Compare April 15, 2026 10:39

codecov bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Contributor

@bundolee bundolee left a comment

Good catch — the old warning pointed users to --hybrid-mode, which is just a mode selector and doesn't actually enable the hybrid backend. This fix gives them the correct, actionable flag.

Clean change, focused test update. Thanks!

One thought for a future issue: recommending hybrid OCR as the go-to fix for CID font mapping problems may not always be the best advice — OCR can actually be worse than incomplete text extraction on vector-text PDFs. But that's a separate discussion about the diagnostic message itself, not this PR's scope.

@bundolee bundolee merged commit feaec4f into opendataloader-project:main Apr 15, 2026
5 checks passed

Development

Successfully merging this pull request may close these issues.

Warning suggests --hybrid-mode for OCR fallback, but hybrid backend is not enabled without --hybrid

3 participants