Summary
When OpenDataLoader detects CID-keyed fonts without ToUnicode mappings, it logs a warning that says:
Text extraction may be incomplete. Consider using --hybrid-mode for OCR fallback.
This is misleading because --hybrid-mode alone does not enable the hybrid backend. According to the CLI options and docs, users must also pass --hybrid <backend> (for example --hybrid docling-fast) for OCR/backend processing to happen at all.
Current Behavior
The warning in ContentFilterProcessor currently says:
"Text extraction may be incomplete. Consider using --hybrid-mode for OCR fallback."
Location:
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ContentFilterProcessor.java
Why this is a problem
--hybrid-mode is only a mode selector (auto / full), not the switch that enables the hybrid backend.
From the CLI/docs/generated bindings, hybrid usage looks like:
opendataloader-pdf --hybrid docling-fast file.pdf
and optionally:
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file.pdf
So the current warning points users to an option that is insufficient on its own.
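The relationship between the two flags can be sketched as follows. This is a minimal illustration, not the actual CLIOptions code: the class HybridFlagSketch and its helper methods are hypothetical, and it assumes that hybrid processing is active only when a backend is named via --hybrid, as the docs describe.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the real parsing lives in CLIOptions and may differ.
final class HybridFlagSketch {

    // Returns the token following `flag`, or null if the flag is absent or has no value.
    static String valueOf(List<String> args, String flag) {
        int i = args.indexOf(flag);
        return (i >= 0 && i + 1 < args.size()) ? args.get(i + 1) : null;
    }

    // Assumption: the hybrid backend runs only when --hybrid names a backend.
    // --hybrid-mode merely selects auto/full once hybrid is already enabled.
    static boolean hybridEnabled(List<String> args) {
        return valueOf(args, "--hybrid") != null;
    }

    public static void main(String[] argv) {
        // What the current warning suggests: mode selector only.
        System.out.println(hybridEnabled(Arrays.asList("--hybrid-mode", "full", "file.pdf")));
        // What the docs describe: a backend is named.
        System.out.println(hybridEnabled(Arrays.asList("--hybrid", "docling-fast", "file.pdf")));
    }
}
```

Under these assumptions, following the current warning literally leaves the hybrid backend disabled.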
Evidence
Warning text
java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ContentFilterProcessor.java
CLI option semantics
java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIOptions.java
content/docs/cli-options-reference.mdx
README.md
generated Python/Node bindings (convert_generated.py, convert-options.generated.ts)
Tests
CLIOptionsTest covers --hybrid-mode together with --hybrid, which matches the documented behavior.
Suggested Fix
Update the warning to recommend a complete actionable command, for example:
Consider using --hybrid docling-fast for OCR fallback.
or, if needed:
Consider using --hybrid docling-fast --hybrid-mode full ...
A more explicit message might be:
Text extraction may be incomplete. Consider enabling hybrid OCR fallback with --hybrid docling-fast.
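One possible shape for the change, as a rough sketch only: the class CidFontWarningSketch, the constant DEFAULT_BACKEND, and the method warning are hypothetical names, and the way ContentFilterProcessor actually builds and logs the message is not shown in this issue.

```java
// Hypothetical sketch; not the actual ContentFilterProcessor code.
final class CidFontWarningSketch {

    // Backend name used in the docs' examples; assumed here as the suggested default.
    static final String DEFAULT_BACKEND = "docling-fast";

    // Builds a warning that names the option which actually enables hybrid processing.
    static String warning(String backend) {
        return "Text extraction may be incomplete. "
                + "Consider enabling hybrid OCR fallback with --hybrid " + backend + ".";
    }

    public static void main(String[] argv) {
        System.out.println(warning(DEFAULT_BACKEND));
    }
}
```

Keeping the backend name in one constant would make it easier to keep the message in sync if the recommended backend changes.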
Expected Outcome
Users seeing this warning should be directed to the actual option that enables backend/OCR processing, not only the sub-mode selector.