Warning suggests --hybrid-mode for OCR fallback, but hybrid backend is not enabled without --hybrid #404

@LocNguyenSGU

Summary

When OpenDataLoader detects CID-keyed fonts without ToUnicode mappings, it logs a warning that says:

Text extraction may be incomplete. Consider using --hybrid-mode for OCR fallback.

This is misleading because --hybrid-mode alone does not enable the hybrid backend. According to the CLI options and docs, users must also pass --hybrid <backend> (for example --hybrid docling-fast) for OCR/backend processing to happen at all.

Current Behavior

The warning in ContentFilterProcessor currently says:

"Text extraction may be incomplete. Consider using --hybrid-mode for OCR fallback."

Location:

  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ContentFilterProcessor.java

Why this is a problem

--hybrid-mode is only a mode selector (auto / full), not the switch that enables the hybrid backend.

From the CLI/docs/generated bindings, hybrid usage looks like:

opendataloader-pdf --hybrid docling-fast file.pdf

and optionally:

opendataloader-pdf --hybrid docling-fast --hybrid-mode full file.pdf

So the current warning points users to an option that is insufficient on its own.
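To make the distinction concrete, here is a minimal sketch of the flag semantics described above. It is not the project's actual option parsing: the class `HybridFlagSketch` and both helper methods are hypothetical, and it assumes only the space-separated `--hybrid <backend>` form shown in this issue.

```java
import java.util.Arrays;
import java.util.List;

public class HybridFlagSketch {

    // Hypothetical helper: the backend counts as enabled only when "--hybrid"
    // is present and followed by a value that is not another option.
    static boolean hybridBackendEnabled(List<String> args) {
        int i = args.indexOf("--hybrid"); // exact match, so "--hybrid-mode" does not count
        return i >= 0 && i + 1 < args.size() && !args.get(i + 1).startsWith("--");
    }

    // Hypothetical helper: "--hybrid-mode" has nothing to select when no
    // backend was enabled, which is the situation the current warning leads to.
    static boolean hybridModeIsInert(List<String> args) {
        return args.contains("--hybrid-mode") && !hybridBackendEnabled(args);
    }

    public static void main(String[] unused) {
        List<String> followingTheWarning = Arrays.asList("--hybrid-mode", "full", "file.pdf");
        List<String> documentedUsage =
                Arrays.asList("--hybrid", "docling-fast", "--hybrid-mode", "full", "file.pdf");

        System.out.println("warning advice inert: " + hybridModeIsInert(followingTheWarning));
        System.out.println("documented usage inert: " + hybridModeIsInert(documentedUsage));
    }
}
```

Under these assumptions, an invocation built only from the warning's advice never turns the backend on, while the documented form does.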

Evidence

Warning text

  • java/opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/processors/ContentFilterProcessor.java

CLI option semantics

  • java/opendataloader-pdf-cli/src/main/java/org/opendataloader/pdf/cli/CLIOptions.java
  • content/docs/cli-options-reference.mdx
  • README.md
  • generated Python/Node bindings (convert_generated.py, convert-options.generated.ts)

Tests

CLIOptionsTest covers --hybrid-mode together with --hybrid, which matches the documented behavior.

Suggested Fix

Update the warning to recommend a complete actionable command, for example:

  • Consider using --hybrid docling-fast for OCR fallback.
  • or, if needed, Consider using --hybrid docling-fast --hybrid-mode full ...

A more explicit message might be:

Text extraction may be incomplete. Consider enabling hybrid OCR fallback with --hybrid docling-fast.
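As a rough illustration of the change, the message could be built from a single constant so the suggested backend is easy to update. This is only a sketch: the class `CidFontWarningSketch`, the constant, and `buildWarning()` are hypothetical names, and the real code in `ContentFilterProcessor` may construct and log the message differently.

```java
public class CidFontWarningSketch {

    // Assumption: docling-fast is the backend used as the example in this issue.
    static final String EXAMPLE_BACKEND = "docling-fast";

    // Builds the proposed warning text, pointing at --hybrid <backend>
    // (the option that enables the backend) rather than --hybrid-mode.
    static String buildWarning() {
        return "Text extraction may be incomplete. "
                + "Consider enabling hybrid OCR fallback with --hybrid " + EXAMPLE_BACKEND + ".";
    }

    public static void main(String[] args) {
        // Stand-in for the logger call at the real warning site.
        System.out.println(buildWarning());
    }
}
```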

Expected Outcome

Users seeing this warning should be directed to the actual option that enables backend/OCR processing, not only the sub-mode selector.
