Image extraction: silent fallback on invalid Indexed bits-per-component (0 or not in {1,2,4,8}) #338

@yfedoseev

Description

Summary

expand_indexed_to_rgb() in src/extractors/images.rs has two lenient-parsing behaviors around bits_per_component (bpc) that silently produce wrong output instead of surfacing malformed input:

  1. bpc.max(1) coerces bpc = 0 to 1 — so a PDF declaring /BitsPerComponent 0 (invalid per §8.9.5.1, which mandates 1/2/4/8 for Indexed) still allocates a bytes_per_row and returns pixels.
  2. The _ => 0 arm of the inner match bpc { 1 => …, 2 => …, 4 => …, 8 => …, _ => 0 } silently maps every unsupported value to palette index 0, so every pixel of the output is palette entry 0 (a solid-color image) for bpc = 3, 5, 6, 7, 9, ….
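
To make the two behaviors concrete, here is a minimal sketch of the kind of unpacking logic described above. It is not the actual code from src/extractors/images.rs: the function names palette_index and bytes_per_row, their signatures, and the exact row-size formula are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the per-pixel index lookup; not pdf_oxide's code.
/// Samples are packed high-order bit first, as in PDF image data.
fn palette_index(row: &[u8], x: usize, bpc: u8) -> u8 {
    match bpc {
        1 => (row[x / 8] >> (7 - (x % 8))) & 0x01,
        2 => (row[x / 4] >> (6 - 2 * (x % 4))) & 0x03,
        4 => (row[x / 2] >> (4 - 4 * (x % 2))) & 0x0F,
        8 => row[x],
        // Behavior 2: any other bpc silently becomes palette index 0.
        _ => 0,
    }
}

/// Hypothetical row-size helper showing behavior 1: bpc = 0 is coerced to 1,
/// so a nonzero row length is still computed for an invalid image.
fn bytes_per_row(width: usize, bpc: u8) -> usize {
    (width * usize::from(bpc.max(1)) + 7) / 8
}
```

With this sketch, palette_index(&[0xB4], 0, 3) returns 0 regardless of the sample data, and bytes_per_row(10, 0) returns 2, the same as for a valid 1-bit image.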

pdf_oxide's stated parsing philosophy is lenient-on-malformed-input — return best-effort pixels and let the caller decide — so returning a wrong-but-non-crashing image is intentional. But silent fallback makes these conditions invisible to consumers who'd want to know the input was malformed (especially automated pipelines deciding whether to OCR the image).

Proposed minimal fix

Keep the lenient behavior but emit a log::warn! once per image when the fallback fires, so a caller running with RUST_LOG=pdf_oxide=warn can distinguish "extracted cleanly" from "extracted with guesswork":

let bpc = if !matches!(bpc, 1 | 2 | 4 | 8) {
    log::warn!(
        "Indexed image has unsupported bits-per-component={bpc} \
         (spec allows 1/2/4/8). Pixels will be filled with palette entry 0; \
         recommend re-extracting with --strict or treating the image as OCR fallback."
    );
    0u8  // sentinel that the match arm will handle via _ => 0
} else {
    bpc
};

Plus a matching warning for bits_per_component = 0 before the .max(1) coercion.

Proposed stricter alternative (behind a feature / flag)

Add a strict mode to expand_indexed_to_rgb or the FFI caller where invalid bpc returns Error::Image("Indexed image bpc={bpc} is not in spec-allowed {1,2,4,8}"). Default stays lenient, opt-in strict for pipelines that want hard failures.
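
One possible shape for the opt-in strict path is sketched below. The Error enum here is a local stand-in that only mirrors the Error::Image(...) variant named above, and BpcMode and check_indexed_bpc are hypothetical names; how the mode would actually be threaded through expand_indexed_to_rgb or the FFI layer is left open.

```rust
/// Local stand-in for the crate's error type; only the variant named in this
/// issue is modeled.
#[derive(Debug, PartialEq)]
enum Error {
    Image(String),
}

/// Hypothetical switch between today's lenient behavior and the proposed
/// strict behavior.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BpcMode {
    Lenient,
    Strict,
}

/// Validates bits-per-component for an Indexed image.
/// Lenient mode keeps the existing fallback (0 is the sentinel that reaches
/// the `_ => 0` arm); strict mode turns the same condition into an error.
fn check_indexed_bpc(bpc: u8, mode: BpcMode) -> Result<u8, Error> {
    match (bpc, mode) {
        (1 | 2 | 4 | 8, _) => Ok(bpc),
        (_, BpcMode::Strict) => Err(Error::Image(format!(
            "Indexed image bpc={bpc} is not in spec-allowed {{1,2,4,8}}"
        ))),
        (_, BpcMode::Lenient) => Ok(0),
    }
}
```

Keeping the check in one function means the lenient and strict paths cannot drift apart on what counts as a valid value.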

Acceptance criteria

  • A PDF with /Indexed … /BitsPerComponent 3 surfaces a clear warn-level log line identifying the bpc and the image (name or XObject ref).
  • Default behavior is unchanged — no new errors on existing corpus.
  • Unit test asserting the warning fires for the fallback path (via log::set_logger test fixture).
  • Docstring updated to document both the warn emission and the fallback semantics.
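
The first and third criteria could be checked along the following lines. To keep the sketch dependency-free, a closure stands in for log::warn!; a real test would instead install a capturing logger via log::set_logger as stated above. The names normalize_indexed_bpc and collect_warnings, the image parameter, and the message wording are all assumptions.

```rust
/// Hypothetical validation step: returns the bpc to use and reports at most
/// one warning per call (that is, per image) through the `warn` sink.
fn normalize_indexed_bpc(bpc: u8, image: &str, warn: &mut dyn FnMut(String)) -> u8 {
    if matches!(bpc, 1 | 2 | 4 | 8) {
        bpc
    } else {
        warn(format!(
            "Indexed image {image} has unsupported bits-per-component={bpc} \
             (spec allows 1/2/4/8)"
        ));
        0 // sentinel that reaches the `_ => 0` fallback arm
    }
}

/// Test helper: runs the check and collects every warning it emitted.
fn collect_warnings(bpc: u8, image: &str) -> (u8, Vec<String>) {
    let mut seen = Vec::new();
    let out = normalize_indexed_bpc(bpc, image, &mut |msg: String| seen.push(msg));
    (out, seen)
}
```

A test can then assert that collect_warnings(3, "Im0") yields exactly one message naming both the bpc and the image, and that valid values such as 8 yield none.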
