Image extraction: Indexed color space falls back to "Invalid RGB image dimensions" when palette resolution returns Ok(None) #336

@yfedoseev

Summary

When resolve_indexed_palette() returns Ok(None), a recoverable "can't parse this Indexed color space" signal, the else-branch in extract_image_from_xobject() falls through to the pre-v0.3.25 code path that maps ColorSpace::Indexed → PixelFormat::RGB. The raw index stream (1 byte/px) is then reinterpreted as RGB (3 bytes/px), reproducing the exact "Invalid RGB image dimensions" failure mode that #311 was meant to eliminate, now on a narrower set of Indexed shapes.

This is an edge case of the #311 fix, discovered via Copilot review of #312. Not a regression against v0.3.23 — the pre-v0.3.25 code failed on every Indexed image; v0.3.25 fails only on Indexed color spaces whose lookup component isn't a String or Stream. But for those, the error message is misleading and the fallback produces garbage pixels.
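To make the byte-count mismatch concrete, here is a minimal sketch of the kind of length check that would reject such a buffer. The PixelFormat enum, bytes_per_pixel, and check_dimensions below are hypothetical stand-ins written for illustration, not the crate's actual types or validation code; only the error text is taken from this issue.

```rust
/// Hypothetical stand-in for the crate's pixel format enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PixelFormat {
    Gray,
    RGB,
}

impl PixelFormat {
    /// Bytes per pixel, assuming 8 bits per component.
    pub fn bytes_per_pixel(self) -> usize {
        match self {
            PixelFormat::Gray => 1,
            PixelFormat::RGB => 3,
        }
    }
}

/// Assumed shape of the dimension check: the buffer length must equal
/// width * height * bytes_per_pixel for the claimed format.
pub fn check_dimensions(
    width: usize,
    height: usize,
    format: PixelFormat,
    data_len: usize,
) -> Result<(), String> {
    let expected = width
        .checked_mul(height)
        .and_then(|n| n.checked_mul(format.bytes_per_pixel()))
        .ok_or_else(|| "image dimensions overflow".to_string())?;
    if data_len == expected {
        Ok(())
    } else {
        Err(format!(
            "Invalid RGB image dimensions: expected {} bytes, got {}",
            expected, data_len
        ))
    }
}
```

For a 4×4 Indexed image the decoded stream holds 16 index bytes. Labelled as RGB, the check expects 48 bytes and fails; labelled with a 1 byte/px format, the same buffer passes.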

Reproducer

Any PDF with an Indexed color space where the lookup element (arr[3]) is neither Object::String nor Object::Stream (e.g. Object::Array of hex bytes, an indirect reference resolved to something else, or a malformed dict). Testable synthetically via a minimal PDF generator that emits [/Indexed /DeviceRGB 255 <... array of bytes ...>].
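A synthetic generator only needs to emit the color-space array itself. The helper below is a sketch with a hypothetical name; it writes the lookup as a flat PDF array of integers and sets hival to the entry count minus one. It assumes the parser surfaces that fourth element as Object::Array, which is what triggers the Ok(None) path.

```rust
/// Hypothetical test helper: renders an Indexed color-space array whose
/// lookup is a PDF array of integers instead of a string or stream.
/// Returns None for an empty palette or more than 256 entries.
pub fn indexed_color_space_with_array_lookup(palette: &[[u8; 3]]) -> Option<String> {
    if palette.is_empty() || palette.len() > 256 {
        return None;
    }
    // hival is the highest valid index, i.e. entry count minus one.
    let hival = palette.len() - 1;
    let components: Vec<String> = palette
        .iter()
        .flatten()
        .map(|component| component.to_string())
        .collect();
    Some(format!(
        "[/Indexed /DeviceRGB {} [{}]]",
        hival,
        components.join(" ")
    ))
}
```

The resulting string can be spliced into the /ColorSpace entry of an image XObject dictionary in a hand-built test PDF.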

Root cause

src/extractors/images.rs:

// resolve_indexed_palette (~line 757)
let mut palette_bytes = match &lookup_obj {
    Object::String(s) => s.clone(),
    Object::Stream { .. } => lookup_obj.decode_stream_data()?,
    _ => return Ok(None),   // <-- this path
};

Callers treat Ok(None) as "not an Indexed color space" and fall through:

// extract_image_from_xobject (~line 658)
if let Some((base_fmt, palette)) = indexed_palette.as_ref() {
    // fast path
} else {
    let pixel_format = color_space_to_pixel_format(&color_space);
    ImageData::Raw { pixels: decoded_data, format: pixel_format }
    //                                            ^^^^^^^^^^^^^
    //    color_space is still ColorSpace::Indexed → maps to RGB
    //    but `decoded_data` is 1 byte/px (palette indices) not 3 bytes/px
}

Proposed fix

Stop tunneling two different failure modes through Ok(None). Either:

A. Tighter contract on the helper. Keep Ok(None) only for "not an Array / array length < 4 / not Indexed" (genuine "not my problem") and return Err(Error::Image("Indexed palette unresolved: <reason>")) for every shape that is an Indexed array but can't be parsed (lookup isn't String/Stream, palette empty, base color space parse failed). The caller's else-branch is then unreachable for Indexed inputs.
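A rough sketch of what the tighter contract in Option A could look like. Object and Error here are simplified stand-ins defined only for this example (no Stream variant, no base color space handling), and the function returns just the palette bytes rather than whatever the real helper returns.

```rust
/// Simplified stand-in for the crate's PDF object type.
#[derive(Debug, Clone, PartialEq)]
pub enum Object {
    Name(String),
    Integer(i64),
    String(Vec<u8>),
    Array(Vec<Object>),
}

/// Simplified stand-in for the crate's error type.
#[derive(Debug, PartialEq)]
pub enum Error {
    Image(String),
}

fn kind(obj: &Object) -> &'static str {
    match obj {
        Object::Name(_) => "Name",
        Object::Integer(_) => "Integer",
        Object::String(_) => "String",
        Object::Array(_) => "Array",
    }
}

/// Ok(None) means "this is not an Indexed color space at all".
/// Any Indexed array that cannot be parsed becomes an Err.
pub fn resolve_indexed_palette(cs: &Object) -> Result<Option<Vec<u8>>, Error> {
    // Not an array, or too short: genuinely not our problem.
    let arr = match cs {
        Object::Array(arr) if arr.len() >= 4 => arr,
        _ => return Ok(None),
    };
    // An array, but not tagged /Indexed: also not our problem.
    match &arr[0] {
        Object::Name(name) if name == "Indexed" => {}
        _ => return Ok(None),
    }
    // From here on the input claims to be Indexed, so failures are errors.
    match &arr[3] {
        Object::String(bytes) if !bytes.is_empty() => Ok(Some(bytes.clone())),
        Object::String(_) => Err(Error::Image(
            "Indexed palette unresolved: lookup is empty".to_string(),
        )),
        other => Err(Error::Image(format!(
            "Indexed palette unresolved: lookup is {}, expected String or Stream",
            kind(other)
        ))),
    }
}
```

With this split, the caller can keep treating Ok(None) as "use the ordinary color-space mapping" without ever reaching that branch for an Indexed input.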

B. Defensive caller. Keep the helper's contract but add an explicit check before the fall-through:

if matches!(color_space, ColorSpace::Indexed) && indexed_palette.is_none() {
    return Err(Error::Image(format!(
        "Indexed color space present but palette could not be resolved \
         (raw stream = {} bytes). PDF may be malformed or use an \
         unsupported lookup encoding.",
        decoded_data.len()
    )));
}
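For context, here is the same guard placed in a compilable toy version of the caller. ColorSpace, ImageData, Error, and build_image are simplified stand-ins invented for this sketch, and the fast path assumes 8-bit indices into a DeviceRGB palette; the real extract_image_from_xobject is more involved.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColorSpace {
    DeviceGray,
    DeviceRGB,
    Indexed,
}

#[derive(Debug, PartialEq)]
pub enum Error {
    Image(String),
}

#[derive(Debug, PartialEq)]
pub enum ImageData {
    Raw { pixels: Vec<u8>, bytes_per_pixel: usize },
}

pub fn build_image(
    color_space: ColorSpace,
    indexed_palette: Option<&[u8]>,
    decoded_data: Vec<u8>,
) -> Result<ImageData, Error> {
    // Fast path: expand each palette index into its RGB triple.
    if let Some(palette) = indexed_palette {
        let mut pixels = Vec::with_capacity(decoded_data.len() * 3);
        for &idx in &decoded_data {
            let start = idx as usize * 3;
            let rgb = palette.get(start..start + 3).ok_or_else(|| {
                Error::Image(format!("palette index {} out of range", idx))
            })?;
            pixels.extend_from_slice(rgb);
        }
        return Ok(ImageData::Raw { pixels, bytes_per_pixel: 3 });
    }
    // Option B guard: Indexed without a palette must not fall through.
    if matches!(color_space, ColorSpace::Indexed) {
        return Err(Error::Image(format!(
            "Indexed color space present but palette could not be resolved \
             (raw stream = {} bytes)",
            decoded_data.len()
        )));
    }
    let bytes_per_pixel = match color_space {
        ColorSpace::DeviceGray => 1,
        ColorSpace::DeviceRGB => 3,
        ColorSpace::Indexed => unreachable!("rejected by the guard above"),
    };
    Ok(ImageData::Raw { pixels: decoded_data, bytes_per_pixel })
}
```

The guard sits after the fast path and before the generic mapping, so non-Indexed images are unaffected.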

Option A is cleaner long-term (no in-band signaling). Option B is a single, local guard if we want to minimize #311 churn.

Acceptance criteria

  • extract_image_from_xobject on an Indexed color space whose lookup isn't a String/Stream returns Error::Image with a message that identifies the palette-resolution failure, not "Invalid RGB image dimensions".
  • Unit test with a synthetic [/Indexed /DeviceRGB 255 [[0, 0, 0], [255, 255, 255]]] shape (lookup as an Array) that currently falls through, pinning the new error path.
  • No regression on the real #311 corpus (Charltsing/report.pdf still extracts all 218 images cleanly).
