Image extraction: Indexed color spaces with Lab / CalRGB / ICCBased base aren't colorimetrically converted #337

@yfedoseev

Summary

resolve_indexed_palette() currently derives base_fmt by calling color_space_to_pixel_format(&base_cs), which maps several non-device color spaces (Lab, CalRGB, CalGray, and many ICCBased variants) to PixelFormat::RGB. expand_indexed_to_rgb() then treats the raw palette bytes as if they were already RGB and runs no colorimetric conversion. Because the byte layout happens to line up, extraction succeeds and the output can look "roughly right", but the colors are wrong, most visibly for perceptually uniform spaces such as Lab.

This is a real enhancement surfaced by Copilot review on #312. The common case (DeviceRGB / DeviceGray / DeviceCMYK base, which is what real-world PDFs actually use) works correctly — the #311 fix is strictly better than v0.3.23 on every file in the test corpus. This ticket tracks the follow-up work to handle the other base color spaces correctly.

Affected base color spaces

Per PDF 32000-1:2008 §8.6.6.3, the base of an Indexed color space can be any of:

| Base | Palette encoding | Status after #311 |
| --- | --- | --- |
| DeviceGray | 1 byte / palette entry | ✅ correct |
| DeviceRGB | 3 bytes / palette entry | ✅ correct |
| DeviceCMYK | 4 bytes / palette entry | ✅ correct (via cmyk_to_rgb converter) |
| CalGray | 1 byte; A component + gamma | ❌ treated as DeviceGray, no gamma correction |
| CalRGB | 3 bytes; A/B/C components + gamma/matrix | ❌ treated as DeviceRGB, no calibration |
| Lab | 3 bytes; L*, a*, b* | ❌ treated as RGB, wildly wrong colors |
| ICCBased | N bytes matching ICC profile | ❌ treated by its declared /Alternate channel count; no profile application |
| DeviceN / Separation | 1+ bytes; requires tint transform | ❌ falls through to whatever color_space_to_pixel_format returns |
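
To make the table concrete, here is a minimal Rust sketch of how one palette entry might be sliced out of the lookup table and how a raw byte maps onto a component range. `BaseKind`, `palette_entry`, and `scale_component` are hypothetical names for illustration, not this crate's actual API. The byte scaling follows the Indexed lookup rule in §8.6.6.3 (0 maps to the component minimum, 255 to the maximum).

```rust
/// Base color space kinds from the table above.
/// Hypothetical enum for this sketch, not the crate's real type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BaseKind {
    DeviceGray,
    DeviceRGB,
    DeviceCMYK,
    CalGray,
    CalRGB,
    Lab,
    /// ICCBased: `n` is the profile stream's /N entry (1, 3, or 4).
    IccBased { n: usize },
    /// DeviceN with `colorants` names; Separation is the one-colorant case.
    DeviceN { colorants: usize },
}

impl BaseKind {
    /// Number of bytes one palette entry occupies in the lookup table.
    pub fn components(self) -> usize {
        match self {
            BaseKind::DeviceGray | BaseKind::CalGray => 1,
            BaseKind::DeviceRGB | BaseKind::CalRGB | BaseKind::Lab => 3,
            BaseKind::DeviceCMYK => 4,
            BaseKind::IccBased { n } => n,
            BaseKind::DeviceN { colorants } => colorants,
        }
    }
}

/// Raw bytes of palette entry `index`, or `None` if the table is too short.
pub fn palette_entry(lookup: &[u8], base: BaseKind, index: usize) -> Option<&[u8]> {
    let n = base.components();
    let start = index.checked_mul(n)?;
    let end = start.checked_add(n)?;
    lookup.get(start..end)
}

/// Scale a palette byte onto a component range: 0 -> `min`, 255 -> `max`.
pub fn scale_component(byte: u8, min: f64, max: f64) -> f64 {
    min + f64::from(byte) / 255.0 * (max - min)
}
```

For a Lab base with the default /Range, `scale_component(255, -100.0, 100.0)` yields 100.0 for a*, which is why passing that byte through as an RGB channel value is not meaningful.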

Proposed fix

Three phases, depending on effort budget:

Phase 1: correctness on the Cal* / Lab paths

For CalGray, CalRGB, and Lab, the palette bytes are component values in a defined range. The spec gives exact conversion formulas to CIE XYZ (§8.6.5.2, §8.6.5.3, §8.6.5.4). Implement them in new cal_palette_to_rgb() / lab_palette_to_rgb() helpers and dispatch from resolve_indexed_palette() based on base_cs. Colors become correct without touching the expander.
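
For the Lab case, a helper along these lines might work. This is a sketch under stated assumptions, not the crate's implementation: `LabParams`, `lab_to_srgb`, and the body of `lab_palette_to_rgb` are hypothetical, and the final XYZ-to-sRGB step assumes a D50 white point (the spec's formulas stop at XYZ, so that last step is a design choice). The Lab-to-XYZ part follows §8.6.5.4.

```rust
/// Entries of a PDF Lab color space dictionary.
/// Hypothetical struct for this sketch, not the crate's real type.
pub struct LabParams {
    /// /WhitePoint as [Xw, Yw, Zw].
    pub white_point: [f64; 3],
    /// /Range as [a_min, a_max, b_min, b_max]; spec default is [-100, 100, -100, 100].
    pub range: [f64; 4],
}

/// The piecewise function g(x) from the Lab definition (§8.6.5.4).
fn g(x: f64) -> f64 {
    if x >= 6.0 / 29.0 {
        x * x * x
    } else {
        (108.0 / 841.0) * (x - 4.0 / 29.0)
    }
}

/// Encode a linear-light value with the sRGB transfer curve, clamping to [0, 1].
fn srgb_encode(linear: f64) -> u8 {
    let c = linear.clamp(0.0, 1.0);
    let v = if c <= 0.0031308 { 12.92 * c } else { 1.055 * c.powf(1.0 / 2.4) - 0.055 };
    (v * 255.0).round() as u8
}

/// Convert one L*a*b* triple to 8-bit sRGB.
/// ASSUMPTION: the white point is D50, so a fixed D50-adapted XYZ -> linear sRGB
/// matrix is used; other white points would need chromatic adaptation first.
pub fn lab_to_srgb(l: f64, a: f64, b: f64, p: &LabParams) -> [u8; 3] {
    let m = (l + 16.0) / 116.0;
    let x = p.white_point[0] * g(m + a / 500.0);
    let y = p.white_point[1] * g(m);
    let z = p.white_point[2] * g(m - b / 200.0);
    [
        srgb_encode(3.1338561 * x - 1.6168667 * y - 0.4906146 * z),
        srgb_encode(-0.9787684 * x + 1.9161415 * y + 0.0334540 * z),
        srgb_encode(0.0719453 * x - 0.2289914 * y + 1.4052427 * z),
    ]
}

/// Convert a packed 3-bytes-per-entry Lab palette to packed 8-bit sRGB.
/// Trailing bytes that do not form a full entry are ignored.
pub fn lab_palette_to_rgb(palette: &[u8], p: &LabParams) -> Vec<u8> {
    // Indexed lookup bytes scale linearly onto each component's range.
    let scale = |byte: u8, min: f64, max: f64| min + f64::from(byte) / 255.0 * (max - min);
    palette
        .chunks_exact(3)
        .flat_map(|e| {
            let l = scale(e[0], 0.0, 100.0);
            let a = scale(e[1], p.range[0], p.range[1]);
            let b = scale(e[2], p.range[2], p.range[3]);
            lab_to_srgb(l, a, b, p)
        })
        .collect()
}
```

Because the output is already packed RGB, the existing expander could consume it unchanged, which matches the "without touching the expander" goal.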

Phase 2: ICCBased

ICCBased is harder: a proper implementation applies the embedded ICC profile via lcms2-rs or qcms. Phase 2a is "fall back to the /Alternate color space" (§8.6.5.5), which typically gives DeviceRGB or DeviceCMYK, both already handled correctly after #311 (a Cal* or Lab alternate would go through the phase 1 helpers). Phase 2b is full profile application.
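
Phase 2a could be sketched roughly as follows. The `ColorSpace` enum and `resolve_icc_fallback` are hypothetical stand-ins, not the crate's real types. The rule for a missing /Alternate (pick DeviceGray, DeviceRGB, or DeviceCMYK from /N) follows §8.6.5.5.

```rust
/// Simplified color space model.
/// Hypothetical enum for this sketch, not the crate's real type.
#[derive(Clone, Debug, PartialEq)]
pub enum ColorSpace {
    DeviceGray,
    DeviceRGB,
    DeviceCMYK,
    CalGray,
    CalRGB,
    Lab,
    /// `n` is the stream's /N entry; `alternate` is its optional /Alternate.
    IccBased { n: u8, alternate: Option<Box<ColorSpace>> },
}

/// Pick the color space to use when the ICC profile itself is not applied.
/// Returns `None` when there is no /Alternate and /N is not 1, 3, or 4.
pub fn resolve_icc_fallback(cs: &ColorSpace) -> Option<ColorSpace> {
    match cs {
        ColorSpace::IccBased { n, alternate } => match alternate {
            // An explicit /Alternate wins; recurse in case it is itself ICCBased.
            Some(alt) => resolve_icc_fallback(alt),
            // No /Alternate: fall back to the device space implied by /N.
            None => match *n {
                1 => Some(ColorSpace::DeviceGray),
                3 => Some(ColorSpace::DeviceRGB),
                4 => Some(ColorSpace::DeviceCMYK),
                _ => None,
            },
        },
        // Anything that is not ICCBased is returned unchanged.
        other => Some(other.clone()),
    }
}
```

Returning `None` for an unexpected /N keeps the "fall back rather than guess" behavior explicit: the caller decides how to report the unsupported case.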

Phase 3: DeviceN / Separation

Tint transforms are PDF function objects embedded in the file, often Type 4 (PostScript calculator) functions. Evaluating them requires a Function object parser and evaluator. Pragmatic fallback: treat DeviceN / Separation as its alternate color space.
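
Not every tint transform needs a PostScript evaluator. As a partial step, a Separation whose tint transform is a Type 2 (exponential interpolation) function could be evaluated directly. The sketch below covers only that case and assumes a /Domain of [0 1]; `ExponentialFn` and `separation_palette_to_alternate` are hypothetical names, and Type 0, 3, and 4 functions are not handled.

```rust
/// A PDF Type 2 (exponential interpolation) function with a single input.
/// Hypothetical struct for this sketch, not the crate's real type.
pub struct ExponentialFn {
    /// /C0: output when the input is 0.0.
    pub c0: Vec<f64>,
    /// /C1: output when the input is 1.0.
    pub c1: Vec<f64>,
    /// /N: the interpolation exponent.
    pub n: f64,
}

impl ExponentialFn {
    /// y_j = C0_j + x^N * (C1_j - C0_j).
    /// ASSUMPTION: /Domain is [0 1], so the input is clamped to that range.
    pub fn eval(&self, x: f64) -> Vec<f64> {
        let xn = x.clamp(0.0, 1.0).powf(self.n);
        self.c0
            .iter()
            .zip(&self.c1)
            .map(|(c0, c1)| c0 + xn * (c1 - c0))
            .collect()
    }
}

/// Map a one-byte-per-entry Separation palette to alternate-space components.
/// Each byte is a tint value scaled from 0..=255 onto 0.0..=1.0.
pub fn separation_palette_to_alternate(palette: &[u8], tint: &ExponentialFn) -> Vec<Vec<f64>> {
    palette
        .iter()
        .map(|&t| tint.eval(f64::from(t) / 255.0))
        .collect()
}
```

The result is expressed in the alternate space (for example CMYK components), so it would still need the existing device conversion before reaching the expander.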

Acceptance criteria

  • Unit tests for Indexed + CalRGB (with non-identity gamma), Indexed + Lab (with a/b ≠ 0), and Indexed + ICCBased-with-Alternate that pin the corrected pixel output.
  • Existing v0.3.25 test corpus (from the fix for #311, "[Bug]: Error: Image error: Invalid RGB image dimensions") still passes byte-identically.
  • resolve_indexed_palette() no longer blindly routes through color_space_to_pixel_format() for non-Device bases.
  • Documented behavior for ICCBased with unknown profile: fall back to /Alternate rather than guessing.
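
For the CalRGB criterion, the arithmetic a test would pin can be sketched as below, following the CalRGB definition in §8.6.5.3. `CalRgbParams` and `cal_rgb_entry_to_xyz` are hypothetical names, and the sketch stops at CIE XYZ: the XYZ-to-output-RGB step is not shown and would depend on how the crate chooses its output space.

```rust
/// Entries of a PDF CalRGB color space dictionary.
/// Hypothetical struct for this sketch, not the crate's real type.
pub struct CalRgbParams {
    /// /Gamma as [GR, GG, GB]; the spec default is [1, 1, 1].
    pub gamma: [f64; 3],
    /// /Matrix as [XA, YA, ZA, XB, YB, ZB, XC, YC, ZC]; the default is identity.
    pub matrix: [f64; 9],
}

/// Decode one 3-byte CalRGB palette entry to CIE XYZ.
pub fn cal_rgb_entry_to_xyz(entry: [u8; 3], p: &CalRgbParams) -> [f64; 3] {
    // Palette bytes scale onto the component range 0.0..=1.0,
    // then each component is raised to its own gamma.
    let mut lin = [0.0; 3];
    for i in 0..3 {
        lin[i] = (f64::from(entry[i]) / 255.0).powf(p.gamma[i]);
    }
    // X = XA*A' + XB*B' + XC*C', and likewise for Y and Z.
    let m = &p.matrix;
    [
        m[0] * lin[0] + m[3] * lin[1] + m[6] * lin[2],
        m[1] * lin[0] + m[4] * lin[1] + m[7] * lin[2],
        m[2] * lin[0] + m[5] * lin[1] + m[8] * lin[2],
    ]
}
```

With a gamma of [2, 2, 2] and a matrix whose Y weights sum to 1, the palette entry (128, 128, 128) gives Y of about 0.252, compared with about 0.502 when gamma is ignored. A test that pins this difference would catch a regression to the DeviceRGB pass-through.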
