Summary
expand_indexed_to_rgb() in src/extractors/images.rs has two lenient-parsing behaviors around bits_per_component (bpc) that silently produce wrong output instead of surfacing malformed input:
- bpc.max(1) coerces bpc = 0 to 1, so a PDF declaring /BitsPerComponent 0 (invalid per §8.9.5.1, which mandates 1/2/4/8 for Indexed) still allocates a bytes_per_row and returns pixels.
- The catch-all arm of the inner match bpc { 1 => …, 2 => …, 4 => …, 8 => …, _ => 0 } silently maps every unsupported value to palette index 0, producing an image filled entirely with palette entry 0 for bpc = 3, 5, 6, 7, 9, ….
pdf_oxide's stated parsing philosophy is lenient-on-malformed-input — return best-effort pixels and let the caller decide — so returning a wrong-but-non-crashing image is intentional. But silent fallback makes these conditions invisible to consumers who'd want to know the input was malformed (especially automated pipelines deciding whether to OCR the image).
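The two behaviors described above can be reproduced with a small stand-alone sketch. The helper names (bytes_per_row, palette_index, unpack_row) and the bit-unpacking layout are assumptions made for illustration, not the actual body of expand_indexed_to_rgb(); only the bpc.max(1) coercion and the _ => 0 arm are taken from this issue.

```rust
/// Hypothetical row-size helper mirroring the `bpc.max(1)` coercion:
/// a declared depth of 0 is sized exactly like a 1-bit image.
fn bytes_per_row(width: usize, bpc: u8) -> usize {
    (width * usize::from(bpc.max(1)) + 7) / 8
}

/// Hypothetical per-pixel lookup. Samples are packed high-order bit first.
fn palette_index(row: &[u8], x: usize, bpc: u8) -> u8 {
    match bpc {
        1 => (row[x / 8] >> (7 - (x % 8))) & 0x01,
        2 => (row[x / 4] >> (6 - 2 * (x % 4))) & 0x03,
        4 => (row[x / 2] >> (4 - 4 * (x % 2))) & 0x0F,
        8 => row[x],
        // Unsupported depth: the row is never read and every pixel
        // silently becomes palette entry 0.
        _ => 0,
    }
}

/// Unpacks `width` palette indices from one packed row.
fn unpack_row(row: &[u8], width: usize, bpc: u8) -> Vec<u8> {
    (0..width).map(|x| palette_index(row, x, bpc)).collect()
}
```

Under these assumptions, unpack_row(&[0b1011_0100], 4, 2) yields distinct indices, while the same row with bpc = 3 yields only zeros and no error, which is the invisible failure this issue is about.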
Proposed minimal fix
Keep the lenient behavior but emit a log::warn! once per image when the fallback fires, so a caller running with RUST_LOG=pdf_oxide=warn can distinguish "extracted cleanly" from "extracted with guesswork":
    let bpc = if !matches!(bpc, 1 | 2 | 4 | 8) {
        log::warn!(
            "Indexed image has unsupported bits-per-component={bpc} \
             (spec allows 1/2/4/8). Pixels will be filled with palette entry 0; \
             recommend re-extracting with --strict or treating the image as OCR fallback."
        );
        0u8 // sentinel that the match arm will handle via _ => 0
    } else {
        bpc
    };
Plus a matching warning for bits_per_component = 0 before the .max(1) coercion.
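One way to cover both cases with a single check is to validate the declared value before the .max(1) coercion, so that bpc = 0 is reported along with 3, 5, 6, and so on. The following is a minimal sketch under stated assumptions: effective_bpc is a hypothetical helper, and the warning is delivered through a caller-supplied closure so the example compiles without the log crate. In pdf_oxide the closure would presumably just forward to log::warn!.

```rust
/// Hypothetical helper: checks the *declared* depth first, reports
/// unsupported values once, then applies the existing lenient coercion.
/// Intended to be called once per image, not once per pixel.
fn effective_bpc(declared: u8, mut warn: impl FnMut(String)) -> u8 {
    if !matches!(declared, 1 | 2 | 4 | 8) {
        warn(format!(
            "Indexed image has unsupported bits-per-component={} \
             (spec allows 1/2/4/8). Pixels will be filled with palette entry 0.",
            declared
        ));
    }
    // Lenient behavior is unchanged: 0 is still coerced to 1 for row sizing,
    // and other unsupported values still reach the `_ => 0` match arm.
    declared.max(1)
}
```

Called as effective_bpc(bpc, |msg| eprintln!("{msg}")), this keeps the pixel output identical to today's while making the fallback observable.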
Proposed stricter alternative (behind a feature / flag)
Add a strict mode to expand_indexed_to_rgb or the FFI caller where invalid bpc returns Error::Image("Indexed image bpc={bpc} is not in spec-allowed {1,2,4,8}"). The default stays lenient; strict mode is opt-in for pipelines that want hard failures.
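A rough sketch of how the strict/lenient split might look follows. The Error enum here is a local stand-in with a single Image(String) variant, and validate_indexed_bpc and its strict parameter are hypothetical names; how the flag would actually be threaded through (function argument, Cargo feature, or FFI option) is left open.

```rust
/// Local stand-in for the crate's error type; only the variant named
/// in this issue is modeled.
#[derive(Debug, PartialEq)]
enum Error {
    Image(String),
}

/// Hypothetical validator: returns the depth to decode with, or an error
/// in strict mode when the declared value is outside {1, 2, 4, 8}.
fn validate_indexed_bpc(bpc: u8, strict: bool) -> Result<u8, Error> {
    match bpc {
        1 | 2 | 4 | 8 => Ok(bpc),
        // Literal braces must be doubled inside format! strings.
        _ if strict => Err(Error::Image(format!(
            "Indexed image bpc={} is not in spec-allowed {{1,2,4,8}}",
            bpc
        ))),
        // Lenient default: keep today's coercion and let decoding fall back.
        _ => Ok(bpc.max(1)),
    }
}
```

With strict set to false the function reproduces the current coercion, so existing callers would see no change unless they opt in.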
Acceptance criteria
- /Indexed … /BitsPerComponent 3 surfaces a clear warn-level log line identifying the bpc and the image (name or XObject ref).
- […] log::set_logger test fixture).
Related
- expand_indexed_to_rgb already has MAX_INDEXED_OUTPUT_BYTES / truncation guards from "Images: Indexed palette expander lacks overflow and truncation guards" (#324), so malformed bpc is the remaining soft-fail path.