Description
When reading LZ4_RAW-compressed data through the heap codec path, decompression can fail if the decompressed page is larger than the chunk size used by stream materialization (typically ~8 KiB via Channels.newChannel(...)).
This appears in paths that materialize BytesInput lazily (for example via BytesInput.copy(...) / toByteBuffer(...), including dictionary-filter related reads).
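For context, the ~8 KiB figure comes from java.nio.channels.Channels.newChannel(...): the returned channel copies through a fixed-size internal buffer (8192 bytes in the JDK), so a larger page is pulled from the decompressor stream in several smaller reads. The standalone sketch below illustrates that chunking; the drain(...) helper and class name are made up for illustration and are not Parquet code, though the stack trace above shows the same wrap-in-a-channel-and-read pattern in StreamBytesInput.writeInto(...).

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ChunkedDrainSketch {
  // Materialize a stream into a buffer by wrapping it in a channel and reading until full.
  static ByteBuffer drain(InputStream in, int totalSize) throws IOException {
    ByteBuffer target = ByteBuffer.allocate(totalSize);
    ReadableByteChannel channel = Channels.newChannel(in);
    // Channels.newChannel(...) copies through an internal 8 KiB buffer, so each
    // read(...) call asks the underlying stream for at most ~8 KiB.
    while (target.hasRemaining() && channel.read(target) >= 0) {
      // keep reading until the buffer is full or the stream ends
    }
    target.flip();
    return target;
  }

  public static void main(String[] args) throws IOException {
    byte[] page = new byte[16 * 1024]; // a page larger than one channel chunk
    ByteBuffer out = drain(new ByteArrayInputStream(page), page.length);
    System.out.println("read " + out.remaining() + " bytes in multiple <=8 KiB chunks");
  }
}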
Error
io.airlift.compress.MalformedInputException: all input must be consumed: offset=2532
at io.airlift.compress.lz4.Lz4RawDecompressor.decompress(Lz4RawDecompressor.java:89)
at io.airlift.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:98)
at org.apache.parquet.hadoop.codec.Lz4RawDecompressor.uncompress(Lz4RawDecompressor.java:39)
at org.apache.parquet.hadoop.codec.NonBlockedDecompressor.decompress(NonBlockedDecompressor.java:81)
at org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51)
at java.base/java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:318)
at org.apache.parquet.bytes.BytesInput$StreamBytesInput.writeInto(BytesInput.java:384)
at org.apache.parquet.bytes.BytesInput.copy(BytesInput.java:270)
at org.apache.parquet.bytes.BytesInput.copy(BytesInput.java:280)
at org.apache.parquet.hadoop.DictionaryPageReader.reusableCopy(DictionaryPageReader.java:113)
at org.apache.parquet.hadoop.DictionaryPageReader.lambda$readDictionaryPage$0(DictionaryPageReader.java:104)
at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1735)
at org.apache.parquet.hadoop.DictionaryPageReader.readDictionaryPage(DictionaryPageReader.java:97)
at org.apache.parquet.filter2.dictionarylevel.DictionaryFilter.expandDictionary(DictionaryFilter.java:93)
at org.apache.parquet.filter2.dictionarylevel.DictionaryFilter.visit(DictionaryFilter.java:160)
at org.apache.parquet.filter2.dictionarylevel.DictionaryFilter.visit(DictionaryFilter.java:62)
at org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:189)
at org.apache.parquet.filter2.dictionarylevel.DictionaryFilter.canDrop(DictionaryFilter.java:72)
at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:107)
at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:43)
at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:157)
at org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:71)
at org.apache.parquet.hadoop.ParquetFileReader.filterRowGroups(ParquetFileReader.java:1090)
Reproducer
@Test
public void lz4RawHeapDecompressorCanCopyLargePage() throws IOException {
final int size = 16 * 1024;
final byte[] raw = new byte[size];
new Random(42).nextBytes(raw);
try (TrackingByteBufferAllocator allocator = TrackingByteBufferAllocator.wrap(new DirectByteBufferAllocator());
ByteBufferReleaser releaser = new ByteBufferReleaser(allocator)) {
final int pageSize = 64 * 1024; // buffer-size hint for CodecFactory; any reasonable value works here
CodecFactory heapCodecFactory = new CodecFactory(new Configuration(), pageSize);
BytesInputCompressor compressor = heapCodecFactory.getCompressor(LZ4_RAW);
BytesInputDecompressor decompressor = heapCodecFactory.getDecompressor(LZ4_RAW);
BytesInput compressed = compressor.compress(BytesInput.from(raw));
BytesInput decompressed = decompressor.decompress(compressed, size);
// Regression coverage: previously this copy path hit StreamBytesInput.writeInto(...),
// which reads via Channels.newChannel(...) in 8KB chunks and failed for LZ4_RAW.
BytesInput copied = decompressed.copy(releaser);
Assert.assertArrayEquals(raw, copied.toByteArray());
compressor.release();
decompressor.release();
heapCodecFactory.release();
}
}
Suspected root cause
- Lz4RawDecompressor.maxUncompressedLength(...) returns the caller-supplied len (the requested read size), not the true uncompressed page size.
- NonBlockedDecompressor.decompress(...) uses that estimate to size the output buffer and performs a one-shot decompression.
- In chunked stream reads, the first call may request only ~8 KiB even when the page is larger, leaving the decompressor in an incorrect state and causing a subsequent zero-byte read error (see the sketch below).
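A minimal standalone sketch of that failure mode, using aircompressor's io.airlift.compress.lz4 classes directly (the class name and sizes are illustrative, not Parquet code): when the one-shot decompression is handed an output buffer sized from an ~8 KiB read request instead of the real page size, it fails outright rather than returning a partial page.

import io.airlift.compress.lz4.Lz4Compressor;
import io.airlift.compress.lz4.Lz4Decompressor;

import java.util.Random;

public class UndersizedOutputSketch {
  public static void main(String[] args) {
    byte[] raw = new byte[16 * 1024]; // a page larger than the ~8 KiB stream chunk
    new Random(42).nextBytes(raw);

    Lz4Compressor compressor = new Lz4Compressor();
    byte[] compressed = new byte[compressor.maxCompressedLength(raw.length)];
    int compressedLen = compressor.compress(raw, 0, raw.length, compressed, 0, compressed.length);

    Lz4Decompressor decompressor = new Lz4Decompressor();

    // Output sized to the true page size: the one-shot decompression succeeds.
    byte[] fullOutput = new byte[raw.length];
    int n = decompressor.decompress(compressed, 0, compressedLen, fullOutput, 0, fullOutput.length);
    System.out.println("full-size output: decompressed " + n + " bytes");

    // Output sized from an ~8 KiB read request: the page does not fit, so the
    // decompressor throws (a MalformedInputException in aircompressor) instead of
    // returning a partial result.
    byte[] smallOutput = new byte[8 * 1024];
    try {
      decompressor.decompress(compressed, 0, compressedLen, smallOutput, 0, smallOutput.length);
    } catch (RuntimeException e) {
      System.out.println("undersized output fails: " + e);
    }
  }
}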
Impact
- Correctness/readability bug for LZ4_RAW in the heap decompression path.
- It breaks real reads (not just tests), especially when pages exceed the ~8 KiB chunk size.
Version
1.17.0 (older versions can also be affected)
Component(s)
Core