LZ4_RAW heap decompression fails on chunked BytesInput materialization for large pages #3478

@arouel

Description

When reading LZ4_RAW-compressed data through the heap codec path, decompression can fail if the decompressed page is larger than the chunk size used by stream materialization (typically ~8 KiB via Channels.newChannel(...)).
This appears in paths that materialize BytesInput lazily (for example via BytesInput.copy(...) / toByteBuffer(...), including dictionary-filter related reads).
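The chunk size comes from java.nio's stream-to-channel adapter (visible in the stack trace at `Channels.java:318`), not from Parquet itself: in OpenJDK the adapter copies through an internal transfer buffer of 8 KiB. A minimal stdlib-only sketch (hypothetical class names, no Parquet dependencies) that records the largest single request the underlying stream ever sees when drained through `Channels.newChannel(...)`:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ChunkedReadDemo {

    /** Wraps an InputStream and records the largest read(byte[], off, len) request it receives. */
    static final class RecordingStream extends InputStream {
        private final InputStream in;
        int maxRequested = 0;

        RecordingStream(InputStream in) { this.in = in; }

        @Override public int read() throws IOException { return in.read(); }

        @Override public int read(byte[] b, int off, int len) throws IOException {
            maxRequested = Math.max(maxRequested, len);
            return in.read(b, off, len);
        }
    }

    /** Drains a 64 KiB "page" through Channels.newChannel and returns the largest request seen. */
    static int largestRequestSeen() {
        byte[] page = new byte[64 * 1024];  // larger than the adapter's internal buffer
        RecordingStream stream = new RecordingStream(new ByteArrayInputStream(page));
        try (ReadableByteChannel channel = Channels.newChannel(stream)) {
            ByteBuffer target = ByteBuffer.allocate(page.length);
            while (target.hasRemaining() && channel.read(target) >= 0) {
                // keep reading until the destination buffer is full or EOF
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stream.maxRequested;
    }

    public static void main(String[] args) {
        // The channel adapter copies through an internal transfer buffer, so the
        // underlying stream never sees a single request larger than ~8 KiB, no
        // matter how large the destination ByteBuffer is.
        System.out.println("largest single read request = " + largestRequestSeen());
    }
}
```

This is why a decompressor stream behind `BytesInput` materialization is first asked for at most ~8 KiB even when the page is much larger.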

Error

io.airlift.compress.MalformedInputException: all input must be consumed: offset=2532
	at io.airlift.compress.lz4.Lz4RawDecompressor.decompress(Lz4RawDecompressor.java:89)
	at io.airlift.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:98)
	at org.apache.parquet.hadoop.codec.Lz4RawDecompressor.uncompress(Lz4RawDecompressor.java:39)
	at org.apache.parquet.hadoop.codec.NonBlockedDecompressor.decompress(NonBlockedDecompressor.java:81)
	at org.apache.parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51)
	at java.base/java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:318)
	at org.apache.parquet.bytes.BytesInput$StreamBytesInput.writeInto(BytesInput.java:384)
	at org.apache.parquet.bytes.BytesInput.copy(BytesInput.java:270)
	at org.apache.parquet.bytes.BytesInput.copy(BytesInput.java:280)
	at org.apache.parquet.hadoop.DictionaryPageReader.reusableCopy(DictionaryPageReader.java:113)
	at org.apache.parquet.hadoop.DictionaryPageReader.lambda$readDictionaryPage$0(DictionaryPageReader.java:104)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1735)
	at org.apache.parquet.hadoop.DictionaryPageReader.readDictionaryPage(DictionaryPageReader.java:97)
	at org.apache.parquet.filter2.dictionarylevel.DictionaryFilter.expandDictionary(DictionaryFilter.java:93)
	at org.apache.parquet.filter2.dictionarylevel.DictionaryFilter.visit(DictionaryFilter.java:160)
	at org.apache.parquet.filter2.dictionarylevel.DictionaryFilter.visit(DictionaryFilter.java:62)
	at org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:189)
	at org.apache.parquet.filter2.dictionarylevel.DictionaryFilter.canDrop(DictionaryFilter.java:72)
	at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:107)
	at org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:43)
	at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:157)
	at org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:71)
	at org.apache.parquet.hadoop.ParquetFileReader.filterRowGroups(ParquetFileReader.java:1090)

Reproducer

  @Test
  public void lz4RawHeapDecompressorCanCopyLargePage() throws IOException {
    final int size = 16 * 1024;
    final byte[] raw = new byte[size];
    new Random(42).nextBytes(raw);

    try (TrackingByteBufferAllocator allocator = TrackingByteBufferAllocator.wrap(new DirectByteBufferAllocator());
        ByteBufferReleaser releaser = new ByteBufferReleaser(allocator)) {
      CodecFactory heapCodecFactory = new CodecFactory(new Configuration(), size);
      BytesInputCompressor compressor = heapCodecFactory.getCompressor(LZ4_RAW);
      BytesInputDecompressor decompressor = heapCodecFactory.getDecompressor(LZ4_RAW);

      BytesInput compressed = compressor.compress(BytesInput.from(raw));
      BytesInput decompressed = decompressor.decompress(compressed, size);

      // Regression coverage: previously this copy path hit StreamBytesInput.writeInto(...),
      // which reads via Channels.newChannel(...) in 8KB chunks and failed for LZ4_RAW.
      BytesInput copied = decompressed.copy(releaser);
      Assert.assertArrayEquals(raw, copied.toByteArray());

      compressor.release();
      decompressor.release();
      heapCodecFactory.release();
    }
  }

Suspected root cause

  • Lz4RawDecompressor.maxUncompressedLength(...) returns the caller-supplied len (the requested read size) rather than the true decompressed page size.
  • NonBlockedDecompressor.decompress(...) uses that estimate to size the output buffer and decompresses the whole page in one shot.
  • In a chunked stream read, the first call may request only ~8 KiB even though the page is larger, so the one-shot decompression fails ("all input must be consumed") and subsequent reads see incorrect decompressor state, ending in a zero-byte read error.
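To make that interaction concrete, here is a stdlib-only toy model (not Parquet code; all names are hypothetical): a "one-shot" decompressor that sizes its output from the caller-requested length, as the bullets describe, fails as soon as a chunked reader requests fewer bytes than the full page:

```java
import java.util.Arrays;

public class OneShotDemo {

    // Toy stand-in for a one-shot codec: it can only decode the whole page at
    // once, and (like the reported bug) sizes its output buffer from the
    // caller-requested length instead of the true decompressed page size.
    static byte[] oneShotDecompress(byte[] page, int requestedLen) {
        byte[] out = new byte[requestedLen];  // estimate = requested read size
        if (requestedLen < page.length) {
            // Mirrors the airlift failure mode: the whole input must be
            // consumed in one call, but the output buffer is too small.
            throw new IllegalStateException("all input must be consumed");
        }
        System.arraycopy(page, 0, out, 0, page.length);
        return out;
    }

    static final int CHUNK = 8 * 1024;  // stream materialization chunk size

    public static void main(String[] args) {
        byte[] page = new byte[16 * 1024];  // page larger than one chunk
        try {
            // A chunked reader asks for only the first CHUNK bytes...
            oneShotDecompress(page, CHUNK);
            System.out.println("chunked read: ok");
        } catch (IllegalStateException e) {
            System.out.println("chunked read: " + e.getMessage());
        }
        // ...whereas a single full-size request succeeds.
        byte[] full = oneShotDecompress(page, page.length);
        System.out.println("full read round-trips: " + Arrays.equals(full, page));
    }
}
```

The same page decodes fine when the whole decompressed size is requested up front, which matches the observation that only pages larger than the chunk size fail.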

Impact

  • Correctness/reliability bug in the LZ4_RAW heap decompression path.
  • It breaks real reads (not just tests) whenever a decompressed page exceeds the ~8 KiB chunk size.

Version

1.17.0 (older versions may also be affected)

Component(s)

Core
