Skip to content

Optimizing Variant read path with lazy caching#3481

Open
nssalian wants to merge 1 commit intoapache:masterfrom
nssalian:variant-read-changes
Open

Optimizing Variant read path with lazy caching#3481
nssalian wants to merge 1 commit intoapache:masterfrom
nssalian:variant-read-changes

Conversation

@nssalian
Copy link
Copy Markdown

@nssalian nssalian commented Apr 15, 2026

Rationale for this change

Profiling in 3452 identified Variant.getFieldAtIndex() and metadata string lookups as hotspots during variant reads. Every call to getFieldByKey, getFieldAtIndex, and getElementAtIndex re-parses headers and re-allocates objects that could be cached.

What changes are included in this PR?

Adds lazy caching to Variant.java for metadata strings, object headers, and array headers. Field lookups in getFieldByKey now defer value construction until a match is found, and child Variants share the parent's metadata cache. Also removes two unused static helper methods.

Includes @steveloughran's string converter optimization from #3452: VariantBuilder.appendAsString(Binary) and its use in VariantConverters.

Are these changes tested?

Ran the benchmarks from 3452 locally

Before:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt      Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  11248.133 ±  696.176  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  15531.391 ± 1025.506  us/op

After:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt     Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  4601.967 ± 4434.474  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  7457.942 ± 3645.281  us/op

Most recent benchmarks after the thread safety changes

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt     Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  6142.534 ± 2839.243  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  8013.900 ± 2725.291  us/op

Are there any user-facing changes?

No.

@nssalian nssalian marked this pull request as ready for review April 15, 2026 15:00
Copy link
Copy Markdown
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks good; made some minor changes.

This should make a very big difference when selectively retrieving multiple fields within a single variant, or within a variant and nested children.

I do worry about concurrency now. The existing Variant didn't have issues here precisely because it recalculated everything.

We have to be confident that even if concurrent access triggers a duplicate cache operation, there's no harm in this. Otherwise cache access will have to be synchronized.

It all looks good to me.

Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java Outdated
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 15, 2026

I started the workflows

Co-authored-by: Steve Loughran <stevel@cloudera.com>
@nssalian nssalian force-pushed the variant-read-changes branch from 6c6db2e to 6f540f4 Compare April 15, 2026 18:05
@nssalian nssalian requested a review from steveloughran April 15, 2026 18:08
@nssalian nssalian changed the title WIP: Optimizing Variant read path with lazy caching Optimizing Variant read path with lazy caching Apr 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants