v4.0.0

@lance-community released this 30 Mar 18:08 · 161 commits to main since this release

What's Changed

Breaking Changes 🛠

New Features 🎉

  • feat: compress complex all null by @yingjianwu98 in #4990
  • feat: expose use_scalar_index param in Java scanner by @xloya in #5487
  • feat: add file list with sizes to IndexMetadata by @wjones127 in #5497
  • feat(compaction): add Python config for defer_index_remap by @zhangyue19921010 in #5691
  • feat(core): add Levenshtein-based suggestions to not-found errors in schema by @HemantSudarshan in #5976
  • feat: add URI-based commit support to Java SDK by @hamersaw in #5978
  • fix: concurrent read and write to directory namespace by @jackye1995 in #5983
  • feat: add ability to pass custom headers to objectstore requests by @hamersaw in #5989
  • feat: add DeleteResult with num_deleted_rows by @wkalt in #6001
  • feat: introduce IncompatibleTransaction error by @wjones127 in #6003
  • feat(cleanup): add more metrics to RemovalStats by @zhangyue19921010 in #6025
  • feat(java): expose prefilter parameter to support vector search with fragments by @nyl3532016 in #6040
  • feat: surface ambiguous merge insert error as InvalidInput by @wjones127 in #6048
  • feat(blob): distribute blob sidecar keys with reversed binary ids by @Xuanwo in #6060
  • feat: handle JSONB literals in Lance SQL planner by @wkalt in #6061
  • feat(java): expose Dataset.dropIndex method to drop specific index by @fangbo in #6065
  • feat(blob): map external blob URIs to multi-base base ids by @Xuanwo in #6066
  • feat: add env toggle for repetition index cache on read by @Xuanwo in #6069
  • feat(compaction): single reserve_fragment_ids after rewriting files by @hamersaw in #6072
  • feat: expose compaction binary copy configuration through python and java SDKs by @hamersaw in #6074
  • feat(cleanup): support rate limiter for cleanup operation by @zhangyue19921010 in #6084
  • feat: mark 2.2 as stable and add 2.3 as the next file format version by @Xuanwo in #6088
  • feat: support prewarm for IVF-based ANN indices by @wjones127 in #6090
  • feat: add skip_transpose flag to vector index builders by @BubbleCal in #6114
  • feat: enable HNSW-accelerated partition assignment for fp16 vectors by @wkalt in #6119
  • feat: clearer progress reporting for IVF by @wkalt in #6126
  • feat: support vector indices in describe_indices filtering by @ndpvt-web in #6145
  • feat: reduce open file handles during IVF training by @westonpace in #6169
  • feat: add compaction options in manifest config by @hamersaw in #6170
  • feat: support atomic multi-table transactions via namespace manifest by @XuQianJin-Stars in #6173
  • feat: add abfss:// scheme support for Azure ADLS Gen2 by @burlacio in #6192
  • feat: bounding source fragments for compaction execution by @hamersaw in #6232
  • fix: filter out detached versions when scanning manifests by @jackye1995 in #6245
  • feat: allow setting transaction properties in various operations by @jackye1995 in #6246
  • feat: add OpenDAL Azdls backend for abfss:// with use_opendal flag by @burlacio in #6256

Bug Fixes 🐛

  • fix(java): transaction fatal bug in java transaction api by @wojiaodoubao in #5824
  • fix: maintaining individual fragment operation when calling take_source by @hamersaw in #5844
  • fix(encoding): handle empty rows in variable packed struct decode by @Xuanwo in #5995
  • fix: various bugs to namespace access by @jackye1995 in #5996
  • fix: set namespace commit handler for LanceDataset.commit by @jackye1995 in #6002
  • fix: fast_search limits full text search to indexed fragments by @BubbleCal in #6006
  • fix: fast_search should ignore any unindexed data for vector search by @BubbleCal in #6007
  • fix: correctly calculate max visible level when a list has no def by @westonpace in #6008
  • perf: avoid oversized variable buffers in full-zip scan batches by @Xuanwo in #6013
  • fix: make overwrites retryable instead of compatible by @jackye1995 in #6014
  • fix(python): avoid interpreter shutdown panic in BackgroundExecutor by @Xuanwo in #6023
  • fix: filter stale row IDs in TakeExec for FTS/vector after delete by @wkalt in #6042
  • fix(btree): include null pages in non-IsNull queries for correct thre… by @wkalt in #6043
  • fix: handle list-level NULLs in NOT filters by @fenfeng9 in #6044
  • fix: allowing headers for static configuration to be consistent by @hamersaw in #6045
  • fix: bitmap iterator exhaustion in mask_to_offset_ranges by @wkalt in #6046
  • fix(build): add Android aarch64 support to lance-linalg by @dardourimohamed in #6057
  • fix: make blob v2 reads base-aware in multi-base datasets by @Xuanwo in #6064
  • fix(lance-linalg): fix missing return value in u8x16::bit_and for non-x86_64/aarch64 targets by @cheungxi in #6068
  • fix: resolve Python lint failure on main by @Xuanwo in #6073
  • fix: restore main CI by formatting take_blob imports by @Xuanwo in #6082
  • fix: incorrect deletion masking in DatasetPreFilter by @cijiugechu in #6083
  • fix: avoid thread pool contention between compression and write operations during FTS indexing by @BubbleCal in #6085
  • fix: compile error for err_express by @zhangyue19921010 in #6094
  • fix(python): crash when schema contains nested fixed_size_list or extension type by @erandagan in #6107
  • fix: don't sample if no vectors are needed by @westonpace in #6110
  • fix(index): preserve stable row-id entries during scalar index optimize by @acking-you in #6117
  • fix: disallow wrapping auto-detected fsst in other compression by @hamersaw in #6120
  • fix: pin substrait to 0.62.2 until DF supports 0.62.3 by @westonpace in #6121
  • fix: vector index type shown as unknown in describe_indices by @jackye1995 in #6122
  • fix: handle inverted index worker exits during dispatch by @BubbleCal in #6129
  • fix: add missing type hint for producer function by @Gallardot in #6133
  • fix: prevent duplicate manifest entries from concurrent table creation by @jmhsieh in #6143
  • fix: replace fetch_arrow_table with to_arrow_table by @BubbleCal in #6146
  • fix: preserve merge insert delete-by-source semantics by @Xuanwo in #6148
  • fix: handle DataType::Null in adjust_child_validity to prevent panic by @wjones127 in #6160
  • fix: persist frag reuse index external file on local filesystem by @wjones127 in #6163
  • fix: avoid empty range reads for zero-length blobs by @Xuanwo in #6168
  • fix: handle nullable validity layers without def levels by @Xuanwo in #6187
  • fix: like queries with a prefix should be accelerated by btree and zonemap by @jackye1995 in #6188
  • fix: use to_arrow_reader in benchmark datagen by @Xuanwo in #6190
  • fix: disallowing stale credentials from directory namespace by @hamersaw in #6194
  • fix: memory_limit and num_workers params are not passed to index worker by @BubbleCal in #6197
  • fix: preserve create index transaction semantics by @Xuanwo in #6204
  • fix: allow same field name with different type in dataset overwrites by @hamersaw in #6206
  • fix: prewarm all segments for named indices by @Xuanwo in #6211
  • fix: respect the old data filter on inverted index by @westonpace in #6216
  • fix: 2.1/2.2 panic when a list column had small values and many empty values by @westonpace in #6234
  • fix: resolve_latest_location converts errors to not_found unconditionally by @wkalt in #6248
  • fix: return errors for unsupported fixed-size-list child types by @myandpr in #6253
  • fix: adding namespace support to java SDK CommitBuilder from dataset by @hamersaw in #6257
  • fix: pass dataset_options to SafeLanceDataset in worker processes by @eddyxu in #6278

Documentation 📚

  • docs: fix incorrect URLs and cleanup by @prrao87 in #5317
  • docs: expand the FTS index doc explaining the training process and multiple partitions by @westonpace in #5988
  • docs: clarify v2.2 nested drop rollback risk by @Xuanwo in #5999
  • docs: require data_storage_version=2.2 in map type example by @Xuanwo in #6032
  • docs: update file versioning matrix for 2.2 rollout by @Xuanwo in #6033
  • docs: reorganize blob docs around blob v2 and clarify legacy compatibility by @Xuanwo in #6034
  • docs: align 2.2 encoding docs and nested add-column notes by @Xuanwo in #6038
  • docs: clarify how to generate TPCH benchmark dataset locally by @Xuanwo in #6063
  • docs: document vector index RAM (training) & storage requirements by @westonpace in #6108
  • docs: update index.md to fix indexes to indices for uniformity by @wombatu-kun in #6113
  • docs: document the rules for transaction conflicts by @westonpace in #6158
  • docs: add alicloud oss configuration by @FarmerChillax in #6167
  • docs: update the rules for data replacement conflicts to reflect reality by @westonpace in #6182
  • docs: add example to show how to index JSON column by @prrao87 in #6208
  • docs: remove legacy preview index note by @Xuanwo in #6218

Performance Improvements 🚀

  • perf: pre-transpose PQ codebook for SIMD-friendly L2 distance by @wkalt in #5923
  • perf: speed up format v2.2 scans by adding shortcut for full page by @Xuanwo in #5981
  • perf: speed up format 2.2 300% by spawning structural decode batch tasks by @Xuanwo in #5982
  • perf: reduce peak memory during cosine IVF-PQ index training by @wkalt in #6016
  • perf: fast rotation for RQ quantization by @BubbleCal in #6024
  • perf: avoid re-open shard indices and small reads by @BubbleCal in #6026
  • perf: disable auto FSST for binary fields by @Xuanwo in #6047
  • perf: speedup flat fts by @westonpace in #6054
  • perf: add dict-values compression controls with lz4 default by @Xuanwo in #6059
  • perf: avoid frequent allocating when computing residual vectors by @BubbleCal in #6062
  • perf: add take_blob benchmark with cache_repetition_index matrix by @Xuanwo in #6067
  • perf: parallelize FTS prewarming by @BubbleCal in #6144
  • perf: remove shard content key sorting from distributed merge by @Xuanwo in #6179
  • perf(inverted): reuse posting batch builder and merge tail partitions by @BubbleCal in #6191
  • perf: reuse distance calculator at selecting candidates by @BubbleCal in #6202
  • perf: new layout for positions and new algo for phrase query by @BubbleCal in #6203
  • perf: batched WAND and new WAND structure, ~50% faster by @BubbleCal in #6241

Other Changes

  • refactor: use dict entries and encoded size instead of cardinality for dict decision by @Xuanwo in #5891
  • refactor: upgrade to SNAFU 0.9 by @shepmaster in #6071
  • refactor: overhaul AGENTS.md with PR review insights by @Xuanwo in #6103
  • refactor: use the dataset file version to determine index file version by @westonpace in #6142
  • refactor: rename arrow-scalar to lance-arrow-scalar by @westonpace in #6199
  • refactor: distributed vector segment build by @Xuanwo in #6220

Full Changelog: release-root/4.0.0-beta.N...v4.0.0