Skip to content

disttae: add workspace safety valve to prevent OOM during HNSW index creation#24148

Open
aptend wants to merge 3 commits intomatrixorigin:mainfrom
aptend:fix/workspace-dump-safety-valve
Open

disttae: add workspace safety valve to prevent OOM during HNSW index creation#24148
aptend wants to merge 3 commits intomatrixorigin:mainfrom
aptend:fix/workspace-dump-safety-valve

Conversation

@aptend
Copy link
Copy Markdown
Contributor

@aptend aptend commented Apr 16, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24041

What this PR does / why we need it:

When IncrStatementID is disabled (e.g., HNSW index creation via RunSql with WithDisableIncrStatement), the primary workspace dump entry point (IncrStatementID → dumpBatchLocked(ctx, 0)) is skipped. The write-path dump (dumpBatchLocked(ctx, snapshotWriteOffset)) becomes the only check, but it only scans the current statement's writes from snapshotWriteOffset. Since each INSERT (~131MB) is below the quota-raised writeWorkspaceThreshold (136MB), dump never triggers. Memory accumulates unbounded until OOM.

Root cause: The workspace dump mechanism has three entry points designed for different roles:

  1. IncrStatementID (offset=0): Primary — full workspace scan at statement boundaries
  2. Write path (offset=snapshotWriteOffset): Safety net — windowed scan for single-statement overflow
  3. Commit (offset=-1): Final cleanup — dump everything

When the primary entry point is disabled, the safety net's windowed view cannot detect global accumulation. Combined with quota "free-riding" (only the first INSERT acquires quota, subsequent INSERTs bypass quota because each individual write < raised threshold), the workspace grows without bound.

Fix: Add a global safety valve (extraWorkspaceThreshold, default 500MB). When the write-path scan finds current writes below threshold but approximateInMemInsertSize >= extraWorkspaceThreshold:

  • Set offset=0 to dump from the beginning of the workspace
  • Bypass quota check entirely (force dump)
  • After dump: quota is released, thresholds reset, writes compacted

This ensures workspace memory stays bounded even when IncrStatementID is disabled.

Test: Added Test_WorkspaceForceDumpOnGlobalAccumulation that simulates the HNSW scenario (no IncrStatementID, repeated writes with advancing snapshotWriteOffset) and verifies:

  • The safety valve triggers at least once
  • In-memory size stays bounded
  • All written data is readable after commit

Copilot AI review requested due to automatic review settings April 16, 2026 10:47
@matrix-meow matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Apr 16, 2026
@mergify mergify bot added the kind/bug Something isn't working label Apr 16, 2026
@aptend aptend force-pushed the fix/workspace-dump-safety-valve branch from f840cea to 79317b0 Compare April 16, 2026 10:49
When the safety valve triggers from the write path, use the current
statement's start offset (offsets[statementID-1]) instead of 0. This
avoids compacting prior statements' entries, which could break the
offsets[] array used by RollbackLastStatement.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gouhongshen
Copy link
Copy Markdown
Contributor

🔍 Code Review Report — 七人陪审团

📊 概览

指标
审查文件数 5
总发现数 12 (去重后)
🔴 Must Fix 2
🟡 Should Fix 6
🟢 Nit 4
🧨 破坏性测试结论 不足

📝 总结

本 PR 在 dumpBatchLocked 中添加安全阀防止 HNSW 索引创建时 OOM,初衷正确但实现存在关键逻辑缺陷:安全阀触发后代码 fall-through 到 quota 检查,若 quota 获取成功则 return nil 不执行 dump,安全阀的"强制 dump"语义被架空。该路径在典型 HNSW 场景下不易触发(prior statements 已 dump → scanInMemInsertSize(stmtStart)approximateInMemInsertSizeextraWorkspaceThreshold → 跳过 quota),但存在理论可触发的边界条件(prior statements 有未 dump 残留 → size gap → quota 成功),违反安全阀的保证性契约。此外,测试覆盖严重不足——测试参数与真实 HNSW 场景不匹配,存在假绿风险。

🧨 破坏性测试审判

  • 结论: 不足
  • 已覆盖失败模式: size < writeWorkspaceThreshold + 安全阀触发 → dump 执行
  • 缺失失败模式:
    1. size >= writeWorkspaceThreshold + quota 失败 → 安全阀触发路径
    2. statementID == 0(HNSW 禁用 IncrStatementID 的真实场景)
    3. 安全阀触发后 quota 成功 → 是否跳过 dump(即 Must Fix Add README.md #1 的 bug 路径)
    4. RollbackLastStatement 在 force dump 后 offsets[] 的一致性
    5. quota acquire 后 writeWorkspaceThreshold 被永久提高 → 后续安全阀是否可再触发

🔴 Must Fix

1. 安全阀 fall-through 到 quota 检查,可能跳过 dump — dumpBatchLocked

类别: 逻辑缺陷 / 安全阀契约违反

问题: 安全阀触发后重设 offset=stmtStartsize=scanInMemInsertSize(stmtStart),但代码未跳过后续 quota 检查。若 scanInMemInsertSize(stmtStart) < extraWorkspaceThreshold(因 prior statements 有未 dump 残留),quota 可能获取成功 → return nil → 不 dump → 内存继续累积 → 安全阀失效。

典型 HNSW 场景下不易触发,但存在理论可触发的边界条件,且违反安全阀"强制 dump"的语义契约。

根治方案: 安全阀触发后,跳过 quota 检查,直接进入 dump 逻辑。推荐提取为独立函数使控制流显式清晰:

// 在安全阀分支内,直接跳到 dump 逻辑,不再 fall-through 到 quota 检查
if txn.approximateInMemInsertSize >= txn.engine.config.extraWorkspaceThreshold {
    stmtStart := 0
    if txn.statementID > 0 {
        stmtStart = txn.offsets[txn.statementID-1]
    }
    logutil.Info("WORKSPACE-FORCE-DUMP", ...)
    // 直接用 offset = stmtStart 进入 dump,跳过 quota 检查
    // 可用 goto、提取函数、或设置 forceDump flag
}

修复此问题后,uint64 下溢风险和参数篡改控制流异味自动消除。

2. 测试覆盖不匹配真实 HNSW 场景

类别: 测试覆盖缺口

问题: 当前测试 writeWorkspaceThreshold=100,每次写入 size < 100 → 走短路分支触发安全阀。但真实 HNSW 场景(每个 INSERT ~131MB)中 size >= writeWorkspaceThreshold → 走 quota-acquire 分支 → acquire 成功后 threshold 被抬高 → 后续写入走短路分支。测试只模拟了后半段,未模拟前半段(quota-acquire 分支)

此外:

  • statementID==0(真实 HNSW 场景 IncrStatementID 被 disable)边界条件未测试
  • RollbackLastStatement 在 force dump 后的 offsets[] 一致性未验证
  • forceDumpTriggered 通过间接推断而非直接验证

根治方案: Must Fix #1 修复后,新增测试覆盖:

  1. 安全阀触发后必定执行 dump(无论 quota 状态)
  2. statementID==0 边界
  3. 使用接近真实 HNSW 的参数

🟡 Should Fix

1. 安全阀路径 2×O(n) 精确扫描冗余

已有 O(1) approximateInMemInsertSize 可用于阈值判断,安全阀判断可直接用近似值,只在 dump 时由 dumpInsertBatchLocked 做一次遍历。

2. quota acquire 永久提高 writeWorkspaceThreshold

多次 quota 获取后 threshold 可超过 extraWorkspaceThreshold,使安全阀被永久绕过。Must Fix #1 修复后安全阀路径不受影响,但正常路径的 threshold 上升仍影响短路判断。

3. statementID==0 + stmtStart==0 边界条件未测试

真实 HNSW 场景 IncrStatementID 被 disable → statementID==0stmtStart==0,但测试中 statementID=1

4. RollbackLastStatement 在 force dump 后未测试

PR 注释声称避免破坏 offsets[],但无测试证据验证。

5. forceDumpTriggered 断言为间接推断

通过 inMemSize 下降推断而非直接验证安全阀标志位,存在假绿风险。

6. 补丁链过程异味

第二个 commit 修补第一个 commit 的 offset=0 错误。根治与 Must Fix #1 统一——将安全阀逻辑解耦为早期决策。


🟢 Nit

  1. WithExtraWorkspaceThreshold vs WithExtraWorkspaceThresholdQuota 命名混淆 — 建议重命名后者为 WithWorkspaceQuota
  2. scanInMemInsertSize 注释说 WHAT 不说 WHY
  3. extraWorkspaceThreshold 双重语义(quota上限 vs 全局硬限)— 长期应拆分,当前 PR 不强制
  4. 测试名称过于泛化

建议: 修复 Must Fix #1#2 后,此 PR 可以合并。Must Fix #1 的修复同时消除 3 个关联问题(下溢风险 + 参数篡改异味 + 补丁链根源),一处改动收效最大。

@aptend
Copy link
Copy Markdown
Contributor Author

aptend commented Apr 17, 2026

🔴 Must Fix

1. 安全阀 fall-through 到 quota 检查,可能跳过 dump — dumpBatchLocked

类别: 逻辑缺陷 / 安全阀契约违反

问题: 安全阀触发后重设 offset=stmtStartsize=scanInMemInsertSize(stmtStart),但代码未跳过后续 quota 检查。若 scanInMemInsertSize(stmtStart) < extraWorkspaceThreshold(因 prior statements 有未 dump 残留),quota 可能获取成功 → return nil → 不 dump → 内存继续累积 → 安全阀失效。

典型 HNSW 场景下不易触发,但存在理论可触发的边界条件,且违反安全阀"强制 dump"的语义契约。

根治方案: 安全阀触发后,跳过 quota 检查,直接进入 dump 逻辑。推荐提取为独立函数使控制流显式清晰:

// 在安全阀分支内,直接跳到 dump 逻辑,不再 fall-through 到 quota 检查
if txn.approximateInMemInsertSize >= txn.engine.config.extraWorkspaceThreshold {
    stmtStart := 0
    if txn.statementID > 0 {
        stmtStart = txn.offsets[txn.statementID-1]
    }
    logutil.Info("WORKSPACE-FORCE-DUMP", ...)
    // 直接用 offset = stmtStart 进入 dump,跳过 quota 检查
    // 可用 goto、提取函数、或设置 forceDump flag
}

修复此问题后,uint64 下溢风险和参数篡改控制流异味自动消除。

2. 测试覆盖不匹配真实 HNSW 场景

类别: 测试覆盖缺口

问题: 当前测试 writeWorkspaceThreshold=100,每次写入 size < 100 → 走短路分支触发安全阀。但真实 HNSW 场景(每个 INSERT ~131MB)中 size >= writeWorkspaceThreshold → 走 quota-acquire 分支 → acquire 成功后 threshold 被抬高 → 后续写入走短路分支。测试只模拟了后半段,未模拟前半段(quota-acquire 分支)

此外:

  • statementID==0(真实 HNSW 场景 IncrStatementID 被 disable)边界条件未测试
  • RollbackLastStatement 在 force dump 后的 offsets[] 一致性未验证
  • forceDumpTriggered 通过间接推断而非直接验证

根治方案: Must Fix #1 修复后,新增测试覆盖:

  1. 安全阀触发后必定执行 dump(无论 quota 状态)
  2. statementID==0 边界
  3. 使用接近真实 HNSW 的参数

fixed

Add forceDump flag so that when the safety valve triggers, the quota
check is bypassed entirely. This prevents a theoretical scenario where
quota acquisition succeeds and returns nil without dumping, defeating
the safety valve's guarantee.

Also add Test_WorkspaceForceDumpNoIncrStatement to cover the real HNSW
scenario where IncrStatementID is never called (statementID==0).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working size/M Denotes a PR that changes [100,499] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants