
retry: add backoff and bounds to unbounded retry loops#24107

Open
XuPeng-SH wants to merge 2 commits into matrixorigin:3.0-dev from XuPeng-SH:fix-unbounded-retries

Conversation

@XuPeng-SH
Contributor

What type of PR is this?

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance optimization
  • Refactoring (no functional changes)
  • Code style update
  • Build related changes
  • CI related changes
  • Documentation update
  • Other

Which issue(s) this PR fixes

Follow-up to #24105. Addresses remaining unbounded retry loops discovered during that review.

What this PR does

Three retry patterns in the codebase spin without sleep or upper bound, causing CPU waste and potential cascading failures when remote services are temporarily unavailable:

1. lock_table_remote.go unlock() / getLock() tight loops

Both methods retry in a bare for {} loop with no sleep between iterations. When handleError() returns a non-nil error (remote lock service unreachable, bind unchanged), the loop spins at full CPU speed.

Fix: Add exponential backoff (100ms → 5s cap) with a 30-second maximum retry duration. If the bind changes (handleError returns nil), the loop still exits immediately as before.

2. txn/service/service.go parallelSendWithRetry() tight loop

Already bounded by ctx.Done(), but the continue on sender.Send() failure loops immediately with no sleep, hammering the sender during outages.

Fix: Add exponential backoff (100ms → 1s cap), reset on success.

3. lockop/lock_op.go getRetryWaitDuration() budget=0 guard

PR #24105 introduced defaultMaxWaitTimeOnRetryBackendLock to bound backend error retries. However, when this budget is set to 0 (disabled), the guard's return 0, false disables ALL retries — including normal retryable errors like ErrLockTableBindChanged.

Fix: When budget ≤ 0, only fail-fast for backend availability errors (isBoundedRetryLockError). Normal retryable errors still use defaultWaitTimeOnRetryLock. Added a test to verify ErrLockTableBindChanged still retries normally when budget=0.
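The corrected guard logic can be sketched like this. The error values, the classifier body, and the constant values here are stand-ins (the real ErrBackendCannotConnect etc. live in matrixone's moerr package, and the real function handles more cases); only the shape of the budget ≤ 0 branch reflects the fix described above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Stand-ins for the real error values named in the PR description.
var (
	errBackendCannotConnect = errors.New("backend cannot connect")
	errLockTableBindChanged = errors.New("lock table bind changed")
)

// Assumed default wait; the real value may differ.
const defaultWaitTimeOnRetryLock = time.Second

// isBoundedRetryLockError reports whether err is a backend availability
// error whose retries are bounded by the backend retry budget.
func isBoundedRetryLockError(err error) bool {
	return errors.Is(err, errBackendCannotConnect)
}

// getRetryWaitDuration sketches the fixed guard: with the backend budget
// disabled (<= 0), only backend availability errors fail fast; normal
// retryable errors keep the default wait and keep retrying.
func getRetryWaitDuration(err error, budget time.Duration) (time.Duration, bool) {
	if budget <= 0 {
		if isBoundedRetryLockError(err) {
			return 0, false // backend retries disabled: fail fast
		}
		return defaultWaitTimeOnRetryLock, true // normal errors still retry
	}
	// With a positive budget the real code tracks elapsed time against it;
	// omitted here for brevity.
	return defaultWaitTimeOnRetryLock, true
}

func main() {
	d1, ok1 := getRetryWaitDuration(errBackendCannotConnect, 0)
	d2, ok2 := getRetryWaitDuration(errLockTableBindChanged, 0)
	fmt.Println("backend error:", d1, ok1)
	fmt.Println("bind changed: ", d2, ok2)
}
```

The key point is that the old guard returned (0, false) for every error whenever budget ≤ 0, conflating "backend retries disabled" with "all retries disabled".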

Test verification

  • All existing TestLockWithRetry* tests pass (11/11)
  • New TestLockWithRetryStillRetriesBindChangeWhenBackendBudgetDisabled test added and passing
  • Test_parallelSendWithRetry passes
  • TestRemoteLockFailedInRollingRestartCN passes
  • Full build succeeds

Three retry patterns in the codebase spin without sleep or upper bound,
causing CPU waste and potential cascading failures when remote services
are temporarily unavailable.

1. lock_table_remote.go unlock()/getLock(): add exponential backoff
   (100ms->5s) with 30s deadline to prevent tight-loop CPU spinning
   when the remote lock service is unreachable.

2. txn/service/service.go parallelSendWithRetry(): add exponential
   backoff (100ms->1s) between Send failures, reset on success.
   Already bounded by ctx.Done(), but the tight loop hammers the
   sender unnecessarily during outages.

3. lockop/lock_op.go getRetryWaitDuration(): fix the budget<=0 guard
   from PR matrixorigin#24105 so that disabling backend retry budget (budget=0)
   only prevents retries for backend availability errors
   (ErrBackendCannotConnect, ErrBackendClosed, etc), not for all
   errors. Normal retryable errors like ErrLockTableBindChanged still
   retry normally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@XuPeng-SH force-pushed the fix-unbounded-retries branch from 95d68d1 to 69a727f on April 9, 2026 12:04
Contributor Author

@XuPeng-SH left a comment


Self-Review

Overall: Clean — all three fixes are well-scoped and address real CPU-spinning hotspots

Changes reviewed:

  • lock_table_remote.go: exponential backoff + 30s deadline for unlock()/getLock()
  • txn/service/service.go: context-aware backoff for parallelSendWithRetry()
  • lockop/lock_op.go: budget=0 guard fix + new test

Issues found and fixed during review

1. [MEDIUM → FIXED] parallelSendWithRetry: time.Sleep was not context-aware

Original code used time.Sleep(backoff) which blocks for up to 1s even if ctx is cancelled. Since this is used in TN recovery and rollback forwarding paths, prompt cancellation on shutdown matters.

Fixed by replacing with time.NewTimer+select pattern that exits immediately on context cancellation.

2. [LOW → FIXED] unlock/getLock error logs lacked identifiers

The "retry budget exhausted" log messages only had budget duration. Added table-id and txn fields for debuggability.

Design note: unlock "must ensure" contract vs 30s deadline

The comment at line 176 says "unlock cannot fail and must ensure that all locks have been released." The 30s deadline weakens this guarantee. In practice, the handleError → getLockTableBind path detects bind changes well before 30s (the allocator reassigns the lock table, and bindChangedHandler releases local state). The 30s is a safety net for the edge case where:

  • Remote lock service is unreachable
  • Allocator still reports the same bind (hasn't timed out the service yet)
  • getLockTableBind also fails (returns oldError)

In this pathological case, spinning indefinitely at 100% CPU is worse than giving up after 30s. The lock table bind will eventually change via the allocator's health checks, at which point the lock cleanup happens through the normal bind-change path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Labels

kind/bug (Something isn't working), size/M (Denotes a PR that changes [100,499] lines)
