retry: add backoff and bounds to unbounded retry loops #24107
XuPeng-SH wants to merge 2 commits into matrixorigin:3.0-dev
Conversation
Three retry patterns in the codebase spin without sleep or upper bound, causing CPU waste and potential cascading failures when remote services are temporarily unavailable.

1. `lock_table_remote.go` `unlock()`/`getLock()`: add exponential backoff (100ms → 5s) with a 30s deadline to prevent tight-loop CPU spinning when the remote lock service is unreachable.
2. `txn/service/service.go` `parallelSendWithRetry()`: add exponential backoff (100ms → 1s) between `Send` failures, reset on success. Already bounded by `ctx.Done()`, but the tight loop hammers the sender unnecessarily during outages.
3. `lockop/lock_op.go` `getRetryWaitDuration()`: fix the budget ≤ 0 guard from PR matrixorigin#24105 so that disabling the backend retry budget (budget=0) only prevents retries for backend availability errors (`ErrBackendCannotConnect`, `ErrBackendClosed`, etc.), not for all errors. Normal retryable errors like `ErrLockTableBindChanged` still retry normally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
XuPeng-SH
left a comment
Self-Review
Overall: Clean — all three fixes are well-scoped and address real CPU-spinning hotspots
Changes reviewed:
- `lock_table_remote.go`: exponential backoff + 30s deadline for `unlock()`/`getLock()`
- `txn/service/service.go`: context-aware backoff for `parallelSendWithRetry()`
- `lockop/lock_op.go`: budget=0 guard fix + new test
Issues found and fixed during review
1. [MEDIUM → FIXED] parallelSendWithRetry: time.Sleep was not context-aware
The original code used `time.Sleep(backoff)`, which blocks for up to 1s even if ctx is cancelled. Since this runs in TN recovery and rollback-forwarding paths, prompt cancellation on shutdown matters.
Fixed by replacing it with a `time.NewTimer` + `select` pattern that exits immediately on context cancellation.
2. [LOW → FIXED] unlock/getLock error logs lacked identifiers
The "retry budget exhausted" log messages carried only the budget duration. Added table-id and txn fields for debuggability.
Design note: unlock "must ensure" contract vs 30s deadline
The comment at line 176 says "unlock cannot fail and must ensure that all locks have been released." The 30s deadline weakens this guarantee. In practice, handleError → getLockTableBind detects bind changes well before 30s (the allocator reassigns the lock table, and bindChangedHandler releases local state). The 30s is a safety net for the edge case where:
- Remote lock service is unreachable
- Allocator still reports the same bind (hasn't timed out the service yet)
- `getLockTableBind` also fails (returns the old bind with an error)
In this pathological case, spinning indefinitely at 100% CPU is worse than giving up after 30s. The lock table bind will eventually change via the allocator's health checks, at which point the lock cleanup happens through the normal bind-change path.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What type of PR is this?
Which issue(s) this PR fixes
Follow-up to #24105. Addresses remaining unbounded retry loops discovered during that review.
What this PR does
Three retry patterns in the codebase spin without sleep or upper bound, causing CPU waste and potential cascading failures when remote services are temporarily unavailable:
1. `lock_table_remote.go`: `unlock()`/`getLock()` tight loops

Both methods retry in a bare `for {}` loop with NO sleep between iterations. When `handleError()` returns a non-nil error (remote lock service unreachable, bind unchanged), the loop spins at full CPU speed.

Fix: Add exponential backoff (100ms → 5s cap) with a 30-second maximum retry duration. If the bind changes (`handleError` returns nil), the loop still exits immediately as before.

2. `txn/service/service.go`: `parallelSendWithRetry()` tight loop

Already bounded by `ctx.Done()`, but the `continue` on `sender.Send()` failure loops again immediately with no sleep, hammering the sender during outages.

Fix: Add exponential backoff (100ms → 1s cap), reset on success.

3. `lockop/lock_op.go`: `getRetryWaitDuration()` budget=0 guard

PR #24105 introduced `defaultMaxWaitTimeOnRetryBackendLock` to bound backend error retries. However, when this budget is set to 0 (disabled), the guard `return 0, false` disables ALL retries, including normal retryable errors like `ErrLockTableBindChanged`.

Fix: When budget ≤ 0, only fail fast for backend availability errors (`isBoundedRetryLockError`). Normal retryable errors still use `defaultWaitTimeOnRetryLock`. Added a test to verify `ErrLockTableBindChanged` still retries normally when budget=0.

Test verification
- `TestLockWithRetry*` tests pass (11/11)
- `TestLockWithRetryStillRetriesBindChangeWhenBackendBudgetDisabled` test added and passing
- `Test_parallelSendWithRetry` passes
- `TestRemoteLockFailedInRollingRestartCN` passes
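For illustration, the budget ≤ 0 guard described in item 3 could look roughly like this. The error values and the helper below are stand-ins, not the real `moerr` codes or the PR's exact signatures:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Sentinel errors standing in for the real error codes that
// isBoundedRetryLockError classifies as backend availability errors.
var (
	errBackendCannotConnect = errors.New("backend cannot connect")
	errLockTableBindChanged = errors.New("lock table bind changed")
)

const defaultWaitTimeOnRetryLock = 50 * time.Millisecond

func isBoundedRetryLockError(err error) bool {
	return errors.Is(err, errBackendCannotConnect)
}

// getRetryWaitDuration sketches the fixed guard: with budget <= 0,
// only backend availability errors fail fast; normal retryable errors
// such as bind-changed still retry with the default wait.
func getRetryWaitDuration(err error, budget time.Duration) (time.Duration, bool) {
	if budget <= 0 {
		if isBoundedRetryLockError(err) {
			return 0, false // backend retries explicitly disabled
		}
		return defaultWaitTimeOnRetryLock, true
	}
	// budget > 0: the bounded backend-retry path is elided in this sketch.
	return defaultWaitTimeOnRetryLock, true
}

func main() {
	_, retryBackend := getRetryWaitDuration(errBackendCannotConnect, 0)
	_, retryBind := getRetryWaitDuration(errLockTableBindChanged, 0)
	fmt.Println(retryBackend, retryBind) // prints: false true
}
```

This is exactly the behavior the new test asserts: with budget=0, a bind-change error still yields a retry while a backend availability error does not.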