
retry: add backoff and bounds to unbounded retry loops#24107

Open
XuPeng-SH wants to merge 2 commits into matrixorigin:3.0-dev from XuPeng-SH:fix-unbounded-retries

Conversation

@XuPeng-SH
Contributor

What type of PR is this?

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance optimization
  • Refactoring (no functional changes)
  • Code style update
  • Build related changes
  • CI related changes
  • Documentation update
  • Other

Which issue(s) this PR fixes

Follow-up to #24105. Addresses remaining unbounded retry loops discovered during that review.

What this PR does

Three retry patterns in the codebase spin without sleep or upper bound, causing CPU waste and potential cascading failures when remote services are temporarily unavailable:

1. lock_table_remote.go unlock() / getLock() tight loops

Both methods retry in a bare for {} loop with no sleep between iterations. When handleError() returns a non-nil error (remote lock service unreachable, bind unchanged), the loop spins at full CPU speed.

Fix: Add exponential backoff (100ms → 5s cap) with a 30-second maximum retry duration. If the bind changes (handleError returns nil), the loop still exits immediately as before.

2. txn/service/service.go parallelSendWithRetry() tight loop

Already bounded by ctx.Done(), but the continue on sender.Send() failure loops immediately with no sleep, hammering the sender during outages.

Fix: Add exponential backoff (100ms → 1s cap), reset on success.

3. lockop/lock_op.go getRetryWaitDuration() budget=0 guard

PR #24105 introduced defaultMaxWaitTimeOnRetryBackendLock to bound backend error retries. However, when this budget is set to 0 (disabled), the guard's return 0, false disables ALL retries — including normal retryable errors like ErrLockTableBindChanged.

Fix: When budget ≤ 0, only fail-fast for backend availability errors (isBoundedRetryLockError). Normal retryable errors still use defaultWaitTimeOnRetryLock. Added a test to verify ErrLockTableBindChanged still retries normally when budget=0.
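The corrected guard logic can be sketched like this. The error values, the classifier body, and the constant values here are stand-ins (the real ErrBackendCannotConnect etc. live in matrixone's moerr package, and the real function handles more cases); only the shape of the budget ≤ 0 branch reflects the fix described above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Stand-ins for the real error values named in the PR description.
var (
	errBackendCannotConnect = errors.New("backend cannot connect")
	errLockTableBindChanged = errors.New("lock table bind changed")
)

// Assumed default wait; the real value may differ.
const defaultWaitTimeOnRetryLock = time.Second

// isBoundedRetryLockError reports whether err is a backend availability
// error whose retries are bounded by the backend retry budget.
func isBoundedRetryLockError(err error) bool {
	return errors.Is(err, errBackendCannotConnect)
}

// getRetryWaitDuration sketches the fixed guard: with the backend budget
// disabled (<= 0), only backend availability errors fail fast; normal
// retryable errors keep the default wait and keep retrying.
func getRetryWaitDuration(err error, budget time.Duration) (time.Duration, bool) {
	if budget <= 0 {
		if isBoundedRetryLockError(err) {
			return 0, false // backend retries disabled: fail fast
		}
		return defaultWaitTimeOnRetryLock, true // normal errors still retry
	}
	// With a positive budget the real code tracks elapsed time against it;
	// omitted here for brevity.
	return defaultWaitTimeOnRetryLock, true
}

func main() {
	d1, ok1 := getRetryWaitDuration(errBackendCannotConnect, 0)
	d2, ok2 := getRetryWaitDuration(errLockTableBindChanged, 0)
	fmt.Println("backend error:", d1, ok1)
	fmt.Println("bind changed: ", d2, ok2)
}
```

The key point is that the old guard returned (0, false) for every error whenever budget ≤ 0, conflating "backend retries disabled" with "all retries disabled".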

Test verification

  • All existing TestLockWithRetry* tests pass (11/11)
  • New TestLockWithRetryStillRetriesBindChangeWhenBackendBudgetDisabled test added and passing
  • Test_parallelSendWithRetry passes
  • TestRemoteLockFailedInRollingRestartCN passes
  • Full build succeeds

Three retry patterns in the codebase spin without sleep or upper bound,
causing CPU waste and potential cascading failures when remote services
are temporarily unavailable.

1. lock_table_remote.go unlock()/getLock(): add exponential backoff
   (100ms->5s) with 30s deadline to prevent tight-loop CPU spinning
   when the remote lock service is unreachable.

2. txn/service/service.go parallelSendWithRetry(): add exponential
   backoff (100ms->1s) between Send failures, reset on success.
   Already bounded by ctx.Done(), but the tight loop hammers the
   sender unnecessarily during outages.

3. lockop/lock_op.go getRetryWaitDuration(): fix the budget<=0 guard
   from PR matrixorigin#24105 so that disabling backend retry budget (budget=0)
   only prevents retries for backend availability errors
   (ErrBackendCannotConnect, ErrBackendClosed, etc), not for all
   errors. Normal retryable errors like ErrLockTableBindChanged still
   retry normally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@XuPeng-SH force-pushed the fix-unbounded-retries branch from 95d68d1 to 69a727f on April 9, 2026 12:04
Contributor Author

@XuPeng-SH left a comment


Self-Review

Overall: Clean — all three fixes are well-scoped and address real CPU-spinning hotspots

Changes reviewed:

  • lock_table_remote.go: exponential backoff + 30s deadline for unlock()/getLock()
  • txn/service/service.go: context-aware backoff for parallelSendWithRetry()
  • lockop/lock_op.go: budget=0 guard fix + new test

Issues found and fixed during review

1. [MEDIUM → FIXED] parallelSendWithRetry: time.Sleep was not context-aware

Original code used time.Sleep(backoff) which blocks for up to 1s even if ctx is cancelled. Since this is used in TN recovery and rollback forwarding paths, prompt cancellation on shutdown matters.

Fixed by replacing with time.NewTimer+select pattern that exits immediately on context cancellation.

2. [LOW → FIXED] unlock/getLock error logs lacked identifiers

The "retry budget exhausted" log messages only had budget duration. Added table-id and txn fields for debuggability.

Design note: unlock "must ensure" contract vs 30s deadline

The comment at line 176 says "unlock cannot fail and must ensure that all locks have been released." The 30s deadline weakens this guarantee. In practice, the handleError → getLockTableBind path detects bind changes well before 30s (the allocator reassigns the lock table, and bindChangedHandler releases local state). The 30s is a safety net for the edge case where:

  • Remote lock service is unreachable
  • Allocator still reports the same bind (hasn't timed out the service yet)
  • getLockTableBind also fails (returns oldError)

In this pathological case, spinning indefinitely at 100% CPU is worse than giving up after 30s. The lock table bind will eventually change via the allocator's health checks, at which point the lock cleanup happens through the normal bind-change path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Labels

kind/bug (Something isn't working), size/M (Denotes a PR that changes [100,499] lines)
