[upgrade-lifecycle] High-severity upgrade risks: stale-marker cleanup can remove active home and zero-backoff retry storms #13505

Findings

1. Stale upgrade marker after rollback-on-mark-failure can remove the active install on watcher cleanup

Priority: P0 (common managed/CLI upgrade path under IO failures)

Platform: Linux, macOS, Windows

Location

  • internal/pkg/agent/application/upgrade/upgrade.go:488-496
  • internal/pkg/agent/application/upgrade/upgrade.go:604-623
  • internal/pkg/agent/cmd/watch.go:274-283

Evidence

  • On marker-write path failure, upgrade rolls back install but does not clear marker:
    if err = u.markUpgrade(...); err != nil {
        rollbackErr := u.rollbackInstall(...)
        return nil, goerrors.Join(err, rollbackErr)
    }
  • rollbackInstall restores symlink/removes new install/clears TTL markers only; no marker cleanup:
    err = os.RemoveAll(newAgentInstallPath)
    ...
    err = rollbackSource.Set(nil)
  • If watcher later runs a successful watch path, cleanup keeps marker.VersionedHome (the new path), not the currently running home:
    newVersionedHome := marker.VersionedHome
    versionedHomesToKeep = append(versionedHomesToKeep, newVersionedHome)
    err = installModifier.Cleanup(..., versionedHomesToKeep...)

What is wrong
A post-markUpgrade failure (for example disk-full while updating active.commit) leaves a stale marker pointing at the new versioned home even though rollback removed that home. The next successful watcher cleanup can then delete the actually running old install when rollback TTL entries are absent.

Why it matters
This can turn a recoverable failed upgrade into a broken installation on next restart/cleanup, affecting normal upgrade flows after real-world IO failures.

Suggested fix direction

  1. On markUpgrade failure path, explicitly remove or downgrade the marker before returning.
  2. In watcher success cleanup, verify marker.VersionedHome exists; if missing, keep paths.VersionedHome(topDir) as the authoritative active home.
  3. Add a guard in cleanup to never remove the current runtime home.

Candidate failing test(s)

  • internal/pkg/agent/cmd/watch_test.go: marker points to missing VersionedHome; cleanup must preserve current paths.VersionedHome(topDir).
  • internal/pkg/agent/application/upgrade/upgrade_test.go: simulate markUpgrade failing after marker write and assert marker is removed/converted to safe terminal state.

2. Non-positive download retry backoff allows immediate/negative retry intervals (retry storm)

Priority: P1 (broad impact under transient download failures)

Platform: Linux, macOS, Windows

Location

  • internal/pkg/agent/application/upgrade/step_download.go:258-260
  • internal/pkg/agent/application/upgrade/artifact/config.go:57-61
  • internal/pkg/agent/application/upgrade/artifact/config.go:175-201

Evidence

  • Retry interval is set directly from config without validation:
    expBo := backoff.NewExponentialBackOff()
    expBo.InitialInterval = settings.RetrySleepInitDuration
  • Config unpack path accepts retry_sleep_init_duration but has no lower-bound check:
    RetrySleepInitDuration time.Duration `yaml:"retry_sleep_init_duration" ...`
    if err := cfg.Unpack(&tmp); err != nil { return err }
  • With a non-positive InitialInterval, cenkalti/backoff/v4 yields immediate (0s) or negative backoff durations, enabling a tight retry loop.

What is wrong
A policy value of 0s (or negative) for agent.download.retry_sleep_init_duration bypasses intended exponential waiting and can retry continuously when downloads fail.

Why it matters
During artifact outage or network issues, agents can hammer artifact endpoints and consume CPU/network aggressively instead of backing off.

Suggested fix direction

  1. Validate RetrySleepInitDuration > 0 during config unpack/reload.
  2. Clamp invalid values to a safe default (30s) and log a warning.
  3. Optionally enforce a minimum floor in downloadWithRetries as a final safety net.

Candidate failing test(s)

  • internal/pkg/agent/application/upgrade/step_download_test.go: set RetrySleepInitDuration=0 and assert retries do not occur immediately.
  • internal/pkg/agent/application/upgrade/artifact/config_test.go: invalid non-positive duration is rejected or normalized.

Priority ranking

  1. P0: stale-marker cleanup deleting active home after rollback-on-mark-failure.
  2. P1: non-positive retry backoff causing retry storms.

Upgrade paths audited and found safe

  • Marker load/no-marker fast path in watcher startup: internal/pkg/agent/cmd/watch.go:175-187.
  • Upgrade marker terminal-state shortcut cleanup path: internal/pkg/agent/cmd/watch.go:191-220.
  • Upgrade symlink rollback primitive restores target and clears rollback TTL source: internal/pkg/agent/application/upgrade/upgrade.go:604-623.
  • Manual rollback TTL tracking and filtering logic for valid rollback entries: internal/pkg/agent/application/upgrade/manual_rollback.go:254-275.

Note

🔒 Integrity filtering removed 2 items

Integrity filtering activated and filtered the following items during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.

  • issue:#unknown (search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)
  • #8176 (search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)

From workflow: Sweeper: Upgrade and Rollback Lifecycle

  • expires on Apr 14, 2026, 9:42 AM UTC
