[upgrade-lifecycle] High-severity upgrade risks: stale-marker cleanup can remove active home and zero-backoff retry storms #13505

Findings

1. Stale upgrade marker after rollback-on-mark-failure can remove the active install on watcher cleanup

Priority: P0 (common managed/CLI upgrade path under IO failures)

Platform: Linux, macOS, Windows

Location

  • internal/pkg/agent/application/upgrade/upgrade.go:488-496
  • internal/pkg/agent/application/upgrade/upgrade.go:604-623
  • internal/pkg/agent/cmd/watch.go:274-283

Evidence

  • On marker-write path failure, upgrade rolls back install but does not clear marker:
    if err = u.markUpgrade(...); err != nil {
        rollbackErr := u.rollbackInstall(...)
        return nil, goerrors.Join(err, rollbackErr)
    }
  • rollbackInstall restores symlink/removes new install/clears TTL markers only; no marker cleanup:
    err = os.RemoveAll(newAgentInstallPath)
    ...
    err = rollbackSource.Set(nil)
  • If watcher later runs a successful watch path, cleanup keeps marker.VersionedHome (the new path), not the currently running home:
    newVersionedHome := marker.VersionedHome
    versionedHomesToKeep = append(versionedHomesToKeep, newVersionedHome)
    err = installModifier.Cleanup(..., versionedHomesToKeep...)

What is wrong
A post-markUpgrade failure (for example disk-full while updating active.commit) leaves a stale marker pointing at the new versioned home even though rollback removed that home. The next successful watcher cleanup can then delete the actually running old install when rollback TTL entries are absent.

Why it matters
This can turn a recoverable failed upgrade into a broken installation on next restart/cleanup, affecting normal upgrade flows after real-world IO failures.

Suggested fix direction

  1. On markUpgrade failure path, explicitly remove or downgrade the marker before returning.
  2. In watcher success cleanup, verify marker.VersionedHome exists; if missing, keep paths.VersionedHome(topDir) as the authoritative active home.
  3. Add a guard in cleanup to never remove the current runtime home.

Candidate failing test(s)

  • internal/pkg/agent/cmd/watch_test.go: marker points to missing VersionedHome; cleanup must preserve current paths.VersionedHome(topDir).
  • internal/pkg/agent/application/upgrade/upgrade_test.go: simulate markUpgrade failing after marker write and assert marker is removed/converted to safe terminal state.

2. Non-positive download retry backoff allows immediate/negative retry intervals (retry storm)

Priority: P1 (broad impact under transient download failures)

Platform: Linux, macOS, Windows

Location

  • internal/pkg/agent/application/upgrade/step_download.go:258-260
  • internal/pkg/agent/application/upgrade/artifact/config.go:57-61
  • internal/pkg/agent/application/upgrade/artifact/config.go:175-201

Evidence

  • Retry interval is set directly from config without validation:
    expBo := backoff.NewExponentialBackOff()
    expBo.InitialInterval = settings.RetrySleepInitDuration
  • Config unpack path accepts retry_sleep_init_duration but has no lower-bound check:
    RetrySleepInitDuration time.Duration `yaml:"retry_sleep_init_duration" ...`
    if err := cfg.Unpack(&tmp); err != nil { return err }
  • With a non-positive InitialInterval, cenkalti/backoff/v4 yields immediate (0s) or negative backoff durations, enabling a tight retry loop.

What is wrong
A policy value of 0s (or negative) for agent.download.retry_sleep_init_duration bypasses intended exponential waiting and can retry continuously when downloads fail.

Why it matters
During artifact outage or network issues, agents can hammer artifact endpoints and consume CPU/network aggressively instead of backing off.

Suggested fix direction

  1. Validate RetrySleepInitDuration > 0 during config unpack/reload.
  2. Clamp invalid values to a safe default (30s) and log a warning.
  3. Optionally enforce a minimum floor in downloadWithRetries as a final safety net.

Candidate failing test(s)

  • internal/pkg/agent/application/upgrade/step_download_test.go: set RetrySleepInitDuration=0 and assert retries do not occur immediately.
  • internal/pkg/agent/application/upgrade/artifact/config_test.go: invalid non-positive duration is rejected or normalized.

Priority ranking

  1. P0: stale-marker cleanup deleting active home after rollback-on-mark-failure.
  2. P1: non-positive retry backoff causing retry storms.

Upgrade paths audited and found safe

  • Marker load/no-marker fast path in watcher startup: internal/pkg/agent/cmd/watch.go:175-187.
  • Upgrade marker terminal-state shortcut cleanup path: internal/pkg/agent/cmd/watch.go:191-220.
  • Upgrade symlink rollback primitive restores target and clears rollback TTL source: internal/pkg/agent/application/upgrade/upgrade.go:604-623.
  • Manual rollback TTL tracking and filtering logic for valid rollback entries: internal/pkg/agent/application/upgrade/manual_rollback.go:254-275.

Note

🔒 Integrity filtering removed 2 items

Integrity filtering activated and filtered the following items during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.

  • issue:#unknown (search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)
  • #8176 (search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)

From workflow: Sweeper: Upgrade and Rollback Lifecycle

  • expires on Apr 14, 2026, 9:42 AM UTC
