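Findings
1. Stale upgrade marker after rollback-on-mark-failure can remove the active install on watcher cleanup
Priority: P0 (common managed/CLI upgrade path under IO failures)
Platform: Linux, macOS, Windows
Location
internal/pkg/agent/application/upgrade/upgrade.go:488-496
internal/pkg/agent/application/upgrade/upgrade.go:604-623
internal/pkg/agent/cmd/watch.go:274-283
Evidence
rollbackInstall restores the symlink, removes the new install, and clears TTL markers only; it does no marker cleanup.
Watcher success cleanup references marker.VersionedHome (the new path), not the currently running home.
What is wrong
A post-markUpgrade failure (for example, disk full while updating active.commit) leaves a stale marker pointing at the new versioned home even though rollback has already removed that home. When no rollback TTL entries are present, the next successful watcher cleanup can then delete the old install that is actually running.
Why it matters
This can turn a recoverable failed upgrade into a broken installation on the next restart or cleanup, affecting normal upgrade flows after real-world IO failures.
Suggested fix direction
On the markUpgrade failure path, explicitly remove or downgrade the marker before returning.
In watcher success cleanup, verify that marker.VersionedHome exists; if it is missing, keep paths.VersionedHome(topDir) as the authoritative active home.
Add a guard in cleanup so that it never removes the current runtime home.
Candidate failing test(s)
internal/pkg/agent/cmd/watch_test.go: marker points to a missing VersionedHome; cleanup must preserve the current paths.VersionedHome(topDir).
internal/pkg/agent/application/upgrade/upgrade_test.go: simulate markUpgrade failing after the marker write and assert that the marker is removed or converted to a safe terminal state.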
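2. Non-positive download retry backoff allows immediate/negative retry intervals (retry storm)
Priority: P1 (broad impact under transient download failures)
Platform: Linux, macOS, Windows
Location
internal/pkg/agent/application/upgrade/step_download.go:258-260
internal/pkg/agent/application/upgrade/artifact/config.go:57-61
internal/pkg/agent/application/upgrade/artifact/config.go:175-201
Evidence
The config reads retry_sleep_init_duration but has no lower-bound check.
cenkalti/backoff/v4 behavior with a non-positive initial interval is immediate/negative backoff (0s or negative durations), enabling tight retry loops.
What is wrong
A policy value of 0s (or a negative value) for agent.download.retry_sleep_init_duration bypasses the intended exponential waiting, so failed downloads can be retried continuously.
Why it matters
During an artifact outage or network issues, agents can hammer artifact endpoints and consume CPU and network aggressively instead of backing off.
Suggested fix direction
Validate RetrySleepInitDuration > 0 during config unpack/reload.
Clamp invalid values to a safe default (30s) and log a warning.
Optionally enforce a minimum floor in downloadWithRetries as a final safety net.
Candidate failing test(s)
internal/pkg/agent/application/upgrade/step_download_test.go: set RetrySleepInitDuration=0 and assert that retries do not occur immediately.
internal/pkg/agent/application/upgrade/artifact/config_test.go: an invalid non-positive duration is rejected or normalized.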
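Priority ranking
P0: stale-marker cleanup deleting active home after rollback-on-mark-failure.
P1: non-positive download retry backoff allowing immediate/negative retry intervals (retry storm).
Upgrade paths audited and found safe
internal/pkg/agent/cmd/watch.go:175-187
internal/pkg/agent/cmd/watch.go:191-220
internal/pkg/agent/application/upgrade/upgrade.go:604-623
Manual rollback TTL tracking and filtering logic for valid rollback entries: internal/pkg/agent/application/upgrade/manual_rollback.go:254-275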
Note
🔒 Integrity filtering filtered 2 items
Integrity filtering activated and filtered the following items during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.
issue:#unknown (search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)
#8176 (search_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)
From workflow: Sweeper: Upgrade and Rollback Lifecycle