
[fleet-enrollment-resilience] Managed-mode resilience gaps: delayed-enroll retry storm and liveness misses Fleet failure #13549

@github-actions

Description


Findings

1. Delayed enrollment can enter a tight infinite retry loop on permanent auth failure (highest priority)

Location

  • internal/pkg/agent/cmd/run.go:691-703
  • internal/pkg/agent/application/enroll/enroll.go:118-119
  • internal/pkg/fleetapi/enroll_cmd.go:240-242

Evidence

tryDelayEnroll retries forever with no sleep around c.Execute(...):

for {
    if ctx.Err() != nil { return nil, ctx.Err() }
    err = c.Execute(ctx, cli.NewIOStreams())
    if err == nil { break }
    logger.Error(fmt.Errorf("failed to perform delayed enrollment (will try again): %w", err))
}

retryEnroll (inside enrollment) explicitly stops retrying on invalid token:

case errors.Is(err, ...), errors.Is(err, fleetapi.ErrInvalidToken), ...:
    break RETRYLOOP

Unauthorized enrollment maps to ErrInvalidToken:

if resp.StatusCode == http.StatusUnauthorized {
    return nil, ErrInvalidToken
}

What is wrong
A permanent invalid-token error exits enrollment's internal backoff, then tryDelayEnroll immediately calls enrollment again in a for {} loop with no delay.

Why it matters
A realistic misconfiguration (an expired or invalid enrollment token) causes sustained rapid retries during startup, generating log and API storms and leaving the agent stuck in a non-recovering delayed-enroll loop.

Suggested fix
In tryDelayEnroll, add outer backoff and error classification for permanent failures. For permanent auth errors (invalid token), fail fast with explicit terminal state instead of immediate retry.

Failing test to add

  • New test around delayed enrollment path (e.g. in internal/pkg/agent/cmd/run*_test.go) that stubs enroll execution to return fleetapi.ErrInvalidToken and asserts retries are backoff-bounded (or stop after terminal classification), not tight-looped.

2. /liveness?failon=degraded|failed ignores managed Fleet connectivity failure state (high priority)

Location

  • internal/pkg/agent/application/coordinator/coordinator.go:1554-1564
  • internal/pkg/agent/application/coordinator/coordinator_state.go:246-265,280-283
  • internal/pkg/agent/application/monitoring/liveness.go:81-96

Evidence
Managed-mode gateway errors set FleetState via setFleetState(...):

// coordinator.go
case configErr := <-c.managerChans.configManagerError:
    ...
    c.setFleetState(agentclient.Failed, configErr.Error())

But overall State used by liveness is derived from CoordinatorState/component errors, not FleetState:

// coordinator_state.go
} else {
    s.State = s.CoordinatorState
    s.Message = s.CoordinatorMessage
}

Liveness only checks state.State and component/unit states:

// liveness.go
unhealthyState := ... state.State == agentclient.Failed ...
unhealthyComponent := ... HaveState(state.Components, ...)

What is wrong
Managed Fleet control-plane failure can be recorded in FleetState while liveness continues returning 200 because that state is not part of liveness health evaluation.

Why it matters
In managed deployments, an agent can be effectively disconnected from Fleet (no policy/action flow) while probes still report healthy, delaying remediation and masking control-plane outage at scale.

Suggested fix
Include Fleet state in the liveness evaluation for managed mode (or fold Fleet failed/degraded status into the aggregate State that liveness consumes when failon requests degraded/failed semantics).
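A sketch of the folding idea, with `UnitState` and `State` as simplified stand-ins for the real `agentclient` values and coordinator state snapshot (not the actual types):

```go
package main

import "fmt"

// UnitState is a simplified stand-in for the agentclient state enum.
type UnitState int

const (
	Healthy UnitState = iota
	Degraded
	Failed
)

// State is a simplified stand-in for the coordinator state snapshot.
type State struct {
	State      UnitState // aggregate coordinator/component state
	FleetState UnitState // managed-mode Fleet connectivity state
}

// unhealthy folds FleetState into the liveness decision so a managed
// agent that has lost Fleet connectivity fails the probe even when
// every component is healthy.
func unhealthy(s State, failOnDegraded bool) bool {
	bad := func(st UnitState) bool {
		return st == Failed || (failOnDegraded && st == Degraded)
	}
	return bad(s.State) || bad(s.FleetState)
}

func main() {
	// Components healthy but Fleet failed: probe should now fail.
	fmt.Println(unhealthy(State{State: Healthy, FleetState: Failed}, false))   // true
	fmt.Println(unhealthy(State{State: Healthy, FleetState: Degraded}, true))  // true
	fmt.Println(unhealthy(State{State: Healthy, FleetState: Degraded}, false)) // false
}
```

With this shape, `?failon=failed` trips on `FleetState=Failed` and `?failon=degraded` additionally trips on `FleetState=Degraded`, while standalone agents (no Fleet state set) are unaffected.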

Failing test to add

  • Extend internal/pkg/agent/application/monitoring/liveness_test.go with a managed-state fixture where FleetState=Failed and components remain healthy; assert ?failon=failed returns 500.
  • Add corresponding FleetState=Degraded + ?failon=degraded case returning 500.

Priority order

  1. Delayed-enroll permanent-auth tight retry loop (unrecoverable startup + retry storm)
  2. Liveness ignoring managed Fleet failure (control-plane outage masked as healthy)

Communication paths audited and found resilient

  • Check-in elapsed-time measurement uses monotonic-safe pattern (internal/pkg/fleetapi/checkin_cmd.go, time.Now() + time.Since() on same base).
  • Gateway check-in backoff loop is interruptible and resets normal schedule after successful non-unauthorized check-ins (internal/pkg/agent/application/gateway/fleet/fleet_gateway.go).
  • Enrollment path uses exponential backoff for transient enrollment failures (internal/pkg/agent/application/enroll/enroll.go).

Note

🔒 Integrity filtering filtered 53 items

Integrity filtering was activated and filtered 53 items during workflow execution. This happens when a tool call accesses a resource that does not meet the workflow's required integrity or secrecy level.

From workflow: Sweeper: Fleet Enrollment and Communication Resilience


  • expires on Apr 16, 2026, 9:43 AM UTC
