
[fleet-enrollment-resilience] Managed-mode resilience gaps: delayed-enroll retry storm and liveness misses Fleet failure #13549

@github-actions

Description


Findings

1. Delayed enrollment can enter a tight infinite retry loop on permanent auth failure (highest priority)

Location

  • internal/pkg/agent/cmd/run.go:691-703
  • internal/pkg/agent/application/enroll/enroll.go:118-119
  • internal/pkg/fleetapi/enroll_cmd.go:240-242

Evidence

tryDelayEnroll retries forever with no sleep around c.Execute(...):

for {
    if ctx.Err() != nil { return nil, ctx.Err() }
    err = c.Execute(ctx, cli.NewIOStreams())
    if err == nil { break }
    logger.Error(fmt.Errorf("failed to perform delayed enrollment (will try again): %w", err))
}

retryEnroll (inside enrollment) explicitly stops retrying on invalid token:

case errors.Is(err, ...), errors.Is(err, fleetapi.ErrInvalidToken), ...:
    break RETRYLOOP

Unauthorized enrollment maps to ErrInvalidToken:

if resp.StatusCode == http.StatusUnauthorized {
    return nil, ErrInvalidToken
}

What is wrong
A permanent invalid-token error exits enrollment's internal backoff, then tryDelayEnroll immediately calls enrollment again in a for {} loop with no delay.

Why it matters
A realistic misconfiguration (an expired or invalid enrollment token) causes sustained rapid retries during startup, generating log and API storms and leaving the agent stuck in a non-recovering delayed-enroll loop.

Suggested fix
In tryDelayEnroll, add outer backoff and error classification for permanent failures. For permanent auth errors (invalid token), fail fast with explicit terminal state instead of immediate retry.

Failing test to add

  • New test around delayed enrollment path (e.g. in internal/pkg/agent/cmd/run*_test.go) that stubs enroll execution to return fleetapi.ErrInvalidToken and asserts retries are backoff-bounded (or stop after terminal classification), not tight-looped.

2. /liveness?failon=degraded|failed ignores managed Fleet connectivity failure state (high priority)

Location

  • internal/pkg/agent/application/coordinator/coordinator.go:1554-1564
  • internal/pkg/agent/application/coordinator/coordinator_state.go:246-265,280-283
  • internal/pkg/agent/application/monitoring/liveness.go:81-96

Evidence
Managed-mode gateway errors set FleetState via setFleetState(...):

// coordinator.go
case configErr := <-c.managerChans.configManagerError:
    ...
    c.setFleetState(agentclient.Failed, configErr.Error())

But overall State used by liveness is derived from CoordinatorState/component errors, not FleetState:

// coordinator_state.go
} else {
    s.State = s.CoordinatorState
    s.Message = s.CoordinatorMessage
}

Liveness only checks state.State and component/unit states:

// liveness.go
unhealthyState := ... state.State == agentclient.Failed ...
unhealthyComponent := ... HaveState(state.Components, ...)

What is wrong
Managed Fleet control-plane failure can be recorded in FleetState while liveness continues returning 200 because that state is not part of liveness health evaluation.

Why it matters
In managed deployments, an agent can be effectively disconnected from Fleet (no policy/action flow) while probes still report healthy, delaying remediation and masking control-plane outage at scale.

Suggested fix
Include Fleet state in the liveness evaluation for managed mode (or fold Fleet failed/degraded status into the aggregate State that liveness consumes when failon requests degraded/failed semantics).
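A sketch of the folding idea, with `UnitState` and `State` as simplified stand-ins for the real `agentclient` values and coordinator state snapshot (not the actual types):

```go
package main

import "fmt"

// UnitState is a simplified stand-in for the agentclient state enum.
type UnitState int

const (
	Healthy UnitState = iota
	Degraded
	Failed
)

// State is a simplified stand-in for the coordinator state snapshot.
type State struct {
	State      UnitState // aggregate coordinator/component state
	FleetState UnitState // managed-mode Fleet connectivity state
}

// unhealthy folds FleetState into the liveness decision so a managed
// agent that has lost Fleet connectivity fails the probe even when
// every component is healthy.
func unhealthy(s State, failOnDegraded bool) bool {
	bad := func(st UnitState) bool {
		return st == Failed || (failOnDegraded && st == Degraded)
	}
	return bad(s.State) || bad(s.FleetState)
}

func main() {
	// Components healthy but Fleet failed: probe should now fail.
	fmt.Println(unhealthy(State{State: Healthy, FleetState: Failed}, false))   // true
	fmt.Println(unhealthy(State{State: Healthy, FleetState: Degraded}, true))  // true
	fmt.Println(unhealthy(State{State: Healthy, FleetState: Degraded}, false)) // false
}
```

With this shape, `?failon=failed` trips on `FleetState=Failed` and `?failon=degraded` additionally trips on `FleetState=Degraded`, while standalone agents (no Fleet state set) are unaffected.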

Failing test to add

  • Extend internal/pkg/agent/application/monitoring/liveness_test.go with a managed-state fixture where FleetState=Failed and components remain healthy; assert ?failon=failed returns 500.
  • Add corresponding FleetState=Degraded + ?failon=degraded case returning 500.

Priority order

  1. Delayed-enroll permanent-auth tight retry loop (unrecoverable startup + retry storm)
  2. Liveness ignoring managed Fleet failure (control-plane outage masked as healthy)

Communication paths audited and found resilient

  • Check-in elapsed-time measurement uses monotonic-safe pattern (internal/pkg/fleetapi/checkin_cmd.go, time.Now() + time.Since() on same base).
  • Gateway check-in backoff loop is interruptible and resets normal schedule after successful non-unauthorized check-ins (internal/pkg/agent/application/gateway/fleet/fleet_gateway.go).
  • Enrollment path uses exponential backoff for transient enrollment failures (internal/pkg/agent/application/enroll/enroll.go).

Note

🔒 Integrity filtering filtered 53 items

Integrity filtering was activated and filtered 53 items during workflow execution. This happens when a tool call accesses a resource that does not meet the workflow's required integrity or secrecy level.

From workflow: Sweeper: Fleet Enrollment and Communication Resilience


  • expires on Apr 16, 2026, 9:43 AM UTC
