Findings
1. Delayed enrollment can enter a tight infinite retry loop on permanent auth failure (highest priority)
Location
internal/pkg/agent/cmd/run.go:691-703
internal/pkg/agent/application/enroll/enroll.go:118-119
internal/pkg/fleetapi/enroll_cmd.go:240-242
Evidence
tryDelayEnroll retries forever with no sleep around c.Execute(...).
retryEnroll (inside enrollment) explicitly stops retrying on invalid token.
Unauthorized enrollment maps to ErrInvalidToken.
What is wrong
A permanent invalid-token error exits enrollment's internal backoff, then tryDelayEnroll immediately calls enrollment again in a for {} loop with no delay.
Why it matters
A realistic misconfiguration (expired/invalid enrollment token) can cause sustained rapid retries during startup, generating log/API storms and leaving the agent stuck in non-recovering delayed-enroll behavior.
Suggested fix
In tryDelayEnroll, add outer backoff and error classification for permanent failures. For permanent auth errors (invalid token), fail fast with explicit terminal state instead of immediate retry.
Failing test to add
New test around delayed enrollment path (e.g. in internal/pkg/agent/cmd/run*_test.go) that stubs enroll execution to return fleetapi.ErrInvalidToken and asserts retries are backoff-bounded (or stop after terminal classification), not tight-looped.
2. /liveness?failon=degraded|failed ignores managed Fleet connectivity failure state (high priority)
Location
internal/pkg/agent/application/coordinator/coordinator.go:1554-1564
internal/pkg/agent/application/coordinator/coordinator_state.go:246-265,280-283
internal/pkg/agent/application/monitoring/liveness.go:81-96
Evidence
Managed-mode gateway errors set FleetState via setFleetState(...).
But overall State used by liveness is derived from CoordinatorState/component errors, not FleetState.
Liveness only checks state.State and component/unit states.
What is wrong
Managed Fleet control-plane failure can be recorded in FleetState while liveness continues returning 200 because that state is not part of liveness health evaluation.
Why it matters
In managed deployments, an agent can be effectively disconnected from Fleet (no policy/action flow) while probes still report healthy, delaying remediation and masking control-plane outage at scale.
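A liveness evaluation that takes Fleet state into account in managed mode could look roughly like the Go sketch below. Everything here is a simplified assumption for illustration: healthState, agentState, and livenessStatus are hypothetical names, not the coordinator's or the monitoring package's real types, and the failon handling only models the degraded and failed values discussed in this finding.

```go
package main

import (
	"fmt"
	"net/http"
)

// healthState is a hypothetical, ordered health scale (higher is worse).
type healthState int

const (
	healthy healthState = iota
	degraded
	failed
)

// agentState is a simplified stand-in for the state liveness inspects.
type agentState struct {
	State      healthState // aggregate coordinator/component state
	FleetState healthState // Fleet connectivity state
	Managed    bool        // true when the agent is Fleet-managed
}

// livenessStatus returns the HTTP status for a given failon value.
// In managed mode the worse of State and FleetState is evaluated.
func livenessStatus(s agentState, failon string) int {
	worst := s.State
	if s.Managed && s.FleetState > worst {
		worst = s.FleetState
	}
	switch failon {
	case "degraded":
		if worst >= degraded {
			return http.StatusInternalServerError
		}
	case "failed":
		if worst == failed {
			return http.StatusInternalServerError
		}
	}
	return http.StatusOK
}

func main() {
	s := agentState{State: healthy, FleetState: failed, Managed: true}
	fmt.Println(livenessStatus(s, "failed"))
}
```

In this sketch a managed agent with healthy components and a failed Fleet state is reported as unhealthy for failon=failed, while a standalone agent with the same fields still returns 200 because FleetState is only consulted when Managed is set.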
Suggested fix
Include Fleet state in liveness evaluation for managed mode (or fold Fleet failure/degraded into aggregate State used by liveness when failon requests degraded/failed semantics).
Failing test to add
Extend internal/pkg/agent/application/monitoring/liveness_test.go with a managed-state fixture where FleetState=Failed and components remain healthy; assert ?failon=failed returns 500.
Add corresponding FleetState=Degraded + ?failon=degraded case returning 500.
Priority order
1. Delayed enrollment tight infinite retry loop on permanent auth failure
2. Liveness ignoring managed Fleet failure (control-plane outage masked as healthy)
Communication paths audited and found resilient
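Check-in elapsed-time measurement uses monotonic-safe pattern (internal/pkg/fleetapi/checkin_cmd.go, time.Now() + time.Since() on same base).
Gateway check-in backoff loop is interruptible and resets normal schedule after successful non-unauthorized check-ins (internal/pkg/agent/application/gateway/fleet/fleet_gateway.go).
Enrollment path uses exponential backoff for transient enrollment failures (internal/pkg/agent/application/enroll/enroll.go).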
From workflow: Sweeper: Fleet Enrollment and Communication Resilience