
[BUG] UnitedDeployment: negative maxUnavailable written to child workload during adaptive rescheduling scale-down race #2412

@rakshaak29

Description

What happened:

I was running a UnitedDeployment with reserveUnschedulablePods: true enabled under the adaptive schedule strategy. While a subset was in an unschedulable state and I was scaling down at the same time, all the pods in that subset started restarting at once: no rolling update behaviour, just everything going down together. That's when I dug into the controller logic and found something off.

The controller computes a maxUnavailable value to overwrite the child workload's (CloneSet/StatefulSet) rolling update limit when a subset is unschedulable. The formula is:

maxUnavailable = Spec.Replicas - Status.ReadyReplicas + UpdateTimeoutPods

The problem is that during a scale-down, there's a natural race window where ReadyReplicas can temporarily be higher than Spec.Replicas — new pods from the receiving subset are already up and ready before the excess pods on the current subset have finished terminating. When that happens, the subtraction goes negative, and that negative value gets directly written as MaxUnavailable into the child workload. As far as the CloneSet or StatefulSet is concerned, there's no limit anymore, and it lets everything roll at once.
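A small numeric illustration of the race, using the formula from the report as-is (the function name and signature here are mine for illustration, not the actual Kruise identifiers):

```go
package main

import "fmt"

// maxUnavailable applies the reported formula verbatim, with no lower bound.
// Illustrative sketch only; not the actual controller code.
func maxUnavailable(specReplicas, readyReplicas, updateTimeoutPods int32) int32 {
	return specReplicas - readyReplicas + updateTimeoutPods
}

func main() {
	// Steady state: 5 desired, 3 ready, 0 timed out -> limit of 2. Sane.
	fmt.Println(maxUnavailable(5, 3, 0)) // 2

	// Scale-down race: desired already dropped to 3, but 5 pods are still
	// ready because the excess pods have not finished terminating yet.
	fmt.Println(maxUnavailable(3, 5, 0)) // -2, written as-is to the child workload
}
```

The negative value is what the child CloneSet/StatefulSet then interprets as "no effective limit".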

What you expected to happen:

The rolling update limit should stay sane even during transient conditions. If the computed value ends up negative for whatever reason, the controller should treat it as 0 (no additional pods may go unavailable) rather than silently disabling all limits. It's a small race window, but the consequences — an uncontrolled full restart during an already degraded state — feel pretty severe for production workloads.
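A minimal sketch of the expected behaviour, assuming the fix is a simple clamp at zero (again, names are illustrative, not the real controller code):

```go
package main

import "fmt"

// clampedMaxUnavailable computes the same value as the controller's formula,
// but treats a transient negative result as 0 (no additional pods may go
// unavailable) instead of passing it through to the child workload.
func clampedMaxUnavailable(specReplicas, readyReplicas, updateTimeoutPods int32) int32 {
	mu := specReplicas - readyReplicas + updateTimeoutPods
	if mu < 0 {
		return 0
	}
	return mu
}

func main() {
	// The same race-window inputs as before now produce a safe limit.
	fmt.Println(clampedMaxUnavailable(3, 5, 0)) // 0 instead of -2
}
```

With the clamp in place, the race window still exists, but its worst case is an overly conservative rollout rather than an uncontrolled full restart.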

How to reproduce it (as minimally and precisely as possible):

  1. Set up a UnitedDeployment with at least two subsets and adaptive strategy:
spec:
  topology:
    scheduleStrategy:
      type: Adaptive
      adaptive:
        reserveUnschedulablePods: true
    subsets:
      - name: subset-a
        nodeSelectorTerm: ...
      - name: subset-b
        nodeSelectorTerm: ...
  2. Trigger a situation where subset-a becomes unschedulable — either by tainting the nodes or causing pods to stay pending past the rescheduling threshold.

  3. While rescheduling is happening and pods from subset-b are coming up, reduce spec.replicas at the same time.

  4. Watch the pods in subset-a. During the window where subset-b's pods are ready but subset-a's excess pods haven't terminated yet, the controller reconciles and sets MaxUnavailable to a negative number on the child workload.

  5. The child workload (CloneSet or StatefulSet) treats the negative value as essentially unlimited, and all pods in subset-a go down simultaneously.

Anything else we need to know?:

The bug only affects setups where reserveUnschedulablePods: true is configured — so it's behind a flag, which might be why it hasn't surfaced widely. But the race condition itself is not unusual at all; it happens during any normal scale-down while rescheduling is active. The tricky part is that there's no error, no warning in the logs — the controller just quietly sets a bad value and the workload behaves unexpectedly.

Environment:

  • Kruise version: tested on latest main
  • Kubernetes version (use kubectl version): v1.28+
  • Install details (e.g. helm install args): standard helm install, no custom flags
  • Others: only reproducible when reserveUnschedulablePods: true is set on the UnitedDeployment

Labels: kind/bug
