What happened:
I was working with a UnitedDeployment that had reserveUnschedulablePods: true enabled under the adaptive schedule strategy. While a subset was in an unschedulable state and I was scaling down at the same time, I noticed that all the pods in that subset started restarting all at once — no rolling update behaviour, just everything going down together. That's when I dug into the controller logic and found something off.
The controller computes a maxUnavailable value to overwrite the child workload's (CloneSet/StatefulSet) rolling update limit when a subset is unschedulable. The formula is:
maxUnavailable = Spec.Replicas - Status.ReadyReplicas + UpdateTimeoutPods
The problem is that during a scale-down, there's a natural race window where ReadyReplicas can temporarily be higher than Spec.Replicas — new pods from the receiving subset are already up and ready before the excess pods on the current subset have finished terminating. When that happens, the subtraction goes negative, and that negative value gets directly written as MaxUnavailable into the child workload. As far as the CloneSet or StatefulSet is concerned, there's no limit anymore, and it lets everything roll at once.
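To make the failure mode concrete, here is a minimal sketch in Go of the calculation as described above. The struct and function names are illustrative, not the actual Kruise controller code; the point is just how the signed arithmetic flips negative during a scale-down.

```go
package main

import "fmt"

// Illustrative model of the subset state involved in the formula above.
// These names are hypothetical, not the real controller types.
type subsetState struct {
	SpecReplicas      int32 // desired replicas for the subset
	ReadyReplicas     int32 // pods currently ready in the subset
	UpdateTimeoutPods int32 // pods counted as timed out while rescheduling
}

// unguardedMaxUnavailable mirrors the reported formula:
// maxUnavailable = Spec.Replicas - Status.ReadyReplicas + UpdateTimeoutPods
func unguardedMaxUnavailable(s subsetState) int32 {
	return s.SpecReplicas - s.ReadyReplicas + s.UpdateTimeoutPods
}

func main() {
	// Mid scale-down: spec.replicas has already dropped to 3, but 5 pods are
	// still ready because the excess pods have not finished terminating.
	s := subsetState{SpecReplicas: 3, ReadyReplicas: 5, UpdateTimeoutPods: 0}
	fmt.Println(unguardedMaxUnavailable(s)) // prints -2, which is written to the child workload as-is
}
```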
What you expected to happen:
The rolling update limit should stay sane even during transient conditions. If the computed value ends up negative for whatever reason, the controller should treat it as 0 (no additional pods may go unavailable) rather than silently disabling all limits. It's a small race window, but the consequences — an uncontrolled full restart during an already degraded state — feel pretty severe for production workloads.
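Assuming the fix is simply to clamp at the point the value is computed, a sketch (again with illustrative names, not the actual controller code) could look like this:

```go
// clampedMaxUnavailable computes the same value as above but never lets it
// drop below zero. Parameter names are illustrative, not the real code.
func clampedMaxUnavailable(specReplicas, readyReplicas, updateTimeoutPods int32) int32 {
	mu := specReplicas - readyReplicas + updateTimeoutPods
	if mu < 0 {
		// Transient state (e.g. ReadyReplicas > SpecReplicas during a scale-down):
		// allow no additional unavailable pods instead of removing the limit entirely.
		return 0
	}
	return mu
}
```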
How to reproduce it (as minimally and precisely as possible):
- Set up a UnitedDeployment with at least two subsets and the adaptive schedule strategy:

  ```yaml
  spec:
    topology:
      scheduleStrategy:
        type: Adaptive
        adaptive:
          reserveUnschedulablePods: true
      subsets:
      - name: subset-a
        nodeSelectorTerm: ...
      - name: subset-b
        nodeSelectorTerm: ...
  ```

- Trigger a situation where subset-a becomes unschedulable — either by tainting the nodes or causing pods to stay pending past the rescheduling threshold.
- While rescheduling is happening and pods from subset-b are coming up, reduce spec.replicas at the same time.
- Watch the pods in subset-a. During the window where subset-b's pods are ready but subset-a's excess pods haven't terminated yet, the controller reconciles and sets MaxUnavailable to a negative number on the child workload.
- The child workload (CloneSet or StatefulSet) treats it as essentially unlimited, and all pods in subset-a go down simultaneously.
Anything else we need to know?:
The bug only affects setups where reserveUnschedulablePods: true is configured — so it's behind a flag, which might be why it hasn't surfaced widely. But the race condition itself is not unusual at all; it happens during any normal scale-down while rescheduling is active. The tricky part is that there's no error, no warning in the logs — the controller just quietly sets a bad value and the workload behaves unexpectedly.
Environment:
- Kruise version: tested on latest main
- Kubernetes version (use kubectl version): v1.28+
- Install details (e.g. helm install args): standard helm install, no custom flags
- Others: only reproducible when reserveUnschedulablePods: true is set on the UnitedDeployment