Motivation
OnProxyDisconnected in pkg/xds/server/callbacks/dataplane_sync_tracker.go:68-81 has a TOCTOU race. It reads the watchdog entry under RLock, releases the lock, then later acquires Lock to delete. In the gap (lines 71-77), OnProxyConnected can replace the entry for the same dpKey. The subsequent delete then removes the new watchdog, leaking its goroutine.
t.RLock()
dpData := t.watchdogs[dpKey] // reads old entry
t.RUnlock() // gap opens
if dpData != nil {
dpData.cancelFunc() // cancels old watchdog
<-dpData.stopped // waits for old watchdog (OnProxyConnected can run here)
t.Lock()
delete(t.watchdogs, dpKey) // deletes NEW entry
}
This is a logical race (not a data race), so -race won't catch it. Under high DP churn (rapid connect/disconnect), this leaks goroutines and causes CP instability.
Implementation information
Hold the write lock for the entire disconnect operation instead of upgrading from RLock to Lock:
func (t *dataplaneSyncTracker) OnProxyDisconnected(...) {
t.Lock()
dpData := t.watchdogs[dpKey]
delete(t.watchdogs, dpKey)
t.Unlock()
if dpData != nil {
dpData.cancelFunc()
<-dpData.stopped
}
}
This eliminates the TOCTOU window. The blocking <-dpData.stopped happens after releasing the lock so it doesn't hold up other operations.
Motivation
OnProxyDisconnectedinpkg/xds/server/callbacks/dataplane_sync_tracker.go:68-81has a TOCTOU race. It reads the watchdog entry underRLock, releases the lock, then later acquiresLockto delete. In the gap (lines 71-77),OnProxyConnectedcan replace the entry for the samedpKey. The subsequentdeletethen removes the new watchdog, leaking its goroutine.This is a logical race (not a data race), so
-racewon't catch it. Under high DP churn (rapid connect/disconnect), this leaks goroutines and causes CP instability.Implementation information
Hold the write lock for the entire disconnect operation instead of upgrading from RLock to Lock:
This eliminates the TOCTOU window. The blocking
<-dpData.stoppedhappens after releasing the lock so it doesn't hold up other operations.