fix(xds): TOCTOU race in dataplaneSyncTracker.OnProxyDisconnected #16162

@Automaat

Description

Motivation

OnProxyDisconnected in pkg/xds/server/callbacks/dataplane_sync_tracker.go:68-81 has a TOCTOU race. It reads the watchdog entry under RLock, releases the lock, then later acquires Lock to delete. In the gap (lines 71-77), OnProxyConnected can replace the entry for the same dpKey. The subsequent delete then removes the new watchdog, leaking its goroutine.

t.RLock()
dpData := t.watchdogs[dpKey]  // reads old entry
t.RUnlock()                    // gap opens

if dpData != nil {
    dpData.cancelFunc()        // cancels old watchdog
    <-dpData.stopped           // waits for old watchdog (OnProxyConnected can run here)
    t.Lock()
    delete(t.watchdogs, dpKey) // deletes NEW entry
}

This is a logical race (not a data race), so the Go race detector (-race) won't catch it. Under high data plane (DP) churn, i.e. rapid connect/disconnect of the same dpKey, this leaks goroutines and causes control plane (CP) instability.
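The interleaving described above can be reproduced deterministically with a simplified model. The sketch below is not the Kuma code: `racyTracker`, `watchdog`, `connect`, `disconnect`, the string key, and the `inGap` hook are all hypothetical names introduced for illustration. The hook stands in for an OnProxyConnected call that lands in the unlocked gap.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// watchdog is a hypothetical stand-in for the per-dataplane entry.
type watchdog struct {
	cancelFunc context.CancelFunc
	stopped    chan struct{}
}

// racyTracker mirrors the check-then-delete shape from the issue.
type racyTracker struct {
	sync.RWMutex
	watchdogs map[string]*watchdog
	inGap     func() // illustration-only hook, called while no lock is held
}

// connect starts a watchdog goroutine and stores it, replacing any old entry.
func (t *racyTracker) connect(key string) *watchdog {
	ctx, cancel := context.WithCancel(context.Background())
	w := &watchdog{cancelFunc: cancel, stopped: make(chan struct{})}
	go func() {
		defer close(w.stopped)
		<-ctx.Done() // runs until cancelled
	}()
	t.Lock()
	t.watchdogs[key] = w
	t.Unlock()
	return w
}

// disconnect reads under RLock, then deletes under a separate Lock.
func (t *racyTracker) disconnect(key string) {
	t.RLock()
	w := t.watchdogs[key]
	t.RUnlock()
	if w != nil {
		w.cancelFunc()
		<-w.stopped
		if t.inGap != nil {
			t.inGap() // a reconnect arrives here
		}
		t.Lock()
		delete(t.watchdogs, key) // removes whatever is stored now
		t.Unlock()
	}
}

// reproduceRace reports whether the new watchdog ended up orphaned:
// still running, but no longer present in the map.
func reproduceRace() bool {
	t := &racyTracker{watchdogs: map[string]*watchdog{}}
	t.connect("dp-1")
	var fresh *watchdog
	t.inGap = func() { fresh = t.connect("dp-1") }
	t.disconnect("dp-1")

	t.RLock()
	_, tracked := t.watchdogs["dp-1"]
	t.RUnlock()

	select {
	case <-fresh.stopped:
		return false // the new watchdog was stopped, nothing leaked
	default:
		return !tracked
	}
}

func main() {
	fmt.Println("new watchdog orphaned:", reproduceRace())
}
```

In this model the new watchdog keeps running after `disconnect` returns, and nothing in the map can reach its `cancelFunc` anymore.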

Implementation information

Look up and delete the entry under a single write lock, instead of reading under RLock and acquiring Lock later to delete:

func (t *dataplaneSyncTracker) OnProxyDisconnected(...) {
    t.Lock()
    dpData := t.watchdogs[dpKey]
    delete(t.watchdogs, dpKey)
    t.Unlock()

    if dpData != nil {
        dpData.cancelFunc()
        <-dpData.stopped
    }
}

This eliminates the TOCTOU window. The blocking <-dpData.stopped happens after releasing the lock so it doesn't hold up other operations.
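A minimal sketch of the proposed shape, using the same kind of simplified model rather than the real tracker. `fixedTracker`, `watchdog`, `connect`, `disconnect`, and the `afterUnlock` hook are hypothetical names. The hook simulates a reconnect that arrives while the disconnect path is outside the lock, to check that the new entry is left alone.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// watchdog is a hypothetical stand-in for the per-dataplane entry.
type watchdog struct {
	cancelFunc context.CancelFunc
	stopped    chan struct{}
}

// fixedTracker removes the entry in the same critical section that reads it.
type fixedTracker struct {
	sync.RWMutex
	watchdogs   map[string]*watchdog
	afterUnlock func() // illustration-only hook, called after the lock is released
}

// connect starts a watchdog goroutine and stores it under key.
func (t *fixedTracker) connect(key string) *watchdog {
	ctx, cancel := context.WithCancel(context.Background())
	w := &watchdog{cancelFunc: cancel, stopped: make(chan struct{})}
	go func() {
		defer close(w.stopped)
		<-ctx.Done() // runs until cancelled
	}()
	t.Lock()
	t.watchdogs[key] = w
	t.Unlock()
	return w
}

// disconnect reads and deletes atomically, then waits without the lock.
func (t *fixedTracker) disconnect(key string) {
	t.Lock()
	w := t.watchdogs[key]
	delete(t.watchdogs, key)
	t.Unlock()

	if t.afterUnlock != nil {
		t.afterUnlock() // a reconnect arrives here
	}
	if w != nil {
		w.cancelFunc()
		<-w.stopped // blocks only this caller, not other tracker operations
	}
}

// reconnectSurvives reports whether a watchdog registered during the
// disconnect is still tracked and still running afterwards.
func reconnectSurvives() bool {
	t := &fixedTracker{watchdogs: map[string]*watchdog{}}
	t.connect("dp-1")
	var fresh *watchdog
	t.afterUnlock = func() { fresh = t.connect("dp-1") }
	t.disconnect("dp-1")

	t.RLock()
	tracked := t.watchdogs["dp-1"] == fresh
	t.RUnlock()

	select {
	case <-fresh.stopped:
		return false // the new watchdog was stopped by mistake
	default:
		return tracked
	}
}

func main() {
	fmt.Println("new watchdog still tracked:", reconnectSurvives())
}
```

Because the pointer that gets cancelled is the one removed under the lock, the disconnect path never touches an entry that was stored after it released the lock.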

Metadata

Labels

triage/accepted: The issue was reviewed and is complete enough to start working on it
