Search before asking
What happened
A CDC job was running normally with checkpoints completing at regular intervals. At some point during a rolling restart, checkpoint completion logs stopped entirely. From that point onward, the coordinator logs only showed MessageDelayedEvent entries — records being streamed from the source — but no pending checkpoint notify finished or start notify checkpoint completed entries appeared at all.
Despite this, the job showed as RUNNING in both the REST API (/running-jobs) and the UI. Data was continuing to flow to the sink. There was no error, no indication anything was wrong.
This continued for several days across multiple coordinator restarts. Each time a new master came up, the job appeared in the running jobs list with RUNNING status. No checkpoint was ever written to S3 after the initial failure point.
During the next rolling restart:
- The new master called restoreAllRunningJobFromMasterNodeSwitch()
- Found the jobId in runningJobInfoIMap
- Read the last checkpoint from S3 — which pointed to a binlog file from the original failure, now purged from MySQL
- Job failed immediately with:

IllegalStateException: The connector is trying to read binlog starting at
file=mysql-bin-changelog.XXXXXX, pos=XXXXXXX
but this is no longer available on the server.
Reconfigure the connector to use a snapshot when needed.
Anticipated root cause
When a job fails and the coordinator starts cleanJob(), the sequence is:
- Coordinator writes terminal state (FAILED) to runningJobStateIMap
- cleanJob() is triggered → eventually calls removeJobIMap(), which removes the jobId from runningJobInfoIMap

If the coordinator pod is killed (OOM, rolling restart, eviction) between step 1 and step 2, the jobId remains as a zombie in runningJobInfoIMap with FAILED state in runningJobStateIMap.
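For illustration, a minimal sketch of that two-step sequence and the crash window (simplified; cleanJob(), removeJobIMap() and the two IMap names come from the description above, while the surrounding onJobFailed() wrapper and its signature are hypothetical):

// Simplified sketch, not the actual CoordinatorService code.
void onJobFailed(long jobId) {
    // Step 1: terminal state is written first
    runningJobStateIMap.put(jobId, JobStatus.FAILED);

    // <-- if the coordinator pod dies here (OOM, eviction, rolling restart),
    //     step 2 never runs and jobId is left behind as a zombie entry

    // Step 2: cleanup eventually removes the job from the running-jobs map
    cleanJob(jobId);        // ...
    removeJobIMap(jobId);   // ends with runningJobInfoIMap.remove(jobId)
}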
On the next master switch, restoreJobFromMasterActiveSwitch() does:
Object jobState = runningJobStateIMap.get(jobId);
if (jobState == null) {
    runningJobInfoIMap.remove(jobId);
    return;
}
// proceeds to restore — no terminal state check
Since FAILED != null, it proceeds to restore the job. A new JobMaster is created, the job appears as RUNNING in the API and UI, but the underlying checkpoint coordinator may immediately fail again (same root cause), leaving the job in a state where:
- Worker tasks run without an active checkpoint coordinator
- No checkpoint barriers are injected
- No checkpoints are written to S3
- The job shows RUNNING in the UI indefinitely
The last written checkpoint on S3 remains frozen at the pre-failure position. On any subsequent master switch or rollout, this stale checkpoint is used for restoration, leading to the binlog-purged error above.
Fix:
In CoordinatorService.restoreJobFromMasterActiveSwitch(), add a terminal state check after the existing null check:
if (jobState == null) {
    runningJobInfoIMap.remove(jobId);
    return;
}

// NEW: do not restore jobs already in terminal state;
// the coordinator was killed before cleanJob() could remove them from the IMap
if (jobState instanceof JobStatus && ((JobStatus) jobState).isEndState()) {
    logger.warning(String.format(
            "Job %s is in terminal state %s, skipping restore and cleaning up IMap",
            jobId, jobState));
    runningJobInfoIMap.remove(jobId);
    return;
}
This covers FAILED, CANCELED, FINISHED, SAVEPOINT_DONE, UNKNOWABLE.
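A regression test along these lines could lock the behavior in (sketch only; the IMap wiring, the jobInfo fixture and the getJobMaster() accessor are assumptions, not taken from the existing test suite):

// Hypothetical test sketch: the zombie entry left by a killed pod must be
// cleaned up on master switch instead of being restored as RUNNING.
@Test
public void terminalStateJobIsNotRestoredOnMasterSwitch() {
    long jobId = 42L;
    runningJobInfoIMap.put(jobId, jobInfo);            // zombie entry
    runningJobStateIMap.put(jobId, JobStatus.FAILED);  // terminal state already written

    coordinatorService.restoreAllRunningJobFromMasterNodeSwitch();

    assertFalse(runningJobInfoIMap.containsKey(jobId)); // entry removed, not restored
    assertNull(coordinatorService.getJobMaster(jobId)); // no new JobMaster created
}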
Additional observation — missing UI visibility
There is currently no way in the SeaTunnel UI or REST API to see when the last checkpoint completed for a running job (e.g. "last checkpointed 30s ago"). This makes it impossible to detect the silent no-checkpoint scenario described above — the job appears healthy with RUNNING status while checkpointing has completely stopped.
A lastCheckpointTime or secondsSinceLastCheckpoint field exposed via /running-jobs or the UI would allow operators to immediately isolate jobs where checkpointing has silently stopped, rather than discovering the problem only at the next rollout when the stale checkpoint causes a restoration failure.
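For example, a /running-jobs entry could expose it roughly like this (field names and values are only a suggestion, not an existing API):

{
  "jobId": "915234567890123777",
  "jobName": "mysql-cdc-to-s3",
  "jobStatus": "RUNNING",
  "lastCheckpointTime": 1717400132000,
  "secondsSinceLastCheckpoint": 30
}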
SeaTunnel Version
SeaTunnel version: 2.3.12 (Zeta engine)
Environment: Kubernetes, S3 checkpoint storage
Are you willing to submit PR?
Code of Conduct