
[Bug][Zeta] Job appears RUNNING in UI with no checkpointing — next master switch restores from stale checkpoint #10675

@ricky2129

Description


Search before asking

  • I had searched in the issues and found no similar issues.

What happened

A CDC job was running normally with checkpoints completing at regular intervals. At some point during a rolling restart, checkpoint completion logs stopped entirely. From that point onward, the coordinator logs showed only MessageDelayedEvent entries (records being streamed from the source); no "pending checkpoint notify finished" or "start notify checkpoint completed" entries appeared at all.

Despite this, the job showed as RUNNING in both the REST API (/running-jobs) and the UI. Data was continuing to flow to the sink. There was no error, no indication anything was wrong.

This continued for several days across multiple coordinator restarts. Each time a new master came up, the job appeared in the running jobs list with RUNNING status. No checkpoint was ever written to S3 after the initial failure point.

During the next rolling restart:

  • The new master called restoreAllRunningJobFromMasterNodeSwitch()
  • Found the jobId in runningJobInfoIMap
  • Read the last checkpoint from S3 — which pointed to a binlog file from the original failure, now purged from MySQL
  • Job failed immediately with:

IllegalStateException: The connector is trying to read binlog starting at
file=mysql-bin-changelog.XXXXXX, pos=XXXXXXX
but this is no longer available on the server.
Reconfigure the connector to use a snapshot when needed.

Anticipated root cause

When a job fails and the coordinator starts cleanJob(), the sequence is:

  1. Coordinator writes terminal state (FAILED) to runningJobStateIMap
  2. cleanJob() is triggered → eventually calls removeJobIMap() which removes the jobId from runningJobInfoIMap

If the coordinator pod is killed (OOM, rolling restart, eviction) between step 1 and step 2, the jobId remains as a zombie in runningJobInfoIMap with FAILED
state in runningJobStateIMap.
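The race window between steps 1 and 2 can be sketched with plain HashMaps standing in for the Hazelcast IMaps (a hypothetical simplification; strings stand in for the real state/info objects):

```java
import java.util.HashMap;
import java.util.Map;

public class ZombieEntrySketch {
    // Stand-ins for the Hazelcast IMaps (hypothetical simplification)
    static Map<Long, String> runningJobStateIMap = new HashMap<>();
    static Map<Long, String> runningJobInfoIMap = new HashMap<>();

    public static void main(String[] args) {
        long jobId = 42L;
        runningJobInfoIMap.put(jobId, "job-info");

        // Step 1: coordinator records the terminal state
        runningJobStateIMap.put(jobId, "FAILED");

        // --- coordinator pod killed here (OOM / eviction) ---
        // Step 2 (removeJobIMap) never runs, so the entry survives:
        boolean zombie = runningJobInfoIMap.containsKey(jobId)
                && "FAILED".equals(runningJobStateIMap.get(jobId));
        System.out.println("zombie entry left behind: " + zombie); // true
    }
}
```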

On the next master switch, restoreJobFromMasterActiveSwitch() does:

Object jobState = runningJobStateIMap.get(jobId);
if (jobState == null) {
    runningJobInfoIMap.remove(jobId);
    return;
}
// proceeds to restore — no terminal state check

Since FAILED != null, it proceeds to restore the job. A new JobMaster is created, the job appears as RUNNING in the API and UI, but the underlying checkpoint coordinator may immediately fail again (same root cause), leaving the job in a state where:
- Worker tasks run without an active checkpoint coordinator
- No checkpoint barriers are injected
- No checkpoints are written to S3
- The job shows RUNNING in the UI indefinitely

The last written checkpoint on S3 remains frozen at the pre-failure position. On any subsequent master switch or rollout, this stale checkpoint is used for restoration, leading to the binlog-purged error above.
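Why the restoration then fails can be reduced to a position comparison (the sequence numbers below are hypothetical; the real check happens inside the MySQL CDC connector when it opens the binlog stream):

```java
public class StaleOffsetSketch {
    public static void main(String[] args) {
        // Hypothetical binlog sequence numbers for illustration
        long checkpointedBinlogSeq = 101_234;  // frozen pre-failure position stored on S3
        long earliestAvailableSeq  = 150_000;  // MySQL has purged everything older

        // A restore is only valid if the checkpointed position still exists on the server;
        // otherwise the connector hits the IllegalStateException shown above.
        boolean restorable = checkpointedBinlogSeq >= earliestAvailableSeq;
        System.out.println("restorable=" + restorable);
    }
}
```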

Fix:

In CoordinatorService.restoreJobFromMasterActiveSwitch(), add a terminal state check after the existing null check:

  if (jobState == null) {
      runningJobInfoIMap.remove(jobId);
      return;
  }

  // NEW: do not restore jobs already in a terminal state;
  // the coordinator was killed before cleanJob() could remove them from the IMap
  if (jobState instanceof JobStatus && ((JobStatus) jobState).isEndState()) {
      logger.warning(String.format(
          "Job %s is in terminal state %s, skipping restore and cleaning up IMap",
          jobId, jobState));
      runningJobInfoIMap.remove(jobId);
      return;
  }


This covers the terminal states FAILED, CANCELED, FINISHED, SAVEPOINT_DONE, and UNKNOWABLE.
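The guard can be exercised end to end with a self-contained sketch. The JobStatus enum and isEndState() below are hypothetical stand-ins mirroring the names used above, not SeaTunnel's actual classes:

```java
import java.util.HashMap;
import java.util.Map;

public class TerminalStateGuardSketch {
    // Hypothetical mirror of the engine's job states
    enum JobStatus {
        RUNNING, FAILED, CANCELED, FINISHED, SAVEPOINT_DONE, UNKNOWABLE;
        boolean isEndState() { return this != RUNNING; }
    }

    static Map<Long, JobStatus> runningJobStateIMap = new HashMap<>();
    static Map<Long, String> runningJobInfoIMap = new HashMap<>();

    /** Returns true if the job would be restored by the master switch. */
    static boolean restoreJob(long jobId) {
        JobStatus jobState = runningJobStateIMap.get(jobId);
        if (jobState == null) {
            runningJobInfoIMap.remove(jobId);
            return false;
        }
        // NEW: skip jobs already in a terminal state and clean up the zombie entry
        if (jobState.isEndState()) {
            runningJobInfoIMap.remove(jobId);
            return false;
        }
        return true; // a new JobMaster would be created here
    }

    public static void main(String[] args) {
        runningJobInfoIMap.put(1L, "zombie job");
        runningJobStateIMap.put(1L, JobStatus.FAILED);
        System.out.println("restored: " + restoreJob(1L));
        System.out.println("cleaned: " + !runningJobInfoIMap.containsKey(1L));
    }
}
```

With the guard in place, the zombie FAILED entry is dropped instead of being revived as a phantom RUNNING job.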

Additional observation — missing UI visibility

There is currently no way in the SeaTunnel UI or REST API to see when the last checkpoint completed for a running job (e.g. "last checkpointed 30s ago"). This makes it impossible to detect the silent no-checkpoint scenario described above — the job appears healthy with RUNNING status while checkpointing has completely stopped.

A lastCheckpointTime or secondsSinceLastCheckpoint field exposed via /running-jobs or the UI would allow operators to immediately isolate jobs where checkpointing has silently stopped, rather than discovering the problem only at the next rollout when the stale checkpoint causes a restoration failure.
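Computing such a field is cheap if the coordinator already tracks the completion timestamp. A minimal sketch, assuming a hypothetical lastCheckpointTime field and an arbitrary staleness threshold:

```java
import java.time.Duration;
import java.time.Instant;

public class CheckpointAgeSketch {
    // Hypothetical: timestamp of the last completed checkpoint for a job
    static Instant lastCheckpointTime = Instant.now().minusSeconds(30);

    public static void main(String[] args) {
        long age = Duration.between(lastCheckpointTime, Instant.now()).getSeconds();
        // Could be exposed as e.g. secondsSinceLastCheckpoint in /running-jobs (proposed)
        System.out.println("secondsSinceLastCheckpoint=" + age);

        // Hypothetical alerting threshold: flag jobs whose checkpointing has stalled
        boolean stalled = age > 300;
        System.out.println("checkpointStalled=" + stalled);
    }
}
```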

SeaTunnel Version

SeaTunnel version: 2.3.12 (Zeta engine)
Environment: Kubernetes, S3 checkpoint storage

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
