refactor(ci): rewrite fix-dependabot to capture all CI failures #333

Merged
umair-ably merged 3 commits into main from refactor/dependabot-workflow-rewrite
Apr 15, 2026

@umair-ably
Collaborator

Summary

Rewrites the Fix Dependabot PRs workflow from a single job that duplicated build/lint/test internally to a two-job architecture that waits for all CI workflows to complete and captures their failures.

Problem

The previous workflow ran its own build, lint, and unit test steps. When those passed, it assumed everything was fine — but other CI workflows (E2E CLI, Web CLI Playwright E2E, Security Audit) run separately. Their failures were invisible to the Claude fixer.

Example: PR #332 had a React useState duplication bug caught by Playwright E2E tests, but Claude was never invoked because the unit tests passed within the fix-dependabot workflow.

Solution

  • Job 1 (regen-lockfile): Same as before — guard for dependabot PRs, regenerate pnpm-lock.yaml, commit + push. Outputs the HEAD SHA.
  • Job 2 (fix-failures): Polls the GitHub check runs API on the HEAD SHA every 30s, waiting for all other CI workflows to complete. If any fail, fetches their logs via gh run view --log-failed and passes everything to Claude Code Action in one shot.
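
The wait loop in Job 2 can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual script: `fetch_checks`, `all_completed`, `poll_checks`, and the `FETCH_CMD` override are hypothetical names, the tab-separated line format is an assumption, and the initial 60s wait and check filtering are left out.

```shell
# Sketch only: function and variable names are hypothetical, not the workflow's.
POLL_INTERVAL="${POLL_INTERVAL:-30}"    # seconds between polls
MAX_POLL_TIME="${MAX_POLL_TIME:-1500}"  # 25-minute polling budget
TAB="$(printf '\t')"

# Print one "name<TAB>status<TAB>conclusion" line per check run on $HEAD_SHA.
fetch_checks() {
  gh api "repos/${GITHUB_REPOSITORY}/commits/${HEAD_SHA}/check-runs" \
    --paginate \
    --jq '.check_runs[] | [.name, .status, (.conclusion // "")] | @tsv'
}

# Succeed only if there is at least one check and every check is "completed".
all_completed() {
  [ -n "$1" ] || return 1
  printf '%s\n' "$1" | while IFS="$TAB" read -r _name status _conclusion; do
    [ "$status" = "completed" ] || exit 1
  done
}

# Poll until all checks complete (print them, return 0) or the budget runs out (return 1).
poll_checks() {
  elapsed=0
  while [ "$elapsed" -lt "$MAX_POLL_TIME" ]; do
    if checks="$("${FETCH_CMD:-fetch_checks}")" && all_completed "$checks"; then
      printf '%s\n' "$checks"
      return 0
    fi
    sleep "$POLL_INTERVAL"
    elapsed=$((elapsed + POLL_INTERVAL))
  done
  return 1
}
```

Routing the fetch through `FETCH_CMD` is only a convenience for exercising the loop without network access; in a workflow step one would call `poll_checks` directly with `HEAD_SHA` set from Job 1's output.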

Key design decisions

  • Removes duplicated work: No more internal build/lint/test steps — we rely on the real CI workflows instead
  • Polls check runs API: Waits for at least 3 of 4 core CI checks (test, e2e-cli, setup, audit) to appear, then waits for all to complete
  • Skips non-CI checks: Filters out own workflow jobs, Vercel deployments, and PR tooling (claude-review, PR overview)
  • 25-minute polling timeout: Leaves ~15 minutes for Claude within the 45-minute job timeout
  • Concurrency group: Prevents duplicate polling when the lockfile push re-triggers this workflow
  • Initial 60s wait: Gives CI checks time to be queued after the lockfile push
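
The skip filter and the "at least 3 of 4" readiness rule could look something like the sketch below. The check names and the regex are assumptions modelled on the list above and may not match the real `EXPECTED_CHECKS` and `SKIP_PATTERN` values; `filter_ci_checks` and `expected_checks_present` are hypothetical helper names.

```shell
# Hypothetical values modelled on the PR description; the real workflow's may differ.
EXPECTED_CHECKS="test e2e-cli setup audit"
MIN_EXPECTED=3
SKIP_PATTERN='^(regen-lockfile|fix-failures|Vercel.*|claude-review|Generate PR Overview)$'

# Read check names on stdin, print only those the poller should wait on.
filter_ci_checks() {
  # grep exits 1 when every line is filtered out; that is not an error here.
  grep -Ev "$SKIP_PATTERN" || true
}

# Succeed once at least MIN_EXPECTED of the expected check names appear on stdin.
expected_checks_present() {
  names="$(cat)"
  found=0
  for expected in $EXPECTED_CHECKS; do
    if printf '%s\n' "$names" | grep -Fxq "$expected"; then
      found=$((found + 1))
    fi
  done
  [ "$found" -ge "$MIN_EXPECTED" ]
}
```

Requiring only a subset of the expected names keeps the poller from stalling if one workflow is skipped for a given PR, at the cost of possibly starting the completion wait before a slow-to-queue check has appeared.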

Test plan

  • Verify the workflow triggers correctly on a dependabot PR
  • Verify the polling correctly waits for and detects CI check completions
  • Verify failed check logs are collected and passed to Claude
  • Verify the concurrency group cancels stale runs when re-triggered

🤖 Generated with Claude Code

Instead of duplicating build/lint/test steps internally, the workflow
now polls the GitHub check runs API to wait for all other CI workflows
(unit tests, E2E CLI, Web CLI E2E, security audit) to complete, then
collects failure logs and passes them to Claude in one shot.

This fixes the gap where Playwright E2E failures (e.g., React useState
duplication from dependency bumps) were invisible to the Claude fixer.

Structure:
- Job 1 (regen-lockfile): guard + regen pnpm-lock.yaml + push
- Job 2 (fix-failures): poll check runs API, collect failures, invoke Claude

Also adds concurrency group to prevent duplicate polling when the
lockfile push re-triggers the workflow.
@vercel

vercel bot commented Apr 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| cli-web-cli | Ready | Preview, Comment | Apr 15, 2026 6:06pm |

@claude-code-ably-assistant

Walkthrough

This PR rewrites the Fix Dependabot PRs CI workflow from a single self-contained job that ran its own build/lint/test steps into a two-job architecture. Job 1 regenerates the lockfile and captures the HEAD SHA; Job 2 polls the GitHub check-runs API until all other CI workflows complete, then collects failure logs and passes them to Claude Code Action for repair. The motivation is that the old workflow missed failures in separately-running workflows (E2E CLI, Playwright Web CLI, security audit) — as demonstrated by PR #332 where a React bug slipped through.

Changes

| Area | Files | Summary |
| --- | --- | --- |
| Config / CI | .github/workflows/dependabot-lockfile.yml | Full rewrite: split into regen-lockfile + fix-failures jobs; replace internal build/lint/test steps with polling the check-runs API; add concurrency group and checks: read permission |

Review Notes

  • Behavioral change: The workflow no longer runs pnpm install, pnpm build, pnpm exec eslint ., or pnpm test:unit itself — it relies entirely on the existing CI workflows. If a check is added/renamed in CI, the EXPECTED_CHECKS array and SKIP_PATTERN regex in the polling step may need updating.
  • New permission: checks: read added at the workflow level to allow polling the check-runs API.
  • Concurrency group: cancel-in-progress: true cancels stale polling runs when the lockfile push re-triggers the workflow — reviewers should confirm this is the desired behaviour (i.e., only the latest run should attempt fixes).
  • Timeout budget: fix-failures job has a 45-minute timeout, with 25 minutes reserved for polling and ~15 minutes left for Claude. If any CI workflow routinely takes >25 minutes, the poller will time out with a warning rather than an error — Claude won't be invoked.
  • Skip logic for checks: The SKIP_PATTERN regex filters out regen-lockfile, fix-failures, Vercel, claude-review, and Generate PR Overview. Any new PR-level check added in future should be added here to avoid it blocking the poller.
  • Log truncation: Each failed workflow's logs are capped at tail -n 500 lines before being written to $GITHUB_OUTPUT. Very long failure outputs will be silently truncated.
  • actions/create-github-app-token@v3: Introduced in fix-failures (not present in the old single job). Ensure CI_APP_ID and CI_APP_PRIVATE_KEY secrets are available in the repo/org scope.
  • No source-code changes: All modifications are confined to the single workflow YAML file — no TypeScript, tests, or docs affected.
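
On the log truncation note, one possible way to make the cap visible rather than silent is sketched below. The `truncate_log` helper and its marker line are hypothetical additions, not something this PR implements; only the 500-line cap comes from the workflow.

```shell
MAX_LOG_LINES=500

# Read a log on stdin and print at most the last MAX_LOG_LINES lines.
# When lines are dropped, a one-line note is printed first (an illustrative
# addition, not part of the workflow in this PR).
truncate_log() {
  log="$(cat)"
  total="$(printf '%s\n' "$log" | wc -l | tr -d ' ')"
  if [ "$total" -gt "$MAX_LOG_LINES" ]; then
    echo "[truncated: showing last ${MAX_LOG_LINES} of ${total} lines]"
  fi
  printf '%s\n' "$log" | tail -n "$MAX_LOG_LINES"
}
```

It would be used as a filter on the log fetch, for example `gh run view "$run_id" --log-failed | truncate_log`, where `run_id` is whatever variable holds the failed run's ID.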

🤖 Generated with Claude Code

@claude-code-ably-assistant claude-code-ably-assistant bot left a comment

Review summary

Architecture: solid. Moving from duplicated internal build/lint/test to polling real CI results directly solves the problem from PR #332. The two-job split, concurrency group, and filtering logic are all well-thought-out.

Two correctness bugs to address before merging.


Bug 1: Heredoc delimiter collision (medium risk)

The failure_logs and failure_summary step outputs use fixed heredoc delimiters (ENDOFLOGS, ENDOFFAILURES) to write multi-line values to $GITHUB_OUTPUT. The content of failure_logs comes from raw gh run view --log-failed output. If any CI log line contains exactly ENDOFLOGS, GitHub Actions closes the multiline value early and Claude receives a truncated log missing the actual failure cause.

Fix - use a randomised delimiter that cannot appear in log output:

delimiter="EOF_$(openssl rand -hex 16)"
{ echo "failure_logs<<${delimiter}"; echo "$failure_logs"; echo "${delimiter}"; } >> "$GITHUB_OUTPUT"

Apply the same pattern to failure_summary.
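
To avoid repeating the pattern for each output, it could be wrapped in a small helper. This is a sketch under assumptions: `write_output` and `random_hex` are hypothetical names, and the `od` fallback is an illustrative alternative for environments without `openssl`, not something the workflow uses.

```shell
# Print 32 hex characters of randomness for use in a delimiter.
random_hex() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 16
  else
    # Fallback (assumption): read 16 random bytes and hex-encode them.
    od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
  fi
}

# Append a multi-line step output to $GITHUB_OUTPUT using a per-call delimiter.
# Usage: write_output NAME VALUE
write_output() {
  name="$1"
  value="$2"
  delimiter="EOF_$(random_hex)"
  {
    printf '%s<<%s\n' "$name" "$delimiter"
    printf '%s\n' "$value"
    printf '%s\n' "$delimiter"
  } >> "$GITHUB_OUTPUT"
}
```

Both outputs then become one-liners: `write_output failure_logs "$failure_logs"` and `write_output failure_summary "$failure_summary"`.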


Bug 2: Silent non-action when the API fails during polling

If every API call inside the polling loop fails (transient outage, rate limit, permissions), each iteration takes the continue path and ci_checks is never populated. After the timeout break, the failure-collection code runs against an empty ci_checks variable: failed_count=0, and the step exits 0 with the message "All checks passed! Nothing to fix."

Real CI failures are silently missed - the job exits green without invoking Claude.

Fix - after the loop, fail explicitly when timing out with no check data received:

if [[ $elapsed -ge $MAX_POLL_TIME && -z "$ci_checks" ]]; then
  echo "::error::Timed out waiting for CI checks - no check data received"
  exit 1
fi


Minor (no change needed): The failure_logs value is expanded into the prompt through a GitHub Actions expression, so a }} sequence in the logs could end the expression early. This was also the case in the old workflow.

- Use randomised EOF delimiters for GITHUB_OUTPUT heredocs to prevent
  collision with raw CI log content truncating the output early
- Fail explicitly (exit 1) when the polling loop times out without
  ever receiving check data, instead of silently reporting success
- Add pnpm/Node.js setup to fix-failures job so Claude can run
  build/lint/test commands (critical — was missing entirely)
- Use Vercel.* prefix match in SKIP_PATTERN for resilience
- Add generate-overview fallback to SKIP_PATTERN
- Include cancelled checks in failure detection
- Add run URL to failure logs for manual inspection
- Log pending checks at polling timeout for debugging
- Add SHA context logging
- Default failed_count=0 at step start
- Document EXPECTED_CHECKS source workflows
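
As an illustration of the "include cancelled checks" item, failure detection over completed checks might be sketched like this. The tab-separated input format and the `failed_checks` name are assumptions made for the example, not the workflow's actual code.

```shell
# Read "name<TAB>status<TAB>conclusion" lines on stdin and print the names of
# checks that should be treated as failures: conclusion "failure" or "cancelled".
failed_checks() {
  awk -F '\t' '$3 == "failure" || $3 == "cancelled" { print $1 }'
}
```

Counting the output lines (for example with `wc -l`) gives a value that could feed a `failed_count` variable.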
@sacOO7
Contributor

sacOO7 commented Apr 15, 2026

Review feedback:

  1. Add pnpm/Node.js setup to fix-failures job
  2. Replace setup with a real test check in EXPECTED_CHECKS
  3. Make SKIP_PATTERN more resilient with prefix matching

Contributor

@sacOO7 sacOO7 left a comment

LGTM

@umair-ably umair-ably merged commit 41e662e into main Apr 15, 2026
8 of 9 checks passed
@umair-ably umair-ably deleted the refactor/dependabot-workflow-rewrite branch April 15, 2026 18:11