- @fffrog
- @can-gaa-hou
- @KarhouTam
This RFC proposes a cross-repo CI coordination mechanism based on a GitHub App and a Relay Server. The system allows specific events in the PyTorch repository to automatically trigger CI jobs in downstream repos. Based on configuration, results can optionally be reported back to the PyTorch PR check list or HUD, with support for hints, PR blocking, re-runs, and update notifications. This provides strong compatibility guarantees between PyTorch and its downstream repositories.
Note
- This design does not guarantee 100% event delivery.
- Each downstream repo that uses this mechanism is the primary owner of its own CI failures.
This design involves four core components:
- GitHub App: The authentication and decoupling hub. It declares required permissions, event subscriptions, and handles authorization.
- PyTorch Repo: The upstream event source. It authorizes external communication through the GitHub App.
- Relay Server: The core dispatcher. It handles Webhook events from GitHub and callback requests from downstream repos.
- Downstream Repos: The repos that actually run the tests (out-of-tree backend accelerator plugins).
The overall flow and component responsibilities are shown below:
```mermaid
sequenceDiagram
    actor Dev as Developer
    participant PT as PyTorch
    box rgba(36,41,47,0.08) GitHub Platform
        participant APP as GitHub App<br/>(Auth Hub)
        participant API as GitHub API
    end
    box rgba(0,118,255,0.08) Relay Server
        participant GW as AWS API Gateway
        participant LM as AWS Lambda
        participant CH as ClickHouse
    end
    participant DS as Downstream Repo

    Dev->>PT: Create/Update PR / Re-run Check
    PT->>APP: Webhook Event<br/>(PR opened/reopened/synchronize, Check rerequested)
    APP->>GW: POST /relay-webhook<br/>(event payload)
    GW->>LM: Invoke Relay Handler
    Note over LM: Verify X-Hub-Signature-256<br/>(Webhook Secret)
    Note over LM: Read YAML allowlist<br/>from test-infra repo
    LM->>APP: Sign JWT with Private Key
    APP-->>LM: JWT
    LM->>API: Exchange JWT for Installation Access Token
    API-->>LM: Installation Access Token
    LM->>API: Query repos with App installed
    API-->>LM: Installed repo list
    Note over LM: Intersect installed repos ∩ allowlist (forward_event=true)<br/>→ eligible downstream repos
    loop For each eligible downstream repo
        alt L4 OR (L3 AND PR has ciflow/oot/<name> label)
            LM->>API: Create in_progress Check Run
            API->>PT: PR Checks show "in progress"
        end
        LM->>API: repository_dispatch + payload (fire-and-forget)
        API->>DS: Trigger Workflow
    end
    Note over LM: (Optional) Write dispatch log to ClickHouse
    Note over DS: Pull PR code → Build → Test
    DS->>GW: POST /relay-callback<br/>(conclusion + details URL)
    GW->>LM: Invoke Callback Handler
    Note over LM: Verify callback payload
    alt display_on_hud=true (L2+)
        LM->>CH: Write to oot_ci_results table (OOT HUD page)
    end
    alt L4 OR (L3 AND ciflow/oot/<name> label present)
        LM->>API: Request PyTorch Installation Token (via App identity)
        API-->>LM: Token
        alt L4 (block_pr=true)
            LM->>API: Update Check Run (blocking, failure blocks merge)
        else L3 with label (display_on_pr=label_only)
            LM->>API: Update Check Run (informational, non-blocking)
        end
        API->>PT: PR Checks updated with result
    end
```
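The two gating decisions in the dispatch flow — which installed repos actually receive events, and when an upstream Check Run is created — can be sketched as follows. This is a minimal illustration, not a prescribed API; the function names and dictionary shape are assumptions.

```python
def eligible_repos(installed_repos: set[str], allowlist: dict[str, dict]) -> list[str]:
    """Intersect the repos with the App installed against allowlist entries
    whose forward_event flag is true (the diagram's intersection step)."""
    return sorted(
        repo for repo in installed_repos
        if allowlist.get(repo, {}).get("forward_event", False)
    )


def should_create_check_run(level: str, pr_labels: set[str], backend_name: str) -> bool:
    """A Check Run is created for L4 repos on every PR, and for L3 repos
    only when the PR carries the matching ciflow/oot/<name> label."""
    return level == "L4" or (level == "L3" and f"ciflow/oot/{backend_name}" in pr_labels)
```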
The GitHub App is the authentication and decoupling hub of the entire design. It plays three key roles:
- Unified identity: The Relay Server calls the GitHub API as the GitHub App identity (not a personal PAT). The permission lifecycle is independent of any personal account, making it secure and manageable.
- Event bridge: Through Webhook subscriptions, GitHub pushes PR events to the Relay Server in real time, driving the entire automation flow.
- Upstream/downstream decoupling: Downstream repos only need to install this App to join the cross-repo CI coordination. Downstream repos do not need an upstream token, and the upstream does not need to know about the downstream. All interactions are bridged through the GitHub App and Relay Server.
Note
This GitHub App should be created under the pytorch organization and owned by the PyTorch team or the LF AI & Data Foundation team, to ensure credibility. An App created by a third party will face trust issues during installation and adoption.
Following the principle of least privilege, the full set of permissions and event subscriptions required by the GitHub App is as follows:
| Category | Permission | Level | Purpose |
|---|---|---|---|
| Repository | Actions | Read & Write | Re-run downstream CI workflows on demand; when a developer re-requests a Check Run on a GitHub PR page, the Relay Server can re-dispatch the OOT job |
| Repository | Checks | Read & Write | Create/update Check Runs on PyTorch PRs to relay downstream CI status and conclusions |
| Repository | Contents | Read & Write | Read repo content and config files; send repository_dispatch events to downstream repos to trigger CI workflows |
| Repository | Pull requests | Read-only | Read PR metadata (branch, SHA, author, etc.) to build the downstream payload |
| Repository | Metadata | Read-only | Required base permission for GitHub Apps |
| Event | Purpose |
|---|---|
| Pull request | Trigger downstream CI when a PR is opened or reopened; clean up when a PR is closed |
| Check suite | Handle re-run requests for all Check Runs within a Check Suite |
| Check run | Handle re-run requests for a single Check Run |
| Config | Description |
|---|---|
| Webhook URL | The public endpoint of the Relay Server. GitHub pushes all subscribed events here. |
| Webhook Secret | A high-entropy random string. The Relay Server uses it to verify the request source and prevent forgery. |
| Private Key | Download the .pem file after creation. The Relay Server uses it to generate a JWT for calling the GitHub API. Must be stored securely and never committed to a code repository. |
| Installation Scope | Must be set to "Any account" (public installation). Downstream repos may belong to different organizations, and only a public installation can support cross-organization coordination. |
After the GitHub App is created, a PyTorch repo Maintainer opens the App installation link in a browser, selects the PyTorch repository, and confirms the installation.
After a downstream repo receives the repository_dispatch event from the Relay Server, typical CI workflow steps are:
- Pull the PyTorch PR source code based on PR info in the payload
- Build PyTorch
- Check out the downstream repo (using the existing `checkout` action)
- Build the downstream plugin
- Run tests
- Report results to the Relay Server
Step 1 and Step 6 are common needs shared by all downstream repos, and the logic is identical. To avoid requiring downstream repos to understand the Relay Server integration details (such as callback URLs and auth tokens), and to standardize the result reporting format, this proposal suggests creating two reusable Actions in the PyTorch repo for all downstream repos to reference directly.
Checks out the PyTorch PR source code into the downstream workspace, based on PR metadata (PR number, head SHA, head ref, etc.) in the repository_dispatch event payload. This Action encapsulates all the logic for checking out PyTorch code. Downstream repos can use it directly without parsing the payload or running git commands themselves.
Called as the last step of the downstream CI workflow to report standardized execution results to the Relay Server. This Action defines a structured format with required fields (e.g., conclusion, details_url) and optional fields (e.g., summary), ensuring all downstream repos output consistent, machine-readable results. The Relay Server then decides, based on the allowlist policy, whether to report the result back to the PyTorch PR as a Check Run.
Note
This structured data is the source for the downstream repo's dedicated HUD page. It can be dynamically extended based on downstream repo requirements, including run results, downstream CI infrastructure status, and more.
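As a sketch of how the Relay Server might validate this structured payload: the required fields `conclusion` and `details_url` and the optional `summary` come from the description above, while the accepted conclusion values are an assumption, mirroring GitHub's Check Run conclusions.

```python
# Assumed to mirror the conclusion values GitHub accepts for Check Runs.
VALID_CONCLUSIONS = {"success", "failure", "neutral", "cancelled", "timed_out", "skipped"}


def validate_report(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the payload is acceptable."""
    errors = []
    for field in ("conclusion", "details_url"):  # required fields
        if not payload.get(field):
            errors.append(f"missing required field: {field}")
    if payload.get("conclusion") and payload["conclusion"] not in VALID_CONCLUSIONS:
        errors.append(f"unknown conclusion: {payload['conclusion']}")
    return errors
```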
The Relay Server is the core dispatcher. It handles two types of incoming requests and applies the allowlist policy:
- PR events from PyTorch (Webhook push): Trigger CI in downstream repos
- Callbacks from downstream repos (HTTP callback): Relay CI results back to the upstream PR
The allowlist is the core control mechanism of the Relay Server. It determines how deeply each downstream repo participates. The Relay Server maintains the following fields for each downstream repo (repos not explicitly configured use default values):
| Field | Type | Description | Default |
|---|---|---|---|
| `forward_event` | bool | Controls whether upstream PR events are forwarded to this downstream repo. Acts as a gate to prevent abuse from unauthorized large-scale installations. | `false` |
| `display_on_hud` | bool | Controls whether downstream CI results are shown on the dedicated OOT HUD page (`hud.pytorch.org/oot/[org]/[repo]`). | `false` |
| `display_on_pr` | string | Controls whether downstream CI results can be shown as a Check Run in the PR Checks list. Options: `always`, `label_only`, `false`. | `false` |
| `block_pr` | bool | Controls whether the Check Run shown in PR Checks can block PR merges. | `false` |
Based on these four fields, four participation levels are defined:
| Level | forward_event | display_on_hud | display_on_pr | block_pr | Description |
|---|---|---|---|---|---|
| L1 | `true` | `false` | `false` | `false` | Events are forwarded to downstream, but upstream receives no downstream feedback. |
| L2 | `true` | `true` | `false` | `false` | Downstream results are shown on the dedicated HUD page by default. |
| L3 | `true` | `true` | `label_only` | `false` | Adds a non-blocking Check Run for the device in PR Checks (only triggered when the `ciflow/oot/<name>` label is added). |
| L4 | `true` | `true` | `always` | `true` | Adds a blocking Check Run for the device in PR Checks (auto-triggered for every PR); reserved for critical accelerators only. |
Note
- Repos not listed in the allowlist YAML default to: no downstream CI triggered, no results forwarded, and no PR merges blocked.
- For L3, the `ciflow/oot/<name>` label permission is enforced by the existing pytorchbot mechanism in PyTorch, which already prevents unauthorized users (anyone other than the PR author or a Maintainer) from adding `ciflow/oot/<name>` labels to a PR.
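The mapping from level to the four allowlist fields, including the unlisted-repo default, can be expressed directly in code. This is a sketch; the dictionaries simply transcribe the two tables above, and the function name is illustrative.

```python
# Repos absent from the allowlist get the all-off defaults.
DEFAULT = {"forward_event": False, "display_on_hud": False,
           "display_on_pr": "false", "block_pr": False}

LEVELS = {
    "L1": {"forward_event": True, "display_on_hud": False, "display_on_pr": "false",      "block_pr": False},
    "L2": {"forward_event": True, "display_on_hud": True,  "display_on_pr": "false",      "block_pr": False},
    "L3": {"forward_event": True, "display_on_hud": True,  "display_on_pr": "label_only", "block_pr": False},
    "L4": {"forward_event": True, "display_on_hud": True,  "display_on_pr": "always",     "block_pr": True},
}


def policy_for(repo: str, allowlist: dict[str, str]) -> dict:
    """allowlist maps 'org/repo' -> level; unlisted repos fall back to DEFAULT."""
    return LEVELS.get(allowlist.get(repo, ""), DEFAULT)
```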
Level management is based on a YAML config file. Downstream repo developers who meet the requirements in the Evolution Path section can submit a PR with supporting materials to modify this config file. Community Maintainers/TAC will review and merge it.
```yaml
L1:
  - org1/repo1
  - org2/repo2
L2:
  - org3/repo3
L3:
  - org4/repo4
L4:
  - org5/repo5: "@oncall1,oncall2"
```

The allowlist is designed to naturally support gradual progression from experimental to mature participation: downstream repos that meet the requirements can apply to advance level by level (L1 → L2 → L3 → L4). The table below lists the requirements for advancing to each level.
| Phase | Level | Requirements |
|---|---|---|
| Onboarding | L1 | 1. GitHub App installed<br/>2. Provide verifiable accelerator hardware information<br/>3. Provide a downstream adaptation repo for the accelerator |
| Observation | L2 | 1. Follow the standard Workflow Configuration to receive events and report results<br/>2. Must not send excessive or invalid requests to the Relay Server |
| Stable | L3 | 1. CI infrastructure must keep job queue time under X min per workflow<br/>2. CI infrastructure must keep total run time under X hours per PR<br/>3. Weekly CI success rate > X% (counting both infra failures and test failures) |
| Mature | L4 | Fully determined by Core Maintainers, considering factors including but not limited to: community adoption, hardware usage, test coverage (whether the PyTorch core test suite is required, @mikaylagawarecki), test pass rate, oncall responsiveness, etc. |
Note
- The requirements above are an initial reference and may be adjusted over time based on real-world conditions (e.g., determining the specific values of `X`).
- To maintain the PyTorch community's user experience, downstream repos that no longer meet the requirements of their current level will be downgraded to the level that matches their actual status.
- `L3` is the recommended long-term target for most downstream repos, as it provides a good balance between signal depth and minimal negative impact on upstream.
- `L4` only applies to a small number of downstream repos. Detailed requirements will be defined before any backend approaches the `L4` bar.
After the GitHub App is created, downstream repos that want to receive PyTorch PR events open the App installation link in a browser, select the target repo or organization, and confirm. This completes the downstream repo's authorization of the GitHub App.
Downstream developers follow the requirements in the Evolution Path section and submit a PR to the test-infra repo with supporting evidence (e.g., accelerator details, CI reliability metrics). After a community Maintainer reviews and merges it, the Relay Server will forward events according to the updated config.
Downstream backend CI workflows share a similar structure: pull upstream code, build, test, and report results. Below is a reference template showing the key steps. Each backend can add intermediate steps as needed.
```yaml
name: PyTorch CI

on:
  repository_dispatch:
    types: [pytorch-pr-trigger]

concurrency:
  group: upstream-pr-${{ github.event.client_payload.pr_number }}
  cancel-in-progress: true

jobs:
  upstream-ci:
    runs-on: [self-hosted]
    steps:
      # Step 1: Report the startup status to the Relay Server so that a Check Run
      # with status "in progress" can be created on the PR
      - name: Report startup to Relay Server
        uses: pytorch/actions/report-ci-result@v1
        with:
          url: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
          # other parameters...

      # Step 2: Check out PyTorch PR code
      - name: Checkout PyTorch PR
        uses: pytorch/actions/checkout-pr@v1
        with:
          head-sha: ${{ github.event.client_payload.head_sha }}
          path: pytorch  # Check out to the ./pytorch directory

      # Step 3: Build PyTorch
      - name: Build PyTorch
        working-directory: pytorch
        run: |
          pip3 install -vvv --no-build-isolation -e .  # Example only

      # Step 4: Check out the downstream repo
      - name: Checkout downstream repo
        uses: actions/checkout@v4
        with:
          path: backend

      # Step 5: Build the backend plugin
      - name: Build backend
        working-directory: backend
        run: |
          python setup.py develop

      # Step 6: Run tests
      - name: Test backend
        working-directory: backend
        run: |
          pytest tests/

      # Step 7: Report results to the Relay Server
      - name: Report CI result
        uses: pytorch/actions/report-ci-result@v1
        if: always()  # Must report regardless of success or failure
        with:
          conclusion: ${{ job.status }}
          url: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
          # other parameters...
```

The PyTorch CI HUD (hud.pytorch.org) is a CI status dashboard maintained by the PyTorch team. It shows all CI job runs for every commit and PR. To minimize the impact of downstream repo results on the PyTorch HUD, this proposal introduces the following HUD integration strategy:
- Dedicated OOT repo page (`hud.pytorch.org/oot/[org]/[repo]`): available from L2 onwards. Downstream CI results are shown on a dedicated page for the downstream repo. The layout is similar to the main HUD page, but focused only on the test history of a single downstream repo. It gives OOT Maintainers a self-service CI health dashboard without affecting any upstream views.
- Global OOT CI summary page: primarily for PyTorch CI Maintainers. Provides a global view of OOT CI health across all repos, making it easy to spot widespread OOT infrastructure issues or identify repos that may need to be downgraded.
- HUD PR view (`hud.pytorch.org/pr/<number>`): for L3/L4 repos, OOT check results are displayed in the PR-level HUD view under a dedicated "Out-of-Tree Backends" section, grouped alongside standard CI results. This ensures developers and Maintainers can see OOT CI status in their normal PR review workflow without switching to a separate page.
- Main HUD page: only shows `L4` (required) Check Runs, ensuring the main dashboard stays focused on signals that every PyTorch contributor needs to care about.
The Relay Server writes all results from L2 and above into ClickHouse, which powers the dedicated OOT HUD pages described above.
The design described in the Architecture section processes webhook events and downstream fan-out synchronously within a single Lambda invocation. While straightforward, this means that if the Lambda encounters a transient failure mid-dispatch (e.g., GitHub API rate limit spike, network timeout), some downstream repos may miss the event with no built-in retry mechanism. This RFC therefore proposes a more resilient alternative architecture:
The webhook handler only performs signature verification and task persistence (writing dispatch tasks to DynamoDB), returning a 202 Accepted immediately. DynamoDB Streams then trigger separate Lambda invocations for asynchronous fan-out, providing built-in retry semantics, per-task isolation, and reliable at-least-once delivery to all eligible downstream repos.
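A minimal sketch of that webhook handler follows. The `table` argument stands in for a boto3 DynamoDB `Table` object (anything exposing `put_item(Item=...)`), so the signature-verification and persistence logic can be shown without AWS-specific wiring; the item field names are assumptions.

```python
import hashlib
import hmac
import json
import uuid


def handle_webhook(headers: dict, body: bytes, secret: bytes, table) -> dict:
    """Verify X-Hub-Signature-256, persist the dispatch task, and return 202.

    Fan-out to downstream repos then happens asynchronously via DynamoDB
    Streams, giving per-task isolation and built-in retries.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(headers.get("X-Hub-Signature-256", ""), expected):
        return {"statusCode": 401, "body": "invalid signature"}
    table.put_item(Item={
        "task_id": str(uuid.uuid4()),   # one task per event, for per-task retries
        "status": "pending",
        "payload": body.decode("utf-8"),
    })
    return {"statusCode": 202, "body": "accepted"}
```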
```mermaid
sequenceDiagram
    actor Dev as Developer
    participant PT as PyTorch
    box rgba(36,41,47,0.08) GitHub Platform
        participant APP as GitHub App<br/>(Auth Hub)
        participant API as GitHub API
    end
    box rgba(0,118,255,0.08) Relay Server
        participant V as Vercel API Route
        participant DB as DynamoDB
        participant LM as AWS Lambda
        participant CH as ClickHouse
    end
    participant DS as Downstream Repo

    Dev->>PT: Create/Update PR / Re-run Check
    PT->>APP: Webhook Event<br/>(PR opened/reopened/synchronize, Check rerequested)
    APP->>V: Webhook push → /api/relay/webhook<br/>(event payload)
    Note over V: Verify X-Hub-Signature-256<br/>(Webhook Secret)
    V->>DB: Write to relay-dispatch-tasks
    V-->>APP: Return 202 Accepted (< 3s)
    Note over LM: Asynchronous Fan-out (Max 15 min)
    DB->>LM: DynamoDB Streams trigger
    LM->>APP: Sign JWT with Private Key
    APP-->>LM: JWT
    LM->>API: Exchange JWT for Installation Access Token
    API-->>LM: Installation Access Token
    LM->>API: Query repos with App installed
    API-->>LM: Installed repo list
    Note over LM: Intersect installed repos ∩ allowlist (forward_event=true)<br/>→ eligible downstream repos
    loop For each eligible downstream repo
        alt L4 OR (L3 AND PR has ciflow/oot/<name> label)
            LM->>API: Create in_progress Check Run
            API->>PT: PR Checks show "in progress"
        end
        LM->>API: repository_dispatch + payload
        API->>DS: Trigger Workflow
    end
    Note over DS: Pull PR code → Build → Test
    DS->>V: Callback → /api/relay/callback<br/>(conclusion + details URL)
    Note over V: Verify callback payload
    alt display_on_hud=true (L2+)
        V->>CH: Write to oot_ci_results table (OOT HUD page)
    end
    alt L4 OR (L3 AND ciflow/oot/<name> label present)
        V->>API: Request PyTorch Installation Token (via App identity)
        API-->>V: Token
        alt L4 (block_pr=true)
            V->>API: Update Check Run (blocking, failure blocks merge)
        else L3 with label (display_on_pr=label_only)
            V->>API: Update Check Run (informational, non-blocking)
        end
        API->>PT: PR Checks updated with result
    end
```
An end-to-end prototype has been completed. A few key points are noted below.
- Upstream repo — Checks overview
  - To avoid cluttering the PyTorch PR Checks page, the naming convention for Check Runs is defined as `oot / DEVICE / workflow detail`. With this naming, related checks are automatically grouped alphabetically in the UI.
  - By configuring `concurrency` in the downstream workflow, when a PR is updated, the old Action run is cancelled and a new one starts. The upstream Check Run is updated accordingly.
- Upstream repo — Check Run details
  - The right-hand detail panel can be customized to show the downstream Action link, run result, and key errors (and can be extended later as needed).
  - Clicking `Re-run` or `Re-run checks` triggers the downstream Action to re-run, and the new result automatically syncs back to the upstream Check Run.
- Webhook signature verification: the Relay Server must verify the `X-Hub-Signature-256` header on every Webhook request to confirm the request comes from GitHub and prevent third-party event forgery.
- Callback authentication: when a downstream repo sends a callback result to the Relay Server, the server must verify the legitimacy of the request (e.g., validate the PR / Check Run ID in the payload) to prevent malicious repos from injecting fake CI results into upstream PRs.
- Token expiry: GitHub App Installation Access Tokens are valid for 1 hour. The Relay Server should request them on demand and discard them immediately after use, with no long-term caching, to minimize the impact window of a token leak.
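One way to enforce the no-caching policy from the last point is to scope each token to a single operation. This is a sketch under assumptions: `fetch_token` stands in for the JWT-to-installation-token exchange and is expected to return `(token, expires_at_epoch)`.

```python
import contextlib
import time


@contextlib.contextmanager
def installation_token(fetch_token):
    """Yield a freshly fetched installation token for one operation, then drop it.

    fetch_token performs the JWT -> installation-token exchange and returns
    (token, expires_at_epoch); GitHub issues these tokens with a 1-hour lifetime.
    """
    token, expires_at = fetch_token()
    if expires_at <= time.time():
        raise RuntimeError("token already expired on receipt")
    try:
        yield token
    finally:
        del token  # no reference survives the operation; nothing is cached
```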