Add neuron-dra e2e test cases with DRA driver and dranet manifests by nakshah87 · Pull Request #782 · aws/aws-k8s-tester

nakshah87 · 2026-04-09T09:24:09Z

Issue #, if available:

Add Neuron DRA e2e test suite

This PR adds end-to-end tests for the Neuron DRA (Dynamic Resource Allocation) driver, validating that Neuron accelerator
and EFA RDMA devices are correctly allocated to multi-node MPI workloads via the Kubernetes DRA framework.

What's included

Test framework (test/cases/neuron-dra/)

main_test.go — Test harness that orchestrates setup/teardown of all dependencies: MPI operator, dranet DaemonSet,
ResourceClaimTemplates, and the Neuron DRA driver (installed via Helm). Dynamically builds the manifest and setup
function list based on the instance family's RDMA type.
neuron_dra_test.go — Data-driven test runner that discovers test cases from embedded YAML files, computes MPIJob
parameters from ResourceClaimTemplate specs, and runs both positive (MPIJob succeeds) and negative (workers remain
Pending) assertions.
topology.go — Instance topology registry mapping instance families (trn1, trn2) to their Neuron core counts, RDMA
types, and test case directories. Also contains all parsing, parameter computation, and MPIJob template rendering logic.

MPIJob template (templates/nccom-test-mpijob.yaml.tmpl)

Go template for rendering MPIJob manifests. Parameterized on slots-per-worker, total ranks, worker replicas, container
image, and resource claims. The launcher runs nccom-test across all workers; workers expose SSH and configure RDMA
networking.

Test cases (testcases/trn1/)

all-efas-all-neurons.yaml — Positive test: allocate all Neuron devices and all EFA interfaces per node.
2-efas-4-neurons-wrong-match.yaml — Negative test: request a mismatched device group constraint that should be
unschedulable.

ResourceClaimTemplates (rcts/trn1/)

rct-all-efas-all-neurons.yaml — Claims all Neuron and EFA devices with allocationMode: All.
rct-2-efas-4-neurons-wrong-match.yaml — Claims 4 Neurons + 2 EFAs with an intentionally wrong matchAttribute
constraint.

Shared helpers (test/common/dra.go)

DeployDranet() — Renders the dranet manifest template, applies it, and waits for the DaemonSet to be ready.
DeployMPIOperator() — Applies the MPI operator manifest and waits for the Deployment to become available.

Infrastructure

test/manifests/assets/dranet.yaml — Full dranet DaemonSet manifest (ClusterRole, ServiceAccount, DaemonSet) with a
templated image field.
test/manifests/assets/mpi-operator.yaml — Updated MPI operator manifest consumed by DeployMPIOperator().
test/manifests/raw.go — Registers the new dranet manifest embed.
Dockerfile — Adds Helm v3.17.3 to the test image (required for Neuron DRA driver installation).

How it works

The test harness deploys the MPI operator, dranet (for EFA-based families), RCTs, and the Neuron DRA driver.
Test cases are YAML files that reference RCTs by name. The framework resolves each RCT to compute neuron core counts,
slots-per-worker, and total MPI ranks.
An MPIJob is rendered from the nccom-test-mpijob.yaml.tmpl Go template and applied to the cluster.
Positive tests wait for the MPIJob to succeed; negative tests assert that worker pods remain Pending.
Teardown uninstalls the Helm release and deletes all applied manifests in reverse order.
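
The parameter computation in the second step could look roughly like the sketch below. It assumes one MPI rank per Neuron core, so that slots-per-worker is the number of claimed Neuron devices times the cores per device. The claimSummary and mpiParams types and the computeMPIParams function are hypothetical names for illustration; the actual logic lives in topology.go and may differ.

```go
package main

import "fmt"

// claimSummary is a hypothetical, already-parsed view of one
// ResourceClaimTemplate: how many Neuron devices it claims per worker.
type claimSummary struct {
	NeuronDevices int
}

// mpiParams carries the values an MPIJob needs.
type mpiParams struct {
	SlotsPerWorker int
	TotalRanks     int
}

// computeMPIParams assumes one MPI rank per Neuron core.
func computeMPIParams(c claimSummary, coresPerDevice, workers int) (mpiParams, error) {
	if c.NeuronDevices <= 0 || coresPerDevice <= 0 || workers <= 0 {
		return mpiParams{}, fmt.Errorf("invalid input: devices=%d coresPerDevice=%d workers=%d",
			c.NeuronDevices, coresPerDevice, workers)
	}
	slots := c.NeuronDevices * coresPerDevice
	return mpiParams{SlotsPerWorker: slots, TotalRanks: slots * workers}, nil
}

func main() {
	// Illustrative numbers: 16 devices per worker, 2 cores per device, 2 workers.
	p, err := computeMPIParams(claimSummary{NeuronDevices: 16}, 2, 2)
	if err != nil {
		panic(err)
	}
	fmt.Printf("slotsPerWorker=%d totalRanks=%d\n", p.SlotsPerWorker, p.TotalRanks)
	// prints "slotsPerWorker=32 totalRanks=64"
}
```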

Adding new test cases

To add a test for a new device/EFA combination:

Add a ResourceClaimTemplate YAML under rcts//.
Add a test case YAML under testcases// referencing the RCT name. Set expectFailure: true for negative tests.

To add support for a new instance family, add an entry to instanceTopologies in topology.go.
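
For orientation, a registry of this kind might be shaped like the sketch below. The instanceTopologies name comes from the PR; the struct fields, the rdmaType constants, the topologyFor helper, and the concrete values are assumptions for illustration and should be checked against topology.go and the hardware documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// rdmaType names the RDMA fabric of an instance family (hypothetical type).
type rdmaType string

const rdmaEFA rdmaType = "efa"

// instanceTopology is a hypothetical shape for one registry entry.
type instanceTopology struct {
	NeuronCoresPerDevice int
	RDMA                 rdmaType
	TestCaseDir          string
}

// instanceTopologies maps an instance family to its topology.
// The values here are illustrative placeholders.
var instanceTopologies = map[string]instanceTopology{
	"trn1": {NeuronCoresPerDevice: 2, RDMA: rdmaEFA, TestCaseDir: "testcases/trn1"},
}

// topologyFor derives the family from an instance type such as
// "trn1.32xlarge" and looks it up in the registry.
func topologyFor(nodeType string) (instanceTopology, error) {
	family, _, found := strings.Cut(nodeType, ".")
	if !found {
		return instanceTopology{}, fmt.Errorf("malformed instance type %q", nodeType)
	}
	topo, ok := instanceTopologies[family]
	if !ok {
		return instanceTopology{}, fmt.Errorf("unsupported instance family %q", family)
	}
	return topo, nil
}

func main() {
	topo, err := topologyFor("trn1.32xlarge")
	if err != nil {
		panic(err)
	}
	fmt.Println(topo.TestCaseDir, topo.RDMA)
}
```

With a map-based registry, supporting another family is a one-entry change plus the matching rcts and testcases directories.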

Testing:
Tested the neuron-dra.test on my cluster and verified that the tests are running fine.

The test can be invoked as follows:

./neuron-dra.test --test.timeout=60m \
  --test.v \
  -rdmaDeviceDraDriverImage=<dranet-image-uri> \
  -acceleratorDraDriverImage=<neuron-dra-image-uri> \
  -containerTestImage=<container-image-uri> \
  -nodeType=<instance-type>

The acceleratorDraDriverImage flag is optional. If it is not provided, the driver is installed with the image specified in the Helm chart.
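
A minimal sketch of how these flags might be registered and the optional one handled. The flag names come from the command above; the testConfig type, the parseFlags and useChartImage helpers, and the rule that -nodeType is required are assumptions for illustration, not the PR's code.

```go
package main

import (
	"flag"
	"fmt"
)

// testConfig collects the flag values (hypothetical type).
type testConfig struct {
	nodeType                  string
	rdmaDeviceDraDriverImage  string
	acceleratorDraDriverImage string
	containerTestImage        string
}

// parseFlags registers the flags on a dedicated FlagSet and parses args.
func parseFlags(args []string) (testConfig, error) {
	var c testConfig
	fs := flag.NewFlagSet("neuron-dra", flag.ContinueOnError)
	fs.StringVar(&c.nodeType, "nodeType", "", "instance type under test")
	fs.StringVar(&c.rdmaDeviceDraDriverImage, "rdmaDeviceDraDriverImage", "", "dranet image URI")
	fs.StringVar(&c.acceleratorDraDriverImage, "acceleratorDraDriverImage", "", "Neuron DRA driver image URI (optional)")
	fs.StringVar(&c.containerTestImage, "containerTestImage", "", "test container image URI")
	if err := fs.Parse(args); err != nil {
		return testConfig{}, err
	}
	// Assumption: the node type is always needed to pick a topology.
	if c.nodeType == "" {
		return testConfig{}, fmt.Errorf("-nodeType is required")
	}
	return c, nil
}

// useChartImage reports whether the Helm chart's default image should be used.
func (c testConfig) useChartImage() bool {
	return c.acceleratorDraDriverImage == ""
}

func main() {
	c, err := parseFlags([]string{"-nodeType=trn1.32xlarge", "-containerTestImage=example.com/test:latest"})
	if err != nil {
		panic(err)
	}
	fmt.Println("use chart image:", c.useChartImage())
	// prints "use chart image: true"
}
```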

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

shvbsle

Good PR. Almost there!

shvbsle · 2026-04-14T16:43:42Z

+      template:
+        metadata:
+          annotations:
+            sidecar.istio.io/inject: "false"


Is this left over from local testing?

Will remove it.

shvbsle · 2026-04-14T16:44:02Z

+      template:
+        metadata:
+          annotations:
+            sidecar.istio.io/inject: "false"


same here. Why is this annotation needed?

Will remove it.

shvbsle · 2026-04-14T16:55:38Z

+	nodeType                   *string
+	rdmaDeviceDraDriverImage   *string
+	acceleratorDraDriverImage  *string
+	containerTestImage            *string


Can you run your code through gofmt so that block uses column-aligned types?

shvbsle · 2026-04-14T17:17:49Z

+	}
+
+	// Install Neuron DRA driver via Helm chart.
+	setUpFunctions = append(setUpFunctions,


Nit: These setup functions will execute sequentially (blocking calls) but a few of the steps don't have a dependency on each other (setting up mpi operator, deploying dranet and helm install of neuron dra driver) so you could also kick them off concurrently. I'll let you take the call here.

Good idea. I will change this to run concurrently.

shvbsle · 2026-04-14T17:25:13Z

+			continue
+		}
+		ext := filepath.Ext(entry.Name())
+		if ext != ".yaml" && ext != ".yml" {


Nit: this check could be extracted into a common function and reused in loadRCTIndex and loadRCTManifests:

func isYAMLFile(name string) bool {
	ext := filepath.Ext(name)
	return ext == ".yaml" || ext == ".yml"
}

Also, loadRCTIndex and loadRCTManifests seem to overlap in the work they do. It might be worth having a single function to avoid too much code duplication.

loadRCTIndex and loadRCTManifests have fundamentally different responsibilities.
The test runs as follows:

We apply all RCT manifests for a given instance before the tests begin. loadRCTManifests returns the list of these manifests.

As part of each test, an RCT is added as a resource to the MPIJob's worker container. Which RCT to add is defined in the testcases folder. To calculate how many Neurons the test requires, we need a map for each RCT, which loadRCTIndex provides.

Keeping these two functions separate makes the code easier to read than merging them.

I made the change to introduce the isYAMLFile function.

makes sense to me. Thanks!

shvbsle · 2026-04-14T17:26:57Z

+
+func deployNeuronDRADriver(ctx context.Context, config *envconf.Config) (context.Context, error) {
+	ds := appsv1.DaemonSet{
+		ObjectMeta: metav1.ObjectMeta{Name: "neuron-dra-driver-kubelet-plugin", Namespace: "neuron-dra-driver"},


We already have a const neuronDRANamespace above for this, so we can remove the hardcoded namespace.

shvbsle

lgtm

junpengdev · 2026-04-15T04:28:40Z

+// loadRCTManifests reads all RCT YAML files for the given instance family from
+// the embedded filesystem and returns them as raw byte slices suitable for
+// fwext.ApplyManifests.
+func loadRCTManifests(family string) ([][]byte, error) {


Shouldn't the parameter here be named nodeType instead of family?

You are right. It is the nodeType that is passed and hence functionally it is working but the var name is confusing.
Will change that.

junpengdev · 2026-04-15T04:35:05Z

+spec:
+  slotsPerWorker: {{.SlotsPerWorker}}
+  runPolicy:
+    backoffLimit: 20


Is this too high?

I took this value from the Neuron job manifest.

The launcher pod starts before the worker pods' DNS records are ready, which causes it to crash and restart in a loop.
When the launcher comes up, it tries to SSH into the workers via the headless service DNS names (multi-node-nccl-test-worker-{0,1}.multi-node-nccl-test.default.svc). If the workers are not running yet, they have not registered their DNS entries. The launcher pod then restarts via CrashLoopBackOff until the DNS entries propagate, at which point it succeeds.

I have seen the launcher pod restart 2-3 times when the node count is 2. I expect this number to change with a higher node count.

Let me know if you want to change this value to something lower.

nakshah87 force-pushed the neuron-dra branch 3 times, most recently from be4bc62 to b8a8d5e Compare April 13, 2026 04:57

shvbsle reviewed Apr 14, 2026

nakshah87 force-pushed the neuron-dra branch from b8a8d5e to 33d2a0d Compare April 14, 2026 19:08

shvbsle requested review from junpengdev and mselim00 April 14, 2026 21:23

shvbsle approved these changes Apr 14, 2026

junpengdev reviewed Apr 15, 2026

Comment thread test/cases/neuron-dra/topology.go Outdated

junpengdev reviewed Apr 15, 2026

Add neuron-dra e2e test cases with DRA driver and dranet manifests

7b403f7

nakshah87 force-pushed the neuron-dra branch from 33d2a0d to 7b403f7 Compare April 15, 2026 21:13

aws deleted a comment from junpengdev Apr 15, 2026

aws deleted a comment from nakshah87 Apr 15, 2026

shvbsle merged commit 3c57d64 into aws:main Apr 15, 2026
4 checks passed

3 participants
