Skip to content

Add neuron-dra e2e test cases with DRA driver and dranet manifests#782

Merged
shvbsle merged 1 commit intoaws:mainfrom
nakshah87:neuron-dra
Apr 15, 2026
Merged

Add neuron-dra e2e test cases with DRA driver and dranet manifests#782
shvbsle merged 1 commit intoaws:mainfrom
nakshah87:neuron-dra

Conversation

@nakshah87
Copy link
Copy Markdown
Contributor

@nakshah87 nakshah87 commented Apr 9, 2026

Issue #, if available:

Add Neuron DRA e2e test suite

This PR adds end-to-end tests for the Neuron DRA (Dynamic Resource Allocation) driver, validating that Neuron accelerator
and EFA RDMA devices are correctly allocated to multi-node MPI workloads via the Kubernetes DRA framework.

What's included

Test framework (test/cases/neuron-dra/)

  • main_test.go — Test harness that orchestrates setup/teardown of all dependencies: MPI operator, dranet DaemonSet,
    ResourceClaimTemplates, and the Neuron DRA driver (installed via Helm). Dynamically builds the manifest and setup
    function list based on the instance family's RDMA type.
  • neuron_dra_test.go — Data-driven test runner that discovers test cases from embedded YAML files, computes MPIJob
    parameters from ResourceClaimTemplate specs, and runs both positive (MPIJob succeeds) and negative (workers remain
    Pending) assertions.
  • topology.go — Instance topology registry mapping instance families (trn1, trn2) to their Neuron core counts, RDMA
    types, and test case directories. Also contains all parsing, parameter computation, and MPIJob template rendering logic.

MPIJob template (templates/nccom-test-mpijob.yaml.tmpl)

  • Go template for rendering MPIJob manifests. Parameterized on slots-per-worker, total ranks, worker replicas, container
    image, and resource claims. The launcher runs nccom-test across all workers; workers expose SSH and configure RDMA
    networking.

Test cases (testcases/trn1/)

  • all-efas-all-neurons.yaml — Positive test: allocate all Neuron devices and all EFA interfaces per node.
  • 2-efas-4-neurons-wrong-match.yaml — Negative test: request a mismatched device group constraint that should be
    unschedulable.

ResourceClaimTemplates (rcts/trn1/)

  • rct-all-efas-all-neurons.yaml — Claims all Neuron and EFA devices with allocationMode: All.

  • rct-2-efas-4-neurons-wrong-match.yaml — Claims 4 Neurons + 2 EFAs with an intentionally wrong matchAttribute
    constraint.
    Shared helpers (test/common/dra.go)

  • DeployDranet() — Renders the dranet manifest template, applies it, and waits for the DaemonSet to be ready.

  • DeployMPIOperator() — Applies the MPI operator manifest and waits for the Deployment to become available.

Infrastructure

  • test/manifests/assets/dranet.yaml — Full dranet DaemonSet manifest (ClusterRole, ServiceAccount, DaemonSet) with a
    templated image field.
  • test/manifests/assets/mpi-operator.yaml — Updated MPI operator manifest consumed by DeployMPIOperator().
  • test/manifests/raw.go — Registers the new dranet manifest embed.
  • Dockerfile — Adds Helm v3.17.3 to the test image (required for Neuron DRA driver installation).

How it works

  1. The test harness deploys the MPI operator, dranet (for EFA-based families), RCTs, and the Neuron DRA driver.
  2. Test cases are YAML files that reference RCTs by name. The framework resolves each RCT to compute neuron core counts,
    slots-per-worker, and total MPI ranks.
  3. An MPIJob is rendered from the nccom-test-mpijob.yaml.tmpl Go template and applied to the cluster.
  4. Positive tests wait for the MPIJob to succeed; negative tests assert that worker pods remain Pending.
  5. Teardown uninstalls the Helm release and deletes all applied manifests in reverse order.

Adding new test cases

To add a test for a new device/EFA combination:

  1. Add a ResourceClaimTemplate YAML under rcts//.
  2. Add a test case YAML under testcases// referencing the RCT name. Set expectFailure: true for negative tests.

To add support for a new instance family, add an entry to instanceTopologies in topology.go.

Testing:
Tested the neuron-dra.test on my cluster and verified that the tests are running fine.

The test can be invoked as follows:

./neuron-dra.test --test.timeout=60m \         
  --test.v \                  
  -rdmaDeviceDraDriverImage=<dranet-image-uri> \
  -acceleratorDraDriverImage=<neuron-dra-image-uri> \
  -containerTestImage=<container-image-uri> \
  -nodeType=<instance-type>

The acceleratorDraDriverImage flag is optional. If not provided, it installs using the image present in the helm chart.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@nakshah87 nakshah87 force-pushed the neuron-dra branch 3 times, most recently from be4bc62 to b8a8d5e Compare April 13, 2026 04:57
Copy link
Copy Markdown
Contributor

@shvbsle shvbsle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good PR. Almost there!

template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this left over from local testing?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove it.

template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here. Why is this annotation needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove it.

Comment thread test/cases/neuron-dra/templates/nccom-test-mpijob.yaml.tmpl
Comment thread test/cases/neuron-dra/main_test.go Outdated
nodeType *string
rdmaDeviceDraDriverImage *string
acceleratorDraDriverImage *string
containerTestImage *string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run your code through gofmt so that block uses column-aligned types?

Comment thread test/cases/neuron-dra/main_test.go Outdated
}

// Install Neuron DRA driver via Helm chart.
setUpFunctions = append(setUpFunctions,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: These setup functions will execute sequentially (blocking calls) but a few of the steps don't have a dependency on each other (setting up mpi operator, deploying dranet and helm install of neuron dra driver) so you could also kick them off concurrently. I'll let you take the call here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I will change this to run concurrently.

Comment thread test/cases/neuron-dra/topology.go Outdated
continue
}
ext := filepath.Ext(entry.Name())
if ext != ".yaml" && ext != ".yml" {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could do extract into a common function and reuse in loadRCTIndex and loadRCTManifests

func isYAMLFile(name string) bool {
	ext := filepath.Ext(name)
	return ext == ".yaml" || ext == ".yml"
}

Also loadRCTIndex and loadRCTManifests seems to have an overlap in the work that they do. Might be worth thinking of having a single function to avoid too much code duplication.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadRCTIndex and loadRCTManifests fundamentally have different functionalities.
The way the test runs is:

  • we apply all RCT manifests for a given instance before the test begins. loadRCTManifests returns a list of all these manifests.
  • As part of the each test, a RCT is added as a resource to the MPIJob's worker container. Which resource needs to be added is defined in testcases folder. To calculate how many neurons are required for the test, we require a map for each RCT which loadRCTIndex provides.

Separation of these 2 functionalities will be easy to read the code instead of merging them.

Made change to introduce isYAMLFile function.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me. Thanks!

Comment thread test/cases/neuron-dra/main_test.go Outdated

func deployNeuronDRADriver(ctx context.Context, config *envconf.Config) (context.Context, error) {
ds := appsv1.DaemonSet{
ObjectMeta: metav1.ObjectMeta{Name: "neuron-dra-driver-kubelet-plugin", Namespace: "neuron-dra-driver"},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a const neuronDRANamespace above for this. so we can remove the hardcoding

Comment thread test/cases/neuron-dra/main_test.go Outdated
Comment thread Dockerfile Outdated
Copy link
Copy Markdown
Contributor

@shvbsle shvbsle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment thread test/cases/neuron-dra/topology.go Outdated
Comment thread test/cases/neuron-dra/main_test.go Outdated
// loadRCTManifests reads all RCT YAML files for the given instance family from
// the embedded filesystem and returns them as raw byte slices suitable for
// fwext.ApplyManifests.
func loadRCTManifests(family string) ([][]byte, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the parameter here be named nodeType instead of family?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. It is the nodeType that is passed and hence functionally it is working but the var name is confusing.
Will change that.

spec:
slotsPerWorker: {{.SlotsPerWorker}}
runPolicy:
backoffLimit: 20
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this too high?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took reference from the neuron job manifest

Basically the launcher pod starts before the worker pods' DNS records are ready, causing it to crash and restart in a loop.
The launcher tries to SSH into the workers via the headless service DNS when it comes up(multi-node-nccl-test-worker-{0,1}.multi-node-nccl-test.default.svc), and if the workers are not running yet then they haven't registered their DNS entries yet. This causes the launcher pod to restart via CrashLoopBackOff, and eventually the DNS propagates and it succeeds.

I have seen the launcher pod restarting 2-3 times when the node count is 2. I think this would change with a higher node count.

Let me know if you want to change this value to something lower.

@aws aws deleted a comment from junpengdev Apr 15, 2026
@aws aws deleted a comment from nakshah87 Apr 15, 2026
@shvbsle shvbsle merged commit 3c57d64 into aws:main Apr 15, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants