Add neuron-dra e2e test cases with DRA driver and dranet manifests#782
Add neuron-dra e2e test cases with DRA driver and dranet manifests#782
Conversation
be4bc62 to
b8a8d5e
Compare
| template: | ||
| metadata: | ||
| annotations: | ||
| sidecar.istio.io/inject: "false" |
There was a problem hiding this comment.
Is this left over from local testing?
| template: | ||
| metadata: | ||
| annotations: | ||
| sidecar.istio.io/inject: "false" |
There was a problem hiding this comment.
same here. Why is this annotation needed?
| nodeType *string | ||
| rdmaDeviceDraDriverImage *string | ||
| acceleratorDraDriverImage *string | ||
| containerTestImage *string |
There was a problem hiding this comment.
Can you run your code through gofmt so that block uses column-aligned types?
| } | ||
|
|
||
| // Install Neuron DRA driver via Helm chart. | ||
| setUpFunctions = append(setUpFunctions, |
There was a problem hiding this comment.
Nit: These setup functions will execute sequentially (blocking calls) but a few of the steps don't have a dependency on each other (setting up mpi operator, deploying dranet and helm install of neuron dra driver) so you could also kick them off concurrently. I'll let you take the call here.
There was a problem hiding this comment.
Good idea. I will change this to run concurrently.
| continue | ||
| } | ||
| ext := filepath.Ext(entry.Name()) | ||
| if ext != ".yaml" && ext != ".yml" { |
There was a problem hiding this comment.
Nit: could do extract into a common function and reuse in loadRCTIndex and loadRCTManifests
func isYAMLFile(name string) bool {
ext := filepath.Ext(name)
return ext == ".yaml" || ext == ".yml"
}Also loadRCTIndex and loadRCTManifests seems to have an overlap in the work that they do. Might be worth thinking of having a single function to avoid too much code duplication.
There was a problem hiding this comment.
loadRCTIndex and loadRCTManifests fundamentally have different functionalities.
The way the test runs is:
- we apply all RCT manifests for a given instance before the test begins. loadRCTManifests returns a list of all these manifests.
- As part of the each test, a RCT is added as a resource to the MPIJob's worker container. Which resource needs to be added is defined in testcases folder. To calculate how many neurons are required for the test, we require a map for each RCT which loadRCTIndex provides.
Separation of these 2 functionalities will be easy to read the code instead of merging them.
Made change to introduce isYAMLFile function.
There was a problem hiding this comment.
makes sense to me. Thanks!
|
|
||
| func deployNeuronDRADriver(ctx context.Context, config *envconf.Config) (context.Context, error) { | ||
| ds := appsv1.DaemonSet{ | ||
| ObjectMeta: metav1.ObjectMeta{Name: "neuron-dra-driver-kubelet-plugin", Namespace: "neuron-dra-driver"}, |
There was a problem hiding this comment.
We already have a const neuronDRANamespace above for this. so we can remove the hardcoding
| // loadRCTManifests reads all RCT YAML files for the given instance family from | ||
| // the embedded filesystem and returns them as raw byte slices suitable for | ||
| // fwext.ApplyManifests. | ||
| func loadRCTManifests(family string) ([][]byte, error) { |
There was a problem hiding this comment.
Shouldn't the parameter here be named nodeType instead of family?
There was a problem hiding this comment.
You are right. It is the nodeType that is passed and hence functionally it is working but the var name is confusing.
Will change that.
| spec: | ||
| slotsPerWorker: {{.SlotsPerWorker}} | ||
| runPolicy: | ||
| backoffLimit: 20 |
There was a problem hiding this comment.
I took reference from the neuron job manifest
Basically the launcher pod starts before the worker pods' DNS records are ready, causing it to crash and restart in a loop.
The launcher tries to SSH into the workers via the headless service DNS when it comes up(multi-node-nccl-test-worker-{0,1}.multi-node-nccl-test.default.svc), and if the workers are not running yet then they haven't registered their DNS entries yet. This causes the launcher pod to restart via CrashLoopBackOff, and eventually the DNS propagates and it succeeds.
I have seen the launcher pod restarting 2-3 times when the node count is 2. I think this would change with a higher node count.
Let me know if you want to change this value to something lower.
Issue #, if available:
Add Neuron DRA e2e test suite
This PR adds end-to-end tests for the Neuron DRA (Dynamic Resource Allocation) driver, validating that Neuron accelerator
and EFA RDMA devices are correctly allocated to multi-node MPI workloads via the Kubernetes DRA framework.
What's included
Test framework (
test/cases/neuron-dra/)ResourceClaimTemplates, and the Neuron DRA driver (installed via Helm). Dynamically builds the manifest and setup
function list based on the instance family's RDMA type.
parameters from ResourceClaimTemplate specs, and runs both positive (MPIJob succeeds) and negative (workers remain
Pending) assertions.
types, and test case directories. Also contains all parsing, parameter computation, and MPIJob template rendering logic.
MPIJob template (
templates/nccom-test-mpijob.yaml.tmpl)image, and resource claims. The launcher runs nccom-test across all workers; workers expose SSH and configure RDMA
networking.
Test cases (
testcases/trn1/)unschedulable.
ResourceClaimTemplates (
rcts/trn1/)rct-all-efas-all-neurons.yaml — Claims all Neuron and EFA devices with allocationMode: All.
rct-2-efas-4-neurons-wrong-match.yaml — Claims 4 Neurons + 2 EFAs with an intentionally wrong matchAttribute
constraint.
Shared helpers (
test/common/dra.go)DeployDranet() — Renders the dranet manifest template, applies it, and waits for the DaemonSet to be ready.
DeployMPIOperator() — Applies the MPI operator manifest and waits for the Deployment to become available.
Infrastructure
templated image field.
How it works
slots-per-worker, and total MPI ranks.
Adding new test cases
To add a test for a new device/EFA combination:
To add support for a new instance family, add an entry to instanceTopologies in topology.go.
Testing:
Tested the neuron-dra.test on my cluster and verified that the tests are running fine.
The test can be invoked as follows:
The acceleratorDraDriverImage flag is optional. If not provided, it installs using the image present in the helm chart.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.