Add support to emit metric to the target AMP by weicongw · Pull Request #486 · aws/aws-k8s-tester

weicongw · 2024-09-24T22:01:32Z

Add support to emit metric to the target Amazon Managed Service for Prometheus workspace
Issue #, if available:

Description of changes:

Add support to emit metric to the target Amazon Managed Service for Prometheus workspace
The test supports emitting metrics across accounts and across regions
If the AMP URL is not set, the test will not emit metrics
Emit the NCCL test average bus bandwidth metric
Add metadata labels to the metric
Add/update the README

Test

go test -timeout 60m -v . -args -nvidiaTestImage public.ecr.aws/o5d5x8n6/weicongw:nvidia --efaEnabled=true --feature=multi-node --ampMetricUrl=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-9f8fe538-f707-46e7-863c-26bfb192dc52/api/v1/remote_write --ampMetricRoleArn=arn:aws:iam::665181186642:role/amp
...
        [1,0]<stdout>:# Out of bounds values : 0 OK
        [1,0]<stdout>:# Avg bus bandwidth    : 3.68456 
        [1,0]<stdout>:#
        [1,0]<stdout>:
        
    mpi_test.go:145: Emitting nccl test metrics to AMP
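
For readers following along, here is a minimal sketch of how the `Avg bus bandwidth` value could be pulled out of the launcher logs before it is emitted. It is not the code in this PR; the function name `parseAvgBusBandwidth` and the regular expression are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// avgBusBandwidthRe matches the summary line shown in the log output above, e.g.
// "[1,0]<stdout>:# Avg bus bandwidth    : 3.68456".
var avgBusBandwidthRe = regexp.MustCompile(`Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)`)

// parseAvgBusBandwidth (hypothetical helper) returns the first average bus
// bandwidth value found in the given log text.
func parseAvgBusBandwidth(logs string) (float64, error) {
	m := avgBusBandwidthRe.FindStringSubmatch(logs)
	if m == nil {
		return 0, fmt.Errorf("no \"Avg bus bandwidth\" line found in logs")
	}
	return strconv.ParseFloat(m[1], 64)
}

func main() {
	logs := "[1,0]<stdout>:# Out of bounds values : 0 OK\n" +
		"[1,0]<stdout>:# Avg bus bandwidth    : 3.68456 \n"
	v, err := parseAvgBusBandwidth(logs)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.5f\n", v)
}
```

Returning an error when the line is missing lets the caller skip emission instead of pushing a zero value.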

Query the metric from AMP

export AMP_QUERY_ENDPOINT=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-9f8fe538-f707-46e7-863c-26bfb192dc52/api/v1/query

awscurl -X POST --region us-west-2 \
--service aps "${AMP_QUERY_ENDPOINT}" \
-d 'query=nccl_average_bandwidth_gbps[60m]' \
--header 'Content-Type: application/x-www-form-urlencoded'

{"status":"success","data":{"resultType":"matrix","result":[{"metric":
{"__name__":"nccl_average_bandwidth_gbps","ami_id":"ami-0cd7612ff47454cd6",
"aws_ofi_nccl_version":"1.9.1","efa_count":"1","efa_enabled":"true",
"efa_installer_version":"1.34.0","instance_type":"p4de.24xlarge",
"kubernetes_version":"1.30+","nccl_version":"2.18.5","node_count":"2",
"nvidia_driver_version":"550.90.07","os_type":"Amazon Linux 2"},
"values":[[1726791286.534,"3.62432"],[1726794564.87,"3.68456"]]}]}}
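
The response above can also be checked programmatically. The following is a rough sketch, assuming the standard Prometheus matrix response shape shown above; the type `queryResponse` and the helper `latestSample` are hypothetical names, and the sample body is an abbreviated copy of the response with only a few labels kept.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// queryResponse mirrors only the fields of the query response used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			// Each entry is [unix timestamp (number), sample value (string)].
			Values [][2]interface{} `json:"values"`
		} `json:"result"`
	} `json:"data"`
}

// latestSample (hypothetical helper) returns the most recent value of the
// first series in the response, together with that series' labels.
func latestSample(body []byte) (float64, map[string]string, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, nil, err
	}
	if resp.Status != "success" || len(resp.Data.Result) == 0 {
		return 0, nil, fmt.Errorf("query returned status %q with %d series", resp.Status, len(resp.Data.Result))
	}
	series := resp.Data.Result[0]
	if len(series.Values) == 0 {
		return 0, nil, fmt.Errorf("series has no samples")
	}
	last := series.Values[len(series.Values)-1]
	s, ok := last[1].(string)
	if !ok {
		return 0, nil, fmt.Errorf("unexpected sample value type %T", last[1])
	}
	v, err := strconv.ParseFloat(s, 64)
	return v, series.Metric, err
}

func main() {
	body := []byte(`{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"__name__":"nccl_average_bandwidth_gbps","instance_type":"p4de.24xlarge","node_count":"2"},"values":[[1726791286.534,"3.62432"],[1726794564.87,"3.68456"]]}]}}`)
	v, labels, err := latestSample(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s on %s: %.5f\n", labels["__name__"], labels["instance_type"], v)
}
```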

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

cartermckinnon · 2024-09-24T23:13:45Z

+}
+
+// PushMetricsToAMP pushes metric data to AWS Managed Prometheus (AMP) using SigV4 authentication
+func (m *MetricManager) PushMetricsToAMP(name string, help string, value float64) error {


batching the samples would be preferable to making a separate call to the remote_write API for every sample we collect, IMO

are you not able to use the upstream remote write client because of the assume-role jump? https://github.com/prometheus/prometheus/blob/5037cf75f2d4f1671ad365ba1e99902fc36808d5/storage/remote/client.go#L180

For the first point, that sounds good; I'll change it in the next revision. As for the second point, I spent some time trying to use the remote write client, but I wasn't able to integrate it into my code.
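
To make the batching suggestion concrete, here is a minimal sketch of one possible shape for it. It is not the revision that was later pushed: `Sample`, `Batcher`, and the `send` callback are hypothetical names, and `send` stands in for whatever would build and sign a single remote_write request. Each sample carries its own labels, so the dimensions can be supplied together with the value.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Sample is one metric observation together with the labels that describe it.
type Sample struct {
	Name      string
	Value     float64
	Labels    map[string]string
	Timestamp time.Time
}

// Batcher buffers samples and hands them to send in a single call on Flush.
type Batcher struct {
	mu      sync.Mutex
	pending []Sample
	send    func([]Sample) error // assumption: wraps one remote_write request
}

func NewBatcher(send func([]Sample) error) *Batcher {
	return &Batcher{send: send}
}

// Add queues a sample without making any network call.
func (b *Batcher) Add(s Sample) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, s)
}

// Flush sends all buffered samples in one call. If send fails, the samples
// stay buffered so that a later Flush can retry them.
func (b *Batcher) Flush() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.pending) == 0 {
		return nil
	}
	if err := b.send(b.pending); err != nil {
		return fmt.Errorf("flushing %d samples: %w", len(b.pending), err)
	}
	b.pending = nil
	return nil
}

func main() {
	calls := 0
	b := NewBatcher(func(batch []Sample) error {
		calls++
		fmt.Printf("call %d: sending %d samples\n", calls, len(batch))
		return nil
	})
	labels := map[string]string{"instance_type": "p4de.24xlarge", "node_count": "2"}
	b.Add(Sample{Name: "nccl_average_bandwidth_gbps", Value: 3.62432, Labels: labels, Timestamp: time.Now()})
	b.Add(Sample{Name: "nccl_average_bandwidth_gbps", Value: 3.68456, Labels: labels, Timestamp: time.Now()})
	if err := b.Flush(); err != nil {
		panic(err)
	}
}
```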

cartermckinnon · 2024-09-24T23:28:02Z

+		return nil, fmt.Errorf("no nodes found in the cluster")
+	}
+
+	// Get instance type and metadata from the first node


the test case shouldn't really assume that all the nodes in the cluster are the same across all these dimensions; can you pass in the dimensions with your sample, instead of fetching them ahead of time? Then you'd be able to pass dimensions that you know match the sample

Updated in the latest revision

cartermckinnon · 2024-09-24T23:32:20Z

+CMD echo "EFA Installer Version: $EFA_INSTALLER_VERSION" && \
+    echo "NCCL Version: $NCCL_VERSION" && \
+    echo "AWS OFI NCCL Version: $AWS_OFI_NCCL_VERSION" && \
+    printf "NVIDIA Driver Version: " && \
+    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1


I think this would be better suited for an ENTRYPOINT script that logged this info and then ran whatever CMD was used

Updated in the latest revision

cartermckinnon · 2024-09-24T23:34:02Z

+		"os_type":            osType,
+	}
+
+	// Create a job to fetch the logs of meta info


seems like you could just log these details in your actual test run instead of using a separate pod to print them

I couldn't. I tried adding an ENTRYPOINT to my Dockerfile, but the NCCL test pods don't print these details.

Wondering the same. The launcher pods or worker pods should have these details, right? They also run the entrypoint script.

cartermckinnon · 2024-09-24T23:36:28Z

+### Enter the Kubetest2 Container
+
+```bash
+docker run --name kubetest2 -d -i -t kubetest2 /bin/sh
+docker exec -it kubetest2 sh


I would just build the deployer and e2e-nvidia binary locally, would be simpler + faster during dev

Isn't Kubetest2 the deployer?

Issacwww · 2024-09-25T04:50:22Z

+	job := &batchv1.Job{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      "metadata-job",
+			Namespace: "default",
+		},
+		Spec: batchv1.JobSpec{
+			Template: v1.PodTemplateSpec{
+				Spec: v1.PodSpec{
+					RestartPolicy: v1.RestartPolicyNever,
+					Containers: []v1.Container{
+						{
+							Name:            "metadata-job",
+							Image:           *nvidiaTestImage,
+							ImagePullPolicy: v1.PullAlways,
+							Resources: v1.ResourceRequirements{
+								Limits: v1.ResourceList{
+									"nvidia.com/gpu":        node.Status.Capacity["nvidia.com/gpu"],
+									"vpc.amazonaws.com/efa": node.Status.Capacity["vpc.amazonaws.com/efa"],
+								},
+							},
+						},
+					},
+				},
+			},
+		},
+	}


I think we can use a template here to reduce the function size

Updated in new rev

Pavani-Panakanti · 2024-11-13T20:19:12Z

+		ObjectMeta: metav1.ObjectMeta{Name: "metadata-job", Namespace: "default"},
+	}
+	err = wait.For(fwext.NewConditionExtension(cfg.Client().Resources()).JobSucceeded(job),
+		wait.WithContext(ctx))


Can we add some comments about the purpose of this job and what it is running?

Pavani-Panakanti · 2024-11-13T20:23:40Z

-ARG EFA_INSTALLER_VERSION=latest
+# Add ENV to make ARG values available at runtime
+ARG EFA_INSTALLER_VERSION=1.34.0
+ARG NCCL_VERSION=2.18.5


Why are we using an older version of NCCL here? The general recommendation is to use one of the last two releases (preferably n-1, as the latest might have issues).

Pavani-Panakanti · 2024-11-13T20:27:42Z

+)
+
+type MetricManager struct {
+	// Metadata              map[string]string


Are we planning to use this field later?

Pavani-Panakanti · 2024-11-13T20:28:49Z

@weicongw Can we also rebase the PR? Thanks

weicongw marked this pull request as ready for review September 24, 2024 22:01

Issacwww reviewed Sep 24, 2024

Comment thread e2e2/internal/utils/aws.go Outdated

cartermckinnon reviewed Sep 24, 2024

weicongw force-pushed the new-unit branch from 55736b1 to c29e969 Compare September 25, 2024 01:31

Issacwww reviewed Sep 25, 2024

Add support to emit metric to the target Amazon Managed Service for P…

99ad368

…rometheus workspace

weicongw force-pushed the new-unit branch from c29e969 to 99ad368 Compare September 25, 2024 21:15

Pavani-Panakanti reviewed Nov 13, 2024

4 participants
