Add support to emit metric to the target AMP#486
Conversation
| } | ||
|
|
||
| // PushMetricsToAMP pushes metric data to AWS Managed Prometheus (AMP) using SigV4 authentication | ||
| func (m *MetricManager) PushMetricsToAMP(name string, help string, value float64) error { |
There was a problem hiding this comment.
-
batching the samples would be preferable to making a separate call to the remote_write API for every sample we collect, IMO
-
are you not able to use the upstream remote write client because of the assume-role jump? https://github.com/prometheus/prometheus/blob/5037cf75f2d4f1671ad365ba1e99902fc36808d5/storage/remote/client.go#L180
There was a problem hiding this comment.
For the first point, that sounds good—I’ll change it in the next revision. As for the second point, I spent some time trying to use the remote write client, but I wasn’t able to integrate it into my code.
| return nil, fmt.Errorf("no nodes found in the cluster") | ||
| } | ||
|
|
||
| // Get instance type and metadata from the first node |
There was a problem hiding this comment.
the test case shouldn't really assume that all the nodes in the cluster are the same across all these dimensions; can you pass in the dimensions with your sample, instead of fetching them ahead of time? Then you'd be able to pass dimensions that you know match the sample
There was a problem hiding this comment.
Updated in the latest revision
| CMD echo "EFA Installer Version: $EFA_INSTALLER_VERSION" && \ | ||
| echo "NCCL Version: $NCCL_VERSION" && \ | ||
| echo "AWS OFI NCCL Version: $AWS_OFI_NCCL_VERSION" && \ | ||
| printf "NVIDIA Driver Version: " && \ | ||
| nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1 |
There was a problem hiding this comment.
I think this would be better suited for an ENTRYPOINT script that logged this info and then ran whatever CMD was used
There was a problem hiding this comment.
Updated in the latest revision
| "os_type": osType, | ||
| } | ||
|
|
||
| // Create a job to fetch the logs of meta info |
There was a problem hiding this comment.
seems like you could just log these details in your actual test run instead of using a separate pod to print them
There was a problem hiding this comment.
I couldn't, I tried to add ENTRYPOINT in my dockerfile, but the nccl test pods doesn't print these details.
There was a problem hiding this comment.
Wondering the same..the launcher pods or worker pods should have these details right as they also run the entrypoint script ?
| ### Enter the Kubetest2 Container | ||
|
|
||
| ```bash | ||
| docker run --name kubetest2 -d -i -t kubetest2 /bin/sh | ||
| docker exec -it kubetest2 sh |
There was a problem hiding this comment.
I would just build the deployer and e2e-nvidia binary locally, would be simpler + faster during dev
There was a problem hiding this comment.
Isn't the Kubetest2 the deployer?
| job := &batchv1.Job{ | ||
| ObjectMeta: metav1.ObjectMeta{ | ||
| Name: "metadata-job", | ||
| Namespace: "default", | ||
| }, | ||
| Spec: batchv1.JobSpec{ | ||
| Template: v1.PodTemplateSpec{ | ||
| Spec: v1.PodSpec{ | ||
| RestartPolicy: v1.RestartPolicyNever, | ||
| Containers: []v1.Container{ | ||
| { | ||
| Name: "metadata-job", | ||
| Image: *nvidiaTestImage, | ||
| ImagePullPolicy: v1.PullAlways, | ||
| Resources: v1.ResourceRequirements{ | ||
| Limits: v1.ResourceList{ | ||
| "nvidia.com/gpu": node.Status.Capacity["nvidia.com/gpu"], | ||
| "vpc.amazonaws.com/efa": node.Status.Capacity["vpc.amazonaws.com/efa"], | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| } |
There was a problem hiding this comment.
I think we can use a template here to reduce the function size
There was a problem hiding this comment.
Updated in new rev
…rometheus workspace
| ObjectMeta: metav1.ObjectMeta{Name: "metadata-job", Namespace: "default"}, | ||
| } | ||
| err = wait.For(fwext.NewConditionExtension(cfg.Client().Resources()).JobSucceeded(job), | ||
| wait.WithContext(ctx)) |
There was a problem hiding this comment.
Can we add some comments around - purpose of this job and what it running
| ARG EFA_INSTALLER_VERSION=latest | ||
| # Add ENV to make ARG values available at runtime | ||
| ARG EFA_INSTALLER_VERSION=1.34.0 | ||
| ARG NCCL_VERSION=2.18.5 |
There was a problem hiding this comment.
Why are we using an older version of nccl here ? General recommendation is to use either of last 2 releases (preferred n-1 as latest might have issues)
| ) | ||
|
|
||
| type MetricManager struct { | ||
| // Metadata map[string]string |
There was a problem hiding this comment.
Are we planning to use this field later ?
|
@weicongw Can we also rebase the PR ? Thanks |
Add support to emit metric to the target Amazon Managed Service for Prometheus workspace
Beta
Issue #, if available:
Description of changes:
Test
Query the metric from AMP
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.