fix(kuma-dp): bridge self metrics to otel path #16226
Signed-off-by: Marcin Skalski <skalskimarcin33@gmail.com>
Pull request overview
This PR fixes a gap where Kuma-DP process-local Prometheus metrics (e.g., DNS proxy and config fetcher metrics registered on prometheus.DefaultRegisterer) were not being exported when MeshMetric is configured with an OpenTelemetry backend, because the OTel path only collected metrics from the scraping-based AggregatedProducer.
Changes:
- Bridge Kuma-DP's own Prometheus `DefaultGatherer` metrics into the OTel export pipeline using `go.opentelemetry.io/contrib/bridges/prometheus`.
- Add the bridge as an additional `sdkmetric.Producer` on the OTLP `PeriodicReader`, alongside the existing scraping producer.
utafrali left a comment
Correct, minimal fix that follows the established pattern from the CP OTel pusher and uses an already-direct dependency. The two comments are low-severity: one is a scope clarification and the other is a test coverage suggestion for a code path that was already untested before this PR.
## Motivation

PR #16201 added new Kuma-DP Prometheus metrics (DNS proxy, config fetcher, etc.) registered on `prometheus.DefaultRegisterer`. Under MeshMetric with an OpenTelemetry backend these metrics never reach the pipeline: the OTel path uses `AggregatedProducer`, which only HTTP-scrapes Envoy and user applications, so anything registered on the kuma-dp process itself is silently dropped. The Prometheus backend path is unaffected: the hijacker HTTP handler already serves `DefaultGatherer` on `/meshmetric`.

Symptom: `kuma_dp_dns_*` series are missing when the mesh is configured with a MeshOpenTelemetryBackend. This also silently affects the existing `kuma_dp_envoyconfigfetcher_*` metrics and the `kuma_dp_dns_request_duration_seconds_*` histogram referenced by the workload-debug dashboard.

## Implementation information

Two commits:

1. **`fix(kuma-dp): bridge self metrics to otel path`**: add a second `sdkmetric.Producer` to the `PeriodicReader` in `startExporter` that wraps `prometheus.DefaultGatherer` via `go.opentelemetry.io/contrib/bridges/prometheus`. The bridge package is already a direct dependency; `pkg/plugins/runtime/opentelemetry/metrics.go` uses the same pattern for the control-plane OTel pusher. Both pipe and direct (Envoy-socket) OTel backends flow through `startExporter`, so one change covers both paths. No new dependencies, no config surface change.
2. **`fix(dns): fix race in multi-listener shutdown`**: drive-by fix for a race exposed by the existing `returns the first listener error when another listener started` test. When one DNS listener fails immediately (e.g. bind `address already in use`) while another is racing to call `NotifyStartedFunc`, the shutdown loop may read `started[i] = false` and skip `Shutdown()`. The drain-errCh loop then blocks forever, waiting for an exit signal that never comes. Fix: retry `Shutdown()` in a goroutine until either the server transitions to started or its per-listener `done` channel fires. This surfaced as a flake on the original CI run for this PR; ginkgo with `-repeat=5` is now stable locally.

Alternatives considered for (1):

- Adding a synthetic `ApplicationToScrape` pointing at the hijacker's own `/meshmetric` path: creates a self-scrape loop and duplicates parse/serialize work.
- Migrating `kuma_dp_dns_*` to native OTel instruments: larger blast radius, and it breaks the Prometheus backend path, which currently works.

## Supporting documentation

Follow-up to #16201.

> Changelog: fix(kuma-dp): ship kuma-dp self metrics to OpenTelemetry backends
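Based on the description of commit (1), the producer wiring in `startExporter` plausibly looks like the sketch below. This is a hedged illustration, not Kuma's actual code: the function name `newReader` and the `aggregatedProducer` parameter are placeholders, and the gatherer is passed explicitly even though the bridge defaults to `prometheus.DefaultGatherer`.

```go
package dpserver

import (
	"github.com/prometheus/client_golang/prometheus"
	prombridge "go.opentelemetry.io/contrib/bridges/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newReader (illustrative name) attaches two producers to the OTLP
// PeriodicReader: the existing scrape-based producer plus a bridge over
// the process-local default Prometheus registry.
func newReader(exporter sdkmetric.Exporter, aggregatedProducer sdkmetric.Producer) *sdkmetric.PeriodicReader {
	return sdkmetric.NewPeriodicReader(
		exporter,
		// Existing path: metrics HTTP-scraped from Envoy and user applications.
		sdkmetric.WithProducer(aggregatedProducer),
		// New path: kuma-dp's own metrics (DNS proxy, config fetcher, ...)
		// registered on prometheus.DefaultRegisterer.
		sdkmetric.WithProducer(prombridge.NewMetricProducer(
			prombridge.WithGatherer(prometheus.DefaultGatherer),
		)),
	)
}
```

Because `WithProducer` is additive, the bridge rides along on the same export interval and OTLP exporter as the scraping producer, which is why no config surface change is needed.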