
fix(kuma-dp): bridge self metrics to otel path #16226

Merged
Automaat merged 2 commits into master from fix/kuma-dp-otel-self-metrics on Apr 13, 2026
Conversation

Contributor

@Automaat Automaat commented Apr 11, 2026

Motivation

PR #16201 added new Kuma-DP Prometheus metrics (DNS proxy, config fetcher, etc.) registered on `prometheus.DefaultRegisterer`. Under MeshMetric with an OpenTelemetry backend these metrics never reach the pipeline: the OTel path uses `AggregatedProducer`, which only HTTP-scrapes Envoy and user apps, so anything registered on the kuma-dp process itself is silently dropped.

The Prometheus backend path is unaffected: the hijacker HTTP handler already serves `DefaultGatherer` on `/meshmetric`.

Symptom: `kuma_dp_dns_*` series are missing from Prometheus when the mesh is configured with a MeshOpenTelemetryBackend. This also silently affects the existing `kuma_dp_envoyconfigfetcher_*` metrics and the `kuma_dp_dns_request_duration_seconds_*` histogram referenced by the workload-debug dashboard.

Implementation information

Two commits:

  1. `fix(kuma-dp): bridge self metrics to otel path`: add a second `sdkmetric.Producer` to the `PeriodicReader` in `startExporter` that wraps `prometheus.DefaultGatherer` via `go.opentelemetry.io/contrib/bridges/prometheus`. The bridge package is already a direct dependency: `pkg/plugins/runtime/opentelemetry/metrics.go` uses the same pattern for the control-plane OTel pusher. Both pipe and direct (Envoy-socket) OTel backends flow through `startExporter`, so one change covers both paths. No new dependencies, no config surface change.
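The wiring described in (1) can be sketched roughly as follows. This is an illustrative fragment, not the actual kuma-dp code: the function name `newSelfMetricsReader` and the `scrapeProducer` parameter are hypothetical stand-ins for the existing values inside `startExporter`; the APIs used are from the OTel Go SDK and the contrib Prometheus bridge.

```go
// Sketch only: newSelfMetricsReader and scrapeProducer are illustrative
// names, not the real kuma-dp identifiers.
package meshmetrics

import (
	otelprom "go.opentelemetry.io/contrib/bridges/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func newSelfMetricsReader(exporter sdkmetric.Exporter, scrapeProducer sdkmetric.Producer) *sdkmetric.PeriodicReader {
	return sdkmetric.NewPeriodicReader(
		exporter,
		// Existing producer: HTTP-scrapes Envoy and user applications.
		sdkmetric.WithProducer(scrapeProducer),
		// New producer: bridges prometheus.DefaultGatherer, so metrics the
		// kuma-dp process registers on DefaultRegisterer reach the pipeline.
		sdkmetric.WithProducer(otelprom.NewMetricProducer()),
	)
}
```

With both producers attached, every periodic collection merges the scraped Envoy/app metrics and the process-local Prometheus metrics into the same OTLP export.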

  2. `fix(dns): fix race in multi-listener shutdown`: a drive-by fix for a race exposed by the existing `returns the first listener error when another listener started` test. When one DNS listener fails immediately (e.g. bind `address already in use`) and another is racing to call `NotifyStartedFunc`, the shutdown loop may read `started[i] = false` and skip `Shutdown()`. The drain-`errCh` loop then blocks forever, waiting for an exit signal that never comes. Fix: retry `Shutdown()` in a goroutine until either the server transitions to started or its per-listener `done` channel fires. This surfaced as a flake on the original CI run for this PR; with the fix, `ginkgo -repeat=5` passes locally.

Alternatives considered for (1):

  • Adding a synthetic `ApplicationToScrape` pointing at the hijacker's own `/meshmetric` path: creates a self-scrape loop and duplicates parse/serialize work.
  • Migrating `kuma_dp_dns_*` to native OTel instruments: larger blast radius, and it breaks the Prometheus backend path, which currently works.

Supporting documentation

Follow-up to #16201.

Changelog: fix(kuma-dp): ship kuma-dp self metrics to OpenTelemetry backends

Signed-off-by: Marcin Skalski <skalskimarcin33@gmail.com>
@Automaat Automaat added the ci/run-full-matrix PR: Runs all possible e2e test combination (expensive use carefully) label Apr 11, 2026
@Automaat Automaat requested a review from a team as a code owner April 11, 2026 12:07
@Automaat Automaat requested review from bartsmykla and slonka April 11, 2026 12:07
Copilot AI review requested due to automatic review settings April 11, 2026 12:07
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a gap where Kuma-DP process-local Prometheus metrics (e.g. DNS proxy and config fetcher metrics registered on `prometheus.DefaultRegisterer`) were not being exported when MeshMetric is configured with an OpenTelemetry backend, because the OTel path only collected metrics from the scraping-based `AggregatedProducer`.

Changes:

  • Bridge Kuma-DP’s own Prometheus DefaultGatherer metrics into the OTel export pipeline using go.opentelemetry.io/contrib/bridges/prometheus.
  • Add the bridge as an additional sdkmetric.Producer on the OTLP PeriodicReader, alongside the existing scraping producer.

@github-actions
Contributor

Reviewer Checklist

🔍 Each of these sections needs to be checked by the reviewer of the PR 🔍:
If something doesn't apply, please check the box and add a justification if the reason is non-obvious.

  • Is the PR title satisfactory? Is this part of a larger feature and should be grouped using > Changelog?
  • PR description is clear and complete. It links to relevant issues as well as docs and UI issues
  • This will not break child repos: it doesn't hardcode values (e.g. "kumahq" as an image registry)
  • IPv6 is taken into account (e.g. no string concatenation of host and port)
  • Tests (unit tests, E2E tests, manual tests on universal and k8s)
    • Don't forget ci/ labels to run additional/fewer tests
  • Does this contain a change that needs to be notified to users? In this case, UPGRADE.md should be updated.
  • Does it need to be backported according to the backporting policy? (this GH action will add the "backport" label based on these file globs; to prevent it from adding the label, use the no-backport-autolabel label)


@utafrali utafrali left a comment


Correct, minimal fix that follows the established pattern from the CP OTel pusher and uses an already-direct dependency. The two comments are low-severity: one is a scope clarification and the other is a test coverage suggestion for a code path that was already untested before this PR.

Comment thread app/kuma-dp/pkg/dataplane/meshmetrics/component.go
Comment thread app/kuma-dp/pkg/dataplane/meshmetrics/component.go
@Automaat Automaat merged commit 6d32503 into master Apr 13, 2026
48 of 50 checks passed
@Automaat Automaat deleted the fix/kuma-dp-otel-self-metrics branch April 13, 2026 08:21
Automaat added a commit that referenced this pull request Apr 14, 2026