Event Exporter

Overview

The Event Exporter streams health events from NVSentinel's datastore to external systems, transforming them into CloudEvents format for integration with enterprise monitoring, analytics, and alerting platforms.

Think of it as a data bridge - it takes health events generated by NVSentinel and delivers them to your external systems for centralized visibility, long-term storage, or integration with existing incident management workflows.

Why Do You Need This?

While NVSentinel handles automated remediation within the cluster, you often have additional needs:

  • Centralized monitoring: Aggregate events from multiple clusters into a single pane of glass
  • Long-term analytics: Store events in data warehouses for trend analysis and reporting
  • Integration: Feed events into existing incident management, ticketing, or alerting systems
  • Compliance: Meet audit and compliance requirements for event logging
  • Multi-cluster visibility: Track GPU health across your entire infrastructure

The Event Exporter enables these use cases by streaming events to your external systems in real-time using industry-standard CloudEvents format.

How It Works

The Event Exporter runs as a Kubernetes Deployment in the cluster:

  1. Watches the datastore for new health events using change streams
  2. On first startup (if enabled), backfills historical events from the past N days
  3. Transforms health events into CloudEvents format with custom metadata
  4. Publishes events to configured HTTP endpoint with OIDC authentication
  5. Tracks progress using resume tokens for reliable delivery
  6. Retries failed publishes with exponential backoff

The exporter maintains at-least-once delivery semantics by persisting resume tokens, ensuring no events are lost even if the exporter restarts.
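The watch-transform-publish loop and its resume-token bookkeeping can be sketched as follows. The `Change` type, `publish`, and `saveToken` here are hypothetical stand-ins for NVSentinel's actual datastore and sink interfaces; only the control flow mirrors the steps above.

```go
package main

import "fmt"

// Hypothetical change-stream record: one health event plus the resume
// token marking its position in the stream. The real NVSentinel types
// differ; this sketch shows the control flow only.
type Change struct {
	Event       string
	ResumeToken int
}

// run publishes each change and persists its resume token only after
// the publish succeeds. On restart, consumption resumes from the last
// saved token, so every event is delivered at least once (a crash
// between publish and save simply re-sends that event).
func run(changes []Change, publish func(string) error, saveToken func(int)) error {
	for _, c := range changes {
		if err := publish(c.Event); err != nil {
			return err // caller retries with backoff
		}
		saveToken(c.ResumeToken)
	}
	return nil
}

func main() {
	var sent []string
	last := -1
	changes := []Change{{"gpu-xid-79", 1}, {"gpu-thermal", 2}}
	if err := run(changes,
		func(e string) error { sent = append(sent, e); return nil },
		func(t int) { last = t }); err != nil {
		panic(err)
	}
	fmt.Println(sent, last) // [gpu-xid-79 gpu-thermal] 2
}
```

Persisting the token after the publish (not before) is what makes delivery at-least-once rather than at-most-once.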

Configuration

Configure the Event Exporter through Helm values:

event-exporter:
  enabled: true
  
  # OIDC secret (must be created manually)
  oidcSecretName: "event-exporter-oidc-secret"
  
  exporter:
    # Metadata included with every event
    metadata:
      cluster: "production-us-west"
      environment: "production"
      region: "us-west-2"
    
    # Destination endpoint
    sink:
      endpoint: "https://events.example.com/api/v1/events"
      timeout: "30s"
      insecureSkipVerify: false
    
    # OIDC authentication
    oidc:
      tokenUrl: "https://auth.example.com/oauth2/token"
      clientId: "nvsentinel-exporter"
      scope: "events:write"
      insecureSkipVerify: false
    
    # Historical event backfill
    backfill:
      enabled: true
      maxAge: "720h"      # 30 days
      maxEvents: 1000000
      batchSize: 500
      rateLimit: 1000     # events/second
    
    # Concurrent publish workers
    workers: 10           # See scale-up guide below
    
    # Failure handling
    failureHandling:
      maxRetries: 17      # ~30 minutes
      initialBackoff: "1s"
      maxBackoff: "5m"
      backoffMultiplier: 2.0

Configuration Options

  • Metadata: Custom key-value pairs included with every event (cluster name is required)
  • Sink Endpoint: HTTP/HTTPS URL where CloudEvents are posted
  • OIDC Authentication: OAuth2 client credentials for endpoint authentication
  • Backfill: On first startup, optionally export historical events (disabled after initial run)
  • Workers: Number of concurrent goroutines that publish events to the sink in parallel; default is 10. See Concurrent Workers under Key Features for the ordering and delivery guarantees.
  • Retry Policy: Exponential backoff configuration for failed publishes

CloudEvents Format

Events are transformed into CloudEvents v1.0 format:

{
  "specversion": "1.0",
  "type": "com.nvidia.nvsentinel.health.v1",
  "source": "nvsentinel://production-us-west/healthevents",
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "time": "2025-11-27T10:30:00Z",
  "data": {
    "metadata": {
      "cluster": "production-us-west",
      "environment": "production",
      "region": "us-west-2"
    },
    "healthEvent": {
      "version": "v1",
      "agent": "gpu-health-monitor",
      "componentClass": "GPU",
      "checkName": "XIDError",
      "nodeName": "gpu-node-01",
      "message": "GPU XID error detected",
      "isFatal": true,
      "isHealthy": false,
      "recommendedAction": 2,
      "errorCode": ["XID_79"],
      "entitiesImpacted": [
        {
          "entityType": "GPU",
          "entityValue": "GPU-abc123"
        }
      ],
      "generatedTimestamp": "2025-11-27T10:30:00Z"
    }
  }
}

Key Features

CloudEvents Standard

Uses industry-standard CloudEvents v1.0 format for broad compatibility with event processing platforms.

Historical Backfill

On first deployment, optionally exports up to N days of historical events for complete visibility.

Resume Token Tracking

Persists progress in the datastore to ensure at-least-once delivery - no events lost on restart.

OIDC Authentication

Supports OAuth2 client credentials flow with automatic token refresh for secure authentication.

Exponential Backoff

Retries failed publishes with configurable exponential backoff (up to ~30 minutes by default).

Custom Metadata

Enriches every event with custom metadata (cluster, environment, region, etc.) for filtering and routing.

Rate Limiting

Configurable rate limiting for backfill to avoid overwhelming destination systems.

Concurrent Workers

Publishes events in parallel using a configurable worker pool. A sequence tracker ensures resume tokens advance in strict order regardless of which worker finishes first, preserving at-least-once delivery guarantees. Note that concurrent publishing means events may arrive at the sink out of order. See the configuration reference for sizing guidance.

Change Stream Based

Uses datastore change streams for real-time event delivery with minimal latency.