This guide covers project structure, development workflows, and best practices for contributing to NVSentinel.
- Getting Started
- Project Architecture
- Development Environment Setup
- Development Workflows
- Module Development
- Testing
- Code Standards
- CI/CD Pipeline
- Debugging
- Makefile Reference
git clone https://github.com/nvidia/nvsentinel.git
cd nvsentinel
make dev-env-setup # Install all dependencies

NVSentinel uses .versions.yaml for centralized version management across:
- Local development
- CI/CD pipelines
- Container builds
View current versions:
make show-versions

Core Tools (required):
- Go 1.25+ (see `.versions.yaml` for the exact version)
- Docker
- kubectl
- Helm 3.0+
- Protocol Buffers Compiler
- yq - YAML processor for version management
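Since `.versions.yaml` is the single source of truth for tool versions, scripts and CI read versions from it rather than hard-coding them. A minimal sketch of that pattern against a stand-in file (the key names and flat layout here are illustrative, not the real file):

```shell
# Create a stand-in versions file (key names are illustrative)
cat > /tmp/versions-demo.yaml <<'EOF'
GO_VERSION: "1.25.0"
SHELLCHECK_VERSION: "v0.11.0"
EOF

# The project uses yq; for a flat file like this sketch, awk works too:
SHELLCHECK_VERSION=$(awk -F': ' '$1 == "SHELLCHECK_VERSION" { gsub(/"/, "", $2); print $2 }' /tmp/versions-demo.yaml)
echo "$SHELLCHECK_VERSION"
```

In the real repository, `make show-versions` prints the equivalent information.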
Development Tools:
Optional (for local Kubernetes development):
Quick install (installs and configures all tools):
make dev-env-setup

This will:
- Detect your OS (Linux/macOS) and architecture
- Install yq and check for required tools
- Install development and Go tools
- Configure Python gRPC tools
Automated installation: To skip interactive prompts and auto-install all dependencies:
make dev-env-setup AUTO_MODE=true

Debugging setup issues: If the setup script fails, enable debug mode for detailed output:
DEBUG=true make dev-env-setup AUTO_MODE=true

This will show:
- Architecture detection and mappings
- URL construction for all downloads
- HTTP response codes for failed downloads
- Detailed error messages with suggestions
Common setup issues and solutions are documented in the Debugging section.
Unified build system features:
- Consistent interface: All modules support common targets (`all`, `lint-test`, `clean`)
- Technology-aware: Appropriate tooling for Go, Python, and shell scripts
- Delegation pattern: Top-level Makefiles delegate to individual modules
- Repo-root context: Docker builds use consistent paths
- Multi-platform support: Built-in `linux/arm64`, `linux/amd64` via Docker buildx
NVSentinel follows a microservices architecture with event-driven communication:
- Independence: Modules operate autonomously
- Event-Driven: Communication through MongoDB change streams
- Modular: Pluggable health monitors
- Cloud-Native: Kubernetes-first design
nvsentinel/
├── health-monitors/ # Hardware/software fault detection
│ ├── gpu-health-monitor/ # Python - DCGM GPU monitoring
│ ├── syslog-health-monitor/ # Go - System log monitoring
│ └── csp-health-monitor/ # Go - Cloud provider monitoring
├── platform-connectors/ # gRPC event ingestion service
├── fault-quarantine/ # CEL-based event quarantine logic
├── fault-remediation/ # Kubernetes controller for remediation
├── health-events-analyzer/ # Event analysis and correlation
├── health-event-client/ # Event streaming client
├── labeler/ # Node labeling controller
├── node-drainer/ # Graceful workload eviction
├── preflight/ # Admission webhook controller (Go)
├── preflight-checks/ # Init container check images
│ ├── dcgm-diag/ # Python - DCGM GPU diagnostics
│ ├── nccl-loopback/ # Go - Single-node NCCL test
│ └── nccl-allreduce/ # Python - Multi-node NCCL test
├── store-client/ # MongoDB interaction library (tested in CI)
└── log-collector/ # Log aggregation (shell scripts)
sequenceDiagram
participant HM as Health Monitor
participant PC as Platform Connectors
participant DB as MongoDB
participant FM as Fault Module
HM->>PC: gRPC health event
PC->>DB: Store event
DB->>FM: Change stream notification
FM->>DB: Query related events
FM->>K8s: Execute remediation action
Tilt provides the fastest development experience with hot reloading.
# Quick start - create cluster and start Tilt in one command
make dev-env # Creates cluster and starts Tilt
# Manual step-by-step approach
make cluster-create # Creates ctlptl-managed Kind cluster with registry
make tilt-up # Starts Tilt with UI (runs: tilt up -f tilt/Tiltfile)
# Check status
make cluster-status # Check cluster and registry status
# View Tilt UI
# Navigate to http://localhost:10350
# Stop everything when done
make dev-env-clean # Stops Tilt and deletes cluster
# Or stop individually
make tilt-down # Stops Tilt (runs: tilt down -f tilt/Tiltfile)
make cluster-delete # Deletes the cluster

ctlptl Cluster Features:
- Declarative cluster configuration with YAML
- Multi-node Kind cluster (3 control-plane, 2 worker nodes)
- Cluster name: `kind-nvsentinel` (the `kind-` prefix is required)
- Integrated local container registry at `localhost:5001`
- Automatic registry configuration for Tilt
- Simplified cluster lifecycle management
- No external dependencies beyond Docker, ctlptl, and Kind
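These features map onto a declarative ctlptl manifest. A hypothetical sketch of such a manifest (field values inferred from the feature list above, not copied from the repo; the actual cluster also defines the 3 control-plane / 2 worker node topology):

```yaml
# Hypothetical ctlptl manifest (illustrative; see the repo for the real file)
apiVersion: ctlptl.dev/v1alpha1
kind: Cluster
product: kind
name: kind-nvsentinel     # ctlptl requires the kind- prefix for Kind clusters
registry: ctlptl-registry # local registry, exposed e.g. at localhost:5001
```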
For module-specific development without full cluster:
# Set up Go environment
export GOPATH=$(go env GOPATH)
export GO_CACHE_DIR=$(go env GOCACHE)
# Install development dependencies
go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest
go install gotest.tools/gotestsum@latest
go install github.com/boumenot/gocover-cobertura@latest
# For controller modules
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest

Go module dependencies are handled automatically:
# Dependencies managed via go.mod files with replace directives for local development
# No manual GOPRIVATE configuration needed
# Private repository authentication handled via SSH keys

1. Start Development Session

   git checkout main
   git pull origin main
   git checkout -b feature/your-feature-name
   # Start local development environment
   make dev-env # Creates ctlptl-managed cluster and starts Tilt

2. Develop with Live Reload

   # Edit code - Tilt automatically rebuilds and redeploys
   vim health-monitors/syslog-health-monitor/pkg/monitor/monitor.go
   # View logs in Tilt UI at http://localhost:10350
   # Or use kubectl for specific logs
   # (note: syslog-health-monitor runs as a DaemonSet with -regular and -kata variants)
   kubectl logs -f daemonset/nvsentinel-syslog-health-monitor-regular -n nvsentinel

3. Test Changes

   # Run tests locally (while Tilt is running)
   make health-monitors-lint-test-all # All health monitors
   make health-events-analyzer-lint-test # Specific Go module
   make platform-connectors-lint-test # Another Go module
   # Or run individual module tests directly (using standardized targets)
   make -C health-monitors/syslog-health-monitor lint-test
   make -C platform-connectors lint-test
   make -C health-events-analyzer lint-test
   # Preflight E2E (cluster with preflight enabled, e.g. after make dev-env)
   # cd tests && go test -tags=amd64_group -run TestPreflightEndToEnd -v ./...
   # Test integration with other services via Tilt UI
   # Access services via port-forwards set up by Tilt

4. Validate Before Commit

   # Run full test suite
   make lint-test-all
   # Stop Tilt for final testing if needed
   make tilt-down

5. Commit and Push

   git add .
   git commit -s -m "feat: add new monitoring capability"
   git push origin feature/your-feature-name
   # Clean up development environment
   make dev-env-clean
When modifying .proto files:
# Generate protobuf files
make protos-lint
# This runs:
# - protoc generation for Go modules
# - Python protobuf generation for GPU monitor
# - Import path fixes for Python
# - Git diff check to ensure files are up to date

The project provides a unified Docker build system with consistent patterns across all modules. All builds support multi-platform architecture, build caching, and proper context management.
Set these for production-like builds:
# Docker configuration (standardized across all modules)
export CONTAINER_REGISTRY="ghcr.io"
export CONTAINER_ORG="your-github-username" # Defaults to repository owner
export CI_COMMIT_REF_NAME="feature-branch" # Or your branch name
# These are computed automatically by common.mk:
# SAFE_REF_NAME=$(echo $CI_COMMIT_REF_NAME | sed 's/\//-/g')
# PLATFORMS="linux/arm64,linux/amd64"
# MODULE_NAME=$(basename $(CURDIR))

Build System Overview
The Docker build system uses shared patterns via common.mk for Go modules, with specialized handling for Python and container-only modules. Each module maintains its own Docker configuration.
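One of the computed variables, `SAFE_REF_NAME`, can be reproduced by hand to see what tag a given branch produces (the branch name below is just an example):

```shell
# SAFE_REF_NAME replaces "/" with "-" so branch names are valid Docker tags
CI_COMMIT_REF_NAME="feature/your-feature-name"
SAFE_REF_NAME=$(echo "$CI_COMMIT_REF_NAME" | sed 's/\//-/g')
echo "$SAFE_REF_NAME"   # feature-your-feature-name
```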
Main build targets (delegated to individual modules):
# Local Development (--load) - builds images into local Docker daemon
make docker-all # All images locally (delegates to docker/Makefile)
make docker-health-monitors # All health monitor images locally
make docker-main-modules # All non-health-monitor images locally
# CI/Production (--push) - builds and pushes directly to registry
make docker-publish-all # Build and push all images to registry
make docker-publish-health-monitors # Build and push health monitor images
make docker-publish-main-modules # Build and push main module images
# Individual module targets (via common.mk)
make docker-syslog-health-monitor # Build syslog health monitor locally
make docker-publish-syslog-health-monitor # Build and push to registry
make docker-platform-connectors # Build platform connectors locally
make docker-publish-platform-connectors # Build and push to registry
# Special cases
make docker-gpu-health-monitor # Both DCGM 3.x and 4.x versions locally
make docker-log-collector # Container-only module (shell + Python)

Direct docker/ Makefile usage:
cd docker
# Local development builds (--load)
make build-all # Build all 12 images locally
make build-health-monitors # Build health monitor group locally
make build-syslog-health-monitor # Build specific module locally
# CI/production builds (--push)
make publish-all # Build and push all images to registry
make publish-syslog-health-monitor # Build and push specific image to registry
# Utility commands
make setup-buildx # Setup multi-platform builder
make clean # Remove all nvsentinel images
make list # List built nvsentinel images
make help # Show all available targets

Individual module usage:
# Go modules (common.mk patterns)
make -C health-monitors/syslog-health-monitor docker-build # Local build with remote cache
make -C health-monitors/syslog-health-monitor docker-build-local # Local build, no remote cache (faster)
make -C health-monitors/syslog-health-monitor docker-publish # CI build
make -C platform-connectors docker-build # Local build with remote cache
make -C platform-connectors docker-build-local # Local build, no remote cache (faster)
make -C platform-connectors docker-publish # CI build
make -C health-events-analyzer docker-build-local # Local build, no remote cache (faster)
make -C health-events-analyzer docker-publish # CI build
# Python module (specialized patterns)
make -C health-monitors/gpu-health-monitor docker-build-dcgm3 # DCGM 3.x local
make -C health-monitors/gpu-health-monitor docker-publish-dcgm4 # DCGM 4.x CI
# Container-only module (shell + Python)
make -C log-collector docker-build-log-collector # Local build
make -C log-collector docker-publish-log-collector # CI build

Each module provides Docker targets with common patterns:
# Go modules (common.mk patterns)
make -C health-monitors/syslog-health-monitor docker-build # Local with remote cache
make -C health-monitors/syslog-health-monitor docker-build-local # Local, no remote cache (recommended)
make -C health-monitors/syslog-health-monitor docker-publish # CI/production
make -C platform-connectors docker-build # Local with remote cache
make -C platform-connectors docker-build-local # Local, no remote cache (recommended)
make -C platform-connectors docker-publish # CI/production
# Python module (specialized patterns)
make -C health-monitors/gpu-health-monitor docker-build-dcgm3 # DCGM 3.x local
make -C health-monitors/gpu-health-monitor docker-build-dcgm4 # DCGM 4.x local
make -C health-monitors/gpu-health-monitor docker-publish # Push both versions
# Container-only module (shell + Python)
make -C log-collector docker-build # Both log-collector and file-server-cleanup
make -C log-collector docker-publish # Push both components
# Legacy compatibility (all modules)
make -C [module] image # Calls docker-build
make -C [module] publish # Calls docker-publish

All builds support consistent features:
- Multi-Platform Support: `linux/arm64`, `linux/amd64` via `common.mk` (`docker-build`); `linux/amd64` for `docker-build-local`
- Build Caching: Registry-based build cache for faster builds (`docker-build`); local cache only (`docker-build-local`)
- Repo-Root Context: All builds use a consistent repo-root context
- Dynamic Tagging: Uses the branch/tag name (`${SAFE_REF_NAME}`) for `docker-build`; simple `module:local` for `docker-build-local`
- Registry Integration: NVCR.io registry paths for `docker-build` and `docker-publish`
- Module Auto-Detection: Automatic module name detection via `$(MODULE_NAME)`
For Local Development, use docker-build-local to avoid registry authentication issues and build faster.
Local Development:
# Recommended: Fast local build (single platform, no remote cache)
make -C health-monitors/syslog-health-monitor docker-build-local
# Alternative: Local build with remote cache (multi-platform, slower)
make -C health-monitors/syslog-health-monitor docker-build
# Legacy: Quick local build
make -C health-monitors/syslog-health-monitor image

CI-like Build:
# Set up environment like GitHub Actions
export CONTAINER_REGISTRY="ghcr.io"
export CONTAINER_ORG="your-github-username"
export CI_COMMIT_REF_NAME="main"
# Build all images with full CI features (standardized)
make docker-all
# Images will be tagged like:
# ghcr.io/your-github-username/syslog-health-monitor:main
# ghcr.io/nvidia/nvsentinel/gpu-health-monitor:main-dcgm-3.x
# ghcr.io/nvidia/nvsentinel/gpu-health-monitor:main-dcgm-4.x

Testing Specific Module:
# Recommended: Build and test individual module (fast, local)
make -C platform-connectors docker-build-local
docker run --rm platform-connectors:local --help
# Alternative: Build with full CI features
make docker-platform-connectors
docker run --rm ghcr.io/nvidia/nvsentinel/platform-connectors:fix-make-file-targets --help
# Build private repo module (fast, local)
make -C health-events-analyzer docker-build-local

The build system uses the Docker BuildKit registry cache:
- First build: Downloads and caches layers
- Subsequent builds: Reuses cached layers for 10x+ speed improvement
- Multi-developer: Cache shared across team via registry
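This behavior comes from BuildKit's registry cache exporter. A hedged sketch of the kind of flags involved (the cache ref and image name are illustrative, not the project's actual values; the sketch only prints the command rather than invoking Docker):

```shell
# Illustrative buildx invocation with a registry-backed cache (not executed here).
CACHE_REF="ghcr.io/example-org/nvsentinel-buildcache/my-module"
BUILD_CMD="docker buildx build \
  --cache-from=type=registry,ref=${CACHE_REF} \
  --cache-to=type=registry,ref=${CACHE_REF},mode=max \
  --platform=linux/amd64,linux/arm64 \
  -t my-module:dev ."
echo "$BUILD_CMD"
```

`mode=max` exports cache metadata for all layers, which is what lets teammates reuse intermediate stages, not just the final image layers.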
Build failures:
# Check buildx setup
make -C docker setup-buildx
# Clean and retry
make -C docker clean
docker system prune -f
make docker-syslog-health-monitor

Private repo access:
# Verify SSH key access
git ls-remote git@github.com:dgxcloud/mk8s/some-private-repo.git
# Build with debug output
BUILDKIT_PROGRESS=plain make docker-csp-health-monitor

Registry issues:
# Test registry login
docker login nvcr.io -u '$oauthtoken' -p "$NGC_PASSWORD"
# Check image tags
make -C docker list

Problem: On macOS with Docker Desktop, Unix domain sockets require the /var/run directory to exist inside containers, but this directory is not created by default in minimal container images.
Symptoms:
- Services fail to start with errors like `failed to listen on unix socket /var/run/nvsentinel.sock: no such file or directory`
- Tilt-based tests fail on macOS but pass on Linux
- gRPC Unix socket connections fail
Solution: The project includes a Tilt-specific Helm values file that creates the /var/run directory using an initContainer:
# File: distros/kubernetes/nvsentinel/values-tilt-socket.yaml
#
# This values file is automatically included when running Tilt on macOS/Docker Desktop.
# It adds an initContainer to create /var/run directory for Unix socket communication.
global:
initContainers:
- name: create-run-dir
image: busybox:latest
command: ['sh', '-c', 'mkdir -p /var/run']
volumeMounts:
- name: socket-dir
mountPath: /var/run

How it works:
- The `tilt/Tiltfile` automatically includes `values-tilt-socket.yaml` for local development
- The initContainer runs before each service starts and creates the `/var/run` directory
- Services can then create Unix sockets at `/var/run/nvsentinel.sock`
- The socket directory is shared via an `emptyDir` volume mount
Platform-specific behavior:
- macOS/Docker Desktop: Requires the initContainer workaround (automatically applied in Tilt)
- Linux: The `/var/run` directory typically exists in the container runtime environment
- Production/Kubernetes: Uses standard Helm values without the initContainer (not needed)
Note: This is a development-only workaround for local macOS environments. Production deployments on Linux do not require this configuration.
1. Create Module Structure

   mkdir -p health-monitors/my-monitor/{cmd,pkg,internal}
   cd health-monitors/my-monitor

2. Initialize Go Module

   go mod init github.com/nvidia/nvsentinel/health-monitors/my-monitor

3. Create Module Makefile

   # Copy template from existing health monitor
   cp ../syslog-health-monitor/Makefile ./Makefile
   # Update module-specific settings
   sed -i 's/syslog-health-monitor/my-monitor/g' Makefile
   sed -i 's/Syslog Health Monitor/My Monitor/g' Makefile

4. Implement gRPC Client

   // pkg/monitor/monitor.go
   package monitor

   import (
       "context"

       pb "github.com/nvidia/nvsentinel/platform-connectors/pkg/protos"
   )

   type Monitor struct {
       client pb.PlatformConnectorClient
   }

   func (m *Monitor) SendEvent(ctx context.Context, event *pb.HealthEvent) error {
       _, err := m.client.SendHealthEvent(ctx, event)
       return err
   }

5. Update health-monitors/Makefile

   # Add your module to the health monitors list
   # Edit health-monitors/Makefile:
   # - Add 'my-monitor' to the GO_HEALTH_MONITORS list
   # - Add lint-test, build, and clean delegation targets

6. Test Your Module

   # Test the individual module
   make -C health-monitors/my-monitor lint-test
   # Test via health-monitors coordination
   make -C health-monitors lint-test-my-monitor
   # Test via main Makefile delegation
   make health-monitors-lint-test-all

7. Add to CI Pipeline

   The module will automatically be included in GitHub Actions workflows thanks to the standardized patterns.
1. Follow Kubernetes Controller Pattern

   # Use controller-runtime for Kubernetes controllers
   go get sigs.k8s.io/controller-runtime

2. Implement MongoDB Change Streams

   // Use store-client for MongoDB operations
   import "github.com/nvidia/nvsentinel/store-client/pkg/client"

3. Add Proper RBAC

   Create Kubernetes RBAC manifests in distros/kubernetes/nvsentinel/templates/.
Preflight checks run as init containers injected by the preflight webhook. Each check is an independent container image under preflight-checks/. Checks are language-agnostic — the only requirement is a container that reads configuration from environment variables, runs a diagnostic, and on failure sends a HealthEventOccurredV1 gRPC call to the platform connector Unix socket (PLATFORM_CONNECTOR_SOCKET). Existing checks use Go and Python, but any language with gRPC support works.
The webhook reads the initContainers list from its config (sourced from the Helm chart preflight.initContainers). Every container in that list is injected as an init container into GPU pods in labeled namespaces. The webhook automatically provides:
- `NODE_NAME` (downward API), `PLATFORM_CONNECTOR_SOCKET`, and `PROCESSING_STRATEGY` on all init containers
- `DCGM_DIAG_LEVEL` and `DCGM_HOSTENGINE_ADDR` on the container named `preflight-dcgm-diag`
- `GANG_ID`, `GANG_CONFIG_DIR`, `GANG_TIMEOUT_SECONDS`, and `POD_NAME` on gang-aware containers (when gang coordination is enabled)
Your check reads its configuration from environment variables and exits with code 0 on success, non-zero on failure.
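That contract can be sketched as a trivial skeleton (everything here is hypothetical: `MY_THRESHOLD`, the hard-coded measurement, and the check logic; a real check would also send the `HealthEventOccurredV1` gRPC call over `PLATFORM_CONNECTOR_SOCKET` before failing):

```shell
# Hypothetical preflight-check skeleton: env-var config in, exit code out.
run_check() {
  threshold="${MY_THRESHOLD:-42}"   # configuration via environment variable
  measured=10                       # stand-in for a real diagnostic result
  if [ "$measured" -gt "$threshold" ]; then
    echo "check failed: measured $measured exceeds threshold $threshold"
    return 1                        # a real check would report the event, then exit 1
  fi
  echo "check passed"
  return 0                          # exit 0 tells the webhook the check succeeded
}
run_check
```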
mkdir -p preflight-checks/my-check
cd preflight-checks/my-check

For example, with Go:
go mod init github.com/nvidia/nvsentinel/preflight-checks/my-check

A preflight check follows this pattern:
- Load configuration from environment variables
- Run the diagnostic
- On failure, send a health event to the platform connector via the Unix domain socket (`PLATFORM_CONNECTOR_SOCKET`) using the `HealthEventOccurredV1` gRPC call, then `exit 1`
- On success, optionally send a healthy event, then `exit 0`
Go example — see preflight-checks/nccl-loopback/main.go for the full pattern:
reporter := health.NewReporter(cfg.ConnectorSocket, cfg.NodeName, cfg.ProcessingStrategy)
// ... run your diagnostic ...
if failed {
reporter.SendEvent(ctx, false, true, "my-check failed: ...", "MY_ERROR_CODE")
os.Exit(1)
}
os.Exit(0)

Python example — see preflight-checks/dcgm-diag/dcgm_diag/__main__.py:
reporter = HealthReporter(socket_path=cfg.connector_socket, node_name=cfg.node_name,
processing_strategy=cfg.processing_strategy)
# ... run your diagnostic ...
if failures:
reporter.send_event(gpu_uuid=uuid, is_healthy=False, is_fatal=True, message="...")
sys.exit(1)
sys.exit(0)

For the gRPC protobuf definitions, generate from data-models/protobufs/health_event.proto (see existing Makefiles for protos-generate targets).
Build a container image from the repo root context:
FROM <base-image>
COPY preflight-checks/my-check/ /app/
# ... install deps, set entrypoint ...

Copy from an existing check and update the module name:
cp ../nccl-loopback/Makefile ./Makefile
# Update IMAGE_NAME, MODULE_NAME, and language-specific settings

Key targets: lint-test, docker-build, docker-publish.
Add the container to initContainers in distros/kubernetes/nvsentinel/charts/preflight/values.yaml:
initContainers:
# ... existing checks ...
- name: preflight-my-check
image: ghcr.io/nvidia/nvsentinel/preflight-my-check:latest
env:
- name: MY_THRESHOLD
value: "42"
volumeMounts:
- name: nvsentinel-socket
mountPath: /var/run

The nvsentinel-socket volume mount gives the init container access to the platform connector Unix socket. The webhook creates this volume automatically.
Add entries in four workflow files:
- `.github/workflows/lint-test.yml`: add your component to the `preflight-checks-lint-test` matrix
- `.github/workflows/container-build-test.yml`: add a `docker-build` matrix entry
- `.github/workflows/publish.yml`: add a `docker-publish` matrix entry
- `.github/workflows/cleanup-untagged-images.yml`: add the image name for cleanup
# Lint and unit test
make -C preflight-checks/my-check lint-test
# Build the image
make -C preflight-checks/my-check docker-build
# Run the E2E suite (cluster must be running with preflight enabled)
cd tests && go test -tags=amd64_group -run TestPreflightEndToEnd -v ./...

For Tilt development, override initContainers in distros/kubernetes/nvsentinel/values-tilt.yaml to include or stub your check.
See Preflight configuration for the full operator guide including per-check env var reference.
- Unit Tests: Test individual functions and methods
- Integration Tests: Test module interactions
- End-to-End Tests: Test complete workflows via CI
The unified Makefile structure provides consistent testing across all modules:
# Test all modules (delegates to all sub-Makefiles)
make lint-test-all # Main Makefile - runs everything
# Test by category
make health-monitors-lint-test-all # All health monitors
make go-lint-test-all # All Go modules (common.mk patterns)
# Test individual modules via delegation (main Makefile)
make health-events-analyzer-lint-test # Go module
make platform-connectors-lint-test # Go module
make store-client-lint-test # Go module
make log-collector-lint-test # Container module
# Test individual modules directly (common.mk patterns)
make -C health-monitors/syslog-health-monitor lint-test # Go module
make -C platform-connectors lint-test # Go module
make -C health-events-analyzer lint-test # Go module
make -C health-monitors/gpu-health-monitor lint-test # Python module
# Use individual targets for development (common.mk)
cd health-monitors/syslog-health-monitor
make vet # Just go vet
make lint # Just golangci-lint
make test # Just tests
make coverage # Tests + coverage
make build # Build module
make binary # Build main binary
# Run specific test with verbose output
cd platform-connectors
go test -v ./pkg/connectors/...

Each module must include:
- Unit tests with the `_test.go` suffix
- Coverage reporting via `go test -coverprofile`
- Integration tests where applicable
- Mocks for external dependencies
# Using the module's Makefile (recommended)
make -C health-monitors/gpu-health-monitor lint-test # Full lint-test
make -C health-monitors/gpu-health-monitor setup # Just Poetry setup
make -C health-monitors/gpu-health-monitor lint # Just Black check
make -C health-monitors/gpu-health-monitor test # Just tests
make -C health-monitors/gpu-health-monitor format # Run Black formatter
# Manual Poetry commands
cd health-monitors/gpu-health-monitor
poetry install
poetry run pytest -v
poetry run black --check .
poetry run coverage run --source=gpu_health_monitor -m pytest

- Linting: Use `golangci-lint` with the project configuration
- Formatting: Use `gofmt` (enforced by linting)
- Imports: Group standard, third-party, and local imports
- Error Handling: Always check and handle errors appropriately
- Context: Pass `context.Context` for cancellation and timeouts
- All tests pass
- Code coverage maintained or improved
- No linting violations
- Proper error handling
- Documentation updated
- License headers present
- Signed commits (`git commit -s`)
All source files must include the Apache 2.0 license header:
# Add license headers to new files
addlicense -f .github/headers/LICENSE .
# Check license headers
make license-headers-lint

The project uses GitHub Actions for continuous integration with the following workflows:
- `lint-test.yml`: Code quality and testing
  - Runs `lint-test` on all modules using standardized Makefile targets
  - Includes health monitors, Go modules, Python linting, shell script validation
  - Uses matrix strategy for parallel execution across components
- `container-build-test.yml`: Container build validation
  - Validates that Docker builds for all modules complete successfully
  - Uses the standardized `docker-build` targets from individual modules
  - Runs on pull requests affecting container-related files
- `e2e-test.yml`: End-to-end testing
  - Sets up a Kind cluster with ctlptl for full integration testing
  - Uses Tilt for deployment and testing
- `publish.yml`: Container image publishing
- `release.yml`: Semantic release automation
# Run the same commands as GitHub Actions locally
make lint-test-all # Matches lint-test.yml workflow
# Individual module CI commands (common.mk patterns)
make -C health-monitors/syslog-health-monitor lint-test
make -C health-monitors/gpu-health-monitor lint-test
make -C platform-connectors lint-test # Uses common.mk patterns
make -C log-collector lint-test # Shell + Python linting
# Container builds (matches container-build-test.yml)
make -C health-monitors/syslog-health-monitor docker-build
make -C platform-connectors docker-build
# Or run individual steps for debugging (common.mk targets)
cd health-monitors/syslog-health-monitor
make vet # go vet ./...
make lint # golangci-lint run
make test # gotestsum with coverage
make coverage # generate coverage reports
# Manual commands (what common.mk executes)
go vet ./...
golangci-lint run --config ../.golangci.yml # Output format configured in .golangci.yml v2
gotestsum --junitfile report.xml -- -race $(go list ./...) -coverprofile=coverage.txt -covermode atomic

The CI environment uses:
- Consistent tool versions managed in `.versions.yaml`
- Shared build environment setup via `.github/actions/setup-build-env`
- Artifact uploads for test results and coverage reports
- Private repository access handled via SSH keys
1. Tilt Debugging

   # Start Tilt with Makefile (recommended)
   make tilt-up
   # Navigate to http://localhost:10350
   # Or run Tilt in CI mode (no UI, good for debugging)
   make tilt-ci
   # Stream logs for specific service
   kubectl logs -f deployment/platform-connectors -n nvsentinel
   # Access Tilt logs and resource status
   tilt get all
   tilt logs platform-connectors

2. gRPC Debugging

   # Use grpcurl to test endpoints
   grpcurl -plaintext localhost:50051 list
   grpcurl -plaintext localhost:50051 platformconnector.PlatformConnector/SendHealthEvent

3. Module Dependencies

   # Clean module cache if dependency issues arise
   go clean -modcache
   go mod download

4. Private Repository Access

   # Verify SSH key configuration
   ssh -T git@github.com
   # Test access
   git ls-remote git@github.com:dgxcloud/mk8s/k8s-addons/nvsentinel.git

5. Container Build Issues

   # Clean Docker cache
   docker system prune -f
   # Rebuild without cache
   docker build --no-cache -t platform-connectors platform-connectors/

6. Shellcheck Version Differences (Log Collector)

   # GitHub Actions uses a specific shellcheck version from setup-build-env
   # Local shellcheck version may differ, causing different linting results
   # Use standardized linting (matches GitHub Actions):
   make -C log-collector lint-test # Standardized pattern
   make log-collector-lint # Main Makefile delegation
   # Install shellcheck locally to match CI:
   # macOS: brew install shellcheck
   # Ubuntu: apt-get install shellcheck
   # See: https://github.com/koalaman/shellcheck#installing
The scripts/setup-dev-env.sh script installs all development dependencies. If you encounter issues:
# Run with detailed debugging output
DEBUG=true make dev-env-setup AUTO_MODE=true
# Or run the script directly
DEBUG=true AUTO_MODE=true ./scripts/setup-dev-env.sh

Debug output includes:
- Architecture detection (x86_64 → amd64, aarch64 → arm64)
- Architecture mappings for different tools (GO_ARCH vs PROTOC_ARCH)
- Complete URLs being constructed for downloads
- HTTP response codes from URL verification
- Version information from `.versions.yaml`
1. Download Failures (404 errors)
If you see errors like "HTTP 404 Not Found":
# Enable debug mode to see the exact URL
DEBUG=true ./scripts/setup-dev-env.sh
# Verify the URL manually
curl -I "https://github.com/koalaman/shellcheck/releases/download/v0.11.0/shellcheck-v0.11.0.linux.x86_64.tar.xz"
# Check if the release exists on GitHub
# Visit: https://github.com/<owner>/<repo>/releasesCommon causes:
- The version in `.versions.yaml` doesn't exist in GitHub releases
- The architecture-specific filename doesn't match the release assets
- The tool project changed its release naming convention
2. Architecture Mismatch
Different tools use different architecture naming:
- Go tools (yq, kubectl, helm): `amd64`, `arm64`, `darwin`
- Protocol Buffers: `x86_64`, `aarch_64`, `osx-universal_binary`
- Shellcheck: `x86_64`, `aarch64`, `darwin`
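A mapping of this kind can be sketched in shell (the variable names mirror the debug output; the real script's internals may differ):

```shell
# Illustrative mapping from "uname -m" output to per-tool arch names.
raw_arch="x86_64"   # in the real script: raw_arch="$(uname -m)"
case "$raw_arch" in
  x86_64)  GO_ARCH="amd64"; PROTOC_ARCH="x86_64"   ;;
  aarch64) GO_ARCH="arm64"; PROTOC_ARCH="aarch_64" ;;
esac
echo "Raw ARCH: $raw_arch, GO_ARCH: $GO_ARCH, PROTOC_ARCH: $PROTOC_ARCH"
```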
The script automatically maps these:
# See mappings in debug output
DEBUG=true ./scripts/setup-dev-env.sh
# Look for: "DEBUG: Architecture mappings: Raw ARCH: x86_64, GO_ARCH: amd64, PROTOC_ARCH: x86_64"

3. Permission Issues
# If installation to /usr/local/bin fails
sudo make dev-env-setup AUTO_MODE=true
# Or install to user directory (requires PATH modification)
mkdir -p ~/bin
export PATH="$HOME/bin:$PATH"
# Modify script to use ~/bin instead of /usr/local/bin

4. Network/Proxy Issues
# Test connectivity to GitHub
curl -I https://github.com
# If behind proxy, configure:
export HTTP_PROXY="http://proxy.example.com:8080"
export HTTPS_PROXY="http://proxy.example.com:8080"
export NO_PROXY="localhost,127.0.0.1"
# Then retry
DEBUG=true make dev-env-setup AUTO_MODE=true

5. Version Validation
# Check what versions are configured
cat .versions.yaml
# Verify specific tool version exists
TOOL_VERSION=$(yq eval '.SHELLCHECK_VERSION' .versions.yaml)
echo "Checking shellcheck version: $TOOL_VERSION"
curl -I "https://github.com/koalaman/shellcheck/releases/tag/$TOOL_VERSION"
# Update version if needed
yq eval '.SHELLCHECK_VERSION = "v0.11.0"' -i .versions.yaml

To test URL construction without running the full setup:
# Source the script functions
source scripts/setup-dev-env.sh
# Test specific tool URLs
echo "yq URL: $YQ_URL"
echo "kubectl URL: $KUBECTL_URL"
echo "protoc URL: $PROTOC_URL"
echo "shellcheck URL: $SHELLCHECK_URL"
# Test URL accessibility
curl -I "$SHELLCHECK_URL"

1. Verify the tool exists in releases:

   # Check GitHub releases page
   open "https://github.com/koalaman/shellcheck/releases"
   # Or use the API
   curl -s "https://api.github.com/repos/koalaman/shellcheck/releases/latest" | grep browser_download_url

2. Test the download manually:

   # Download a specific asset
   wget "https://github.com/koalaman/shellcheck/releases/download/v0.11.0/shellcheck-v0.11.0.linux.x86_64.tar.xz"
   # Verify archive integrity (the archive is xz-compressed, so use -J, not -z)
   tar -tJf shellcheck-v0.11.0.linux.x86_64.tar.xz

3. Validate script syntax:

   # Check for syntax errors
   bash -n scripts/setup-dev-env.sh
   # Run shellcheck on the script itself
   shellcheck scripts/setup-dev-env.sh
When reporting setup script issues, include:
1. Debug output:

   DEBUG=true make dev-env-setup AUTO_MODE=true 2>&1 | tee setup-debug.log

2. Environment details:

   echo "OS: $(uname -s)"
   echo "Architecture: $(uname -m)"
   echo "Shell: $SHELL"
   echo "Bash version: $BASH_VERSION"

3. Version file content:

   cat .versions.yaml

4. Failed URL (from debug output). Look for lines like:

   ❌ ERROR: Failed to verify URL: https://...
   HTTP Status: 404 Not Found
This information helps diagnose whether the issue is:
- Version-specific (tool version doesn't exist)
- Architecture-specific (wrong filename for platform)
- Network-related (connectivity or proxy issues)
- Script bug (incorrect URL construction logic)
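As a rough triage aid, the HTTP status in that debug output already narrows things down. The helper below is a hypothetical sketch (it is not part of `scripts/setup-dev-env.sh`) that maps a status code to the most likely of the categories above:

```bash
#!/bin/sh
# Hypothetical triage helper -- not part of the setup script.
# Maps the "HTTP Status" value from the debug output to a likely root cause.
classify_status() {
    case "$1" in
        404)   echo "version/arch: the release asset does not exist" ;;
        000)   echo "network: no response (connectivity or proxy)" ;;
        2*|3*) echo "reachable: suspect the script's install logic instead" ;;
        *)     echo "other: unexpected HTTP status $1" ;;
    esac
}

classify_status 404   # → version/arch: the release asset does not exist
```

The `000` case follows curl's convention of reporting status `000` when no HTTP response was received at all.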
The project uses a unified Makefile structure with shared patterns for consistency:
The root `Makefile` acts as the primary coordinator, delegating to specialized sub-Makefiles:
```bash
make help                            # Show all available targets
make lint-test-all                   # Run full test suite (delegates to all modules)
make health-monitors-lint-test-all   # Delegate to health-monitors/Makefile
make docker-all                      # Delegate to docker/Makefile
make dev-env                         # Delegate to dev/Makefile
make kubernetes-distro-lint          # Delegate to distros/kubernetes/Makefile
```

`common.mk` provides shared build/test/Docker patterns for all Go modules:
```bash
# Included by all Go modules with: include ../common.mk
# Provides consistent targets:
all                                        # Default target: lint-test
lint-test                                  # Full lint and test (matches CI)
vet, lint, test, coverage, build, binary   # Individual steps
docker-build, docker-publish               # Docker targets (if HAS_DOCKER=1)
setup-buildx, clean, help                  # Utility targets
```

The `health-monitors/Makefile` coordinates all health monitoring modules:
```bash
make -C health-monitors help                  # Show health monitor targets
make -C health-monitors lint-test-all         # Test all health monitors
make -C health-monitors go-lint-test-all      # Test Go health monitors
make -C health-monitors python-lint-test-all  # Test Python health monitors
make -C health-monitors build-all             # Build all health monitors
```

Each Go module includes `common.mk` for consistent patterns:
```bash
# All Go modules have an identical interface via common.mk
make -C health-monitors/syslog-health-monitor help       # Help target
make -C health-monitors/syslog-health-monitor lint-test  # Full lint-test
make -C platform-connectors lint-test                    # Same pattern
make -C health-events-analyzer lint-test                 # Same pattern

# Individual development steps (common.mk patterns)
make -C health-monitors/syslog-health-monitor vet       # go vet ./...
make -C health-monitors/syslog-health-monitor lint      # golangci-lint run
make -C health-monitors/syslog-health-monitor test      # gotestsum
make -C health-monitors/syslog-health-monitor coverage  # coverage reports
make -C health-monitors/syslog-health-monitor build     # go build ./...
make -C health-monitors/syslog-health-monitor binary    # go build main binary
```

The `docker/Makefile` provides a delegation-based Docker build system:
```bash
make -C docker help          # Show all Docker targets and configuration

# Main build targets (delegates to individual modules)
make -C docker build-all     # Build all images (delegates to modules)
make -C docker publish-all   # Build and push all images
make -C docker setup-buildx  # Set up Docker buildx builder

# Group targets
make -C docker build-health-monitors  # Build all health monitor images
make -C docker build-main-modules     # Build all non-health-monitor images

# Individual module targets (delegates to module Makefiles)
make -C docker build-syslog-health-monitor     # Calls module's docker-build
make -C docker build-csp-health-monitor        # Calls module's docker-build
make -C docker build-gpu-health-monitor-dcgm3  # Calls module's docker-build-dcgm3
make -C docker build-gpu-health-monitor-dcgm4  # Calls module's docker-build-dcgm4
make -C docker build-platform-connectors       # Calls module's docker-build
make -C docker build-health-events-analyzer    # Calls module's docker-build
make -C docker build-log-collector             # Calls module's docker-build

# Publish targets (delegates to modules)
make -C docker publish-syslog-health-monitor   # Calls module's docker-publish
make -C docker publish-all                     # Calls all modules' docker-publish

# Utility targets
make -C docker clean   # Remove all nvsentinel images
make -C docker list    # List built nvsentinel images
```

Key Features:
- Delegation-based: Each module is the single source of truth for its Docker config
- Multi-platform builds: `linux/arm64`, `linux/amd64` via `common.mk`
- Build caching: Registry-based cache for faster builds
- Consistent patterns: Go modules use `common.mk`, specialized for Python/shell
- Dynamic tagging: Uses `${SAFE_REF_NAME}` from branch/tag names
- Registry integration: Full NVCR.io paths and authentication
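On dynamic tagging: Docker tags cannot contain `/`, so branch names like `feature/my-change` must be sanitized before use as image tags. The snippet below is only a hedged sketch of that idea — the actual derivation lives in `common.mk`, and the `tr`-based rule and `ref_name` variable here are assumptions, not the project's exact logic:

```bash
# Hypothetical sketch: the real SAFE_REF_NAME logic lives in common.mk.
# Replace '/' (invalid in Docker tags) with '-'.
ref_name="feature/add-new-monitor"
SAFE_REF_NAME=$(printf '%s' "$ref_name" | tr '/' '-')
echo "$SAFE_REF_NAME"   # → feature-add-new-monitor
```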
The `dev/Makefile` is focused on the development environment:
```bash
make -C dev help            # Show development targets
make -C dev env-up          # Create cluster + start Tilt
make -C dev env-down        # Stop Tilt + delete cluster
make -C dev cluster-create  # Create Kind cluster
make -C dev tilt-up         # Start Tilt
make -C dev cluster-status  # Check cluster status
```

The `distros/kubernetes/Makefile` handles Helm and Kubernetes operations:
```bash
make -C distros/kubernetes help          # Show Kubernetes targets
make -C distros/kubernetes lint          # Lint Helm charts
make -C distros/kubernetes helm-publish  # Publish Helm chart
```

Common workflow examples:

```bash
# 1. Full development cycle
make dev-env        # Start development environment
make lint-test-all  # Test all modules
make docker-all     # Build containers (delegates to modules)
make dev-env-clean  # Clean up

# 2. Individual module development (common.mk patterns)
make platform-connectors-lint-test        # Test specific Go module (main Makefile)
make -C platform-connectors lint-test     # Test directly (common.mk pattern)
make -C platform-connectors docker-build  # Build container (common.mk)

# 3. Focused development on a specific module (common.mk targets)
cd platform-connectors
make lint-test  # Full module test
make vet        # Quick syntax check
make test       # Run tests only
make build      # Build module
make binary     # Build main binary

# 4. Health monitors (coordination + common.mk patterns)
make health-monitors-lint-test-all                       # All health monitors
make -C health-monitors/syslog-health-monitor lint-test  # Specific health monitor
```

All Go modules use consistent patterns via `common.mk`:
- Consistent targets: `lint-test`, `vet`, `lint`, `test`, `build`, `binary`
- Docker integration: `docker-build`, `docker-publish` (if `HAS_DOCKER=1`)
- Unified configuration: Same environment variables and build flags
- Backwards compatibility: Legacy targets (`image`, `publish`) still work
- For Go modules:

  ```bash
  cd your-module/
  go get github.com/new/dependency@v1.2.3
  go mod tidy
  ```

- For Python modules:

  ```bash
  cd health-monitors/gpu-health-monitor/
  poetry add new-dependency
  ```
- Edit `.proto` files in the `protobufs/` directory
- Regenerate code: `make protos-lint`
- Update affected modules and test

For configuration changes:

- Update Helm values in `distros/kubernetes/nvsentinel/values.yaml`
- Update templates in `distros/kubernetes/nvsentinel/templates/`
- Update module code to read the new configuration
- Test with Tilt or a manual Helm install
```go
// Enable pprof in Go applications
import _ "net/http/pprof"
```

```bash
# Access profiles
go tool pprof http://localhost:6060/debug/pprof/profile
go tool pprof http://localhost:6060/debug/pprof/heap
```

- Never break backward compatibility
- Add fields with default values
- Use MongoDB schema validation if needed
- Test with existing data
- Start Small: Make incremental changes
- Test Early: Write tests alongside code
- Document Changes: Update relevant documentation
- Review Dependencies: Minimize external dependencies
- Monitor Resources: Be aware of CPU/memory usage
- Resource Limits: Always set resource requests/limits
- Health Checks: Implement readiness and liveness probes
- Graceful Shutdown: Handle SIGTERM properly
- Security Context: Run with minimal privileges
- Observability: Emit metrics and structured logs
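On graceful shutdown specifically: Kubernetes sends SIGTERM first and only escalates to SIGKILL after `terminationGracePeriodSeconds`. The entrypoint-style sketch below is illustrative only — the real services handle SIGTERM in their own runtime — but it shows the trap-and-cleanup shape the bullet refers to:

```bash
#!/bin/sh
# Illustrative only: trap SIGTERM so cleanup can run before SIGKILL arrives.
running=1
shutdown() {
    echo "SIGTERM received, cleaning up"
    running=0
}
trap shutdown TERM

# Simulate the kubelet sending SIGTERM after one second.
( sleep 1; kill -TERM $$ ) &

while [ "$running" -eq 1 ]; do
    sleep 1
done
echo "graceful shutdown complete"
```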
- Indexes: Create appropriate indexes for queries
- Connection Pooling: Reuse connections efficiently
- Change Streams: Use resume tokens for reliability
- Error Handling: Handle network partitions gracefully
🎯 Usage Examples:
Local Development Workflow:
```bash
# Build for local testing (loads into local Docker daemon)
make -C docker build-syslog-health-monitor  # Individual module
make -C docker build-all                    # All modules
make -C health-monitors/gpu-health-monitor docker-build-dcgm3  # Specific variant

# Test the built images locally
docker run ghcr.io/your-github-username/nvsentinel-syslog-health-monitor:local
```

CI/Production Workflow:
```bash
# Environment setup (matches GitHub Actions)
export CONTAINER_REGISTRY="ghcr.io"
export CONTAINER_ORG="your-github-username"
export CI_COMMIT_REF_NAME="main"
# Authentication is handled by docker login to ghcr.io

# Build and push directly to the registry (standardized patterns)
make -C docker publish-syslog-health-monitor  # Individual module
make -C docker publish-all                    # All modules
make -C health-monitors/gpu-health-monitor docker-publish  # Both DCGM variants
```

Development vs CI Behavior:
```bash
# Development: Fast local build (recommended)
make -C health-monitors/syslog-health-monitor docker-build-local

# Development: Full-featured build (slower, like CI)
make -C health-monitors/syslog-health-monitor docker-build

# CI/Production: Build and push with --push (standardized)
make -C health-monitors/syslog-health-monitor docker-publish
```

- Internal Documentation: Check module-specific READMEs and `make help` targets
- GitHub Issues: Report bugs and feature requests
- Team Chat: Reach out to the development team
- Code Reviews: Learn from feedback on pull requests
- Makefile Help: Use `make help` in any module for target documentation
- Common Patterns: All Go modules follow `common.mk` patterns for consistency
Happy coding! 🚀
For questions about this guide or the development process, please reach out to the NVSentinel development team.