This guide covers project structure, development workflows, and best practices for contributing to NVSentinel.
- Getting Started
- Project Architecture
- Development Environment Setup
- Development Workflows
- Module Development
- Testing
- Code Standards
- CI/CD Pipeline
- Debugging
- Makefile Reference
git clone https://github.com/nvidia/nvsentinel.git
cd nvsentinel
make dev-env-setup # Install all dependencies

NVSentinel uses .versions.yaml for centralized version management across:
- Local development
- CI/CD pipelines
- Container builds
View current versions:
make show-versions

Core Tools (required):
- Go 1.25+ (see `.versions.yaml` for the exact version)
- Docker
- kubectl
- Helm 3.0+
- Protocol Buffers Compiler
- yq - YAML processor for version management
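Since `.versions.yaml` is the single source of truth for tool versions, scripts and CI read versions from it rather than hard-coding them. A minimal sketch of that pattern against a stand-in file (the key names and flat layout here are illustrative, not the real file):

```shell
# Create a stand-in versions file (key names are illustrative)
cat > /tmp/versions-demo.yaml <<'EOF'
GO_VERSION: "1.25.0"
SHELLCHECK_VERSION: "v0.11.0"
EOF

# The project uses yq; for a flat file like this sketch, awk works too:
SHELLCHECK_VERSION=$(awk -F': ' '$1 == "SHELLCHECK_VERSION" { gsub(/"/, "", $2); print $2 }' /tmp/versions-demo.yaml)
echo "$SHELLCHECK_VERSION"
```

In the real repository, `make show-versions` prints the equivalent information.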
Development Tools:
Optional (for local Kubernetes development):
Quick install (installs and configures all tools):
make dev-env-setup

This will:
- Detect your OS (Linux/macOS) and architecture
- Install yq and check for required tools
- Install development and Go tools
- Configure Python gRPC tools
Automated installation: To skip interactive prompts and auto-install all dependencies:
make dev-env-setup AUTO_MODE=true

Debugging setup issues: If the setup script fails, enable debug mode for detailed output:
DEBUG=true make dev-env-setup AUTO_MODE=true

This will show:
- Architecture detection and mappings
- URL construction for all downloads
- HTTP response codes for failed downloads
- Detailed error messages with suggestions
Common setup issues and solutions are documented in the Debugging section.
Unified build system features:
- Consistent interface: All modules support common targets (`all`, `lint-test`, `clean`)
- Technology-aware: Appropriate tooling for Go, Python, and shell scripts
- Delegation pattern: Top-level Makefiles delegate to individual modules
- Repo-root context: Docker builds use consistent paths
- Multi-platform support: Built-in `linux/arm64`, `linux/amd64` via Docker buildx
NVSentinel follows a microservices architecture with event-driven communication:
- Independence: Modules operate autonomously
- Event-Driven: Communication through MongoDB change streams
- Modular: Pluggable health monitors
- Cloud-Native: Kubernetes-first design
nvsentinel/
├── health-monitors/ # Hardware/software fault detection
│ ├── gpu-health-monitor/ # Python - DCGM GPU monitoring
│ ├── syslog-health-monitor/ # Go - System log monitoring
│ └── csp-health-monitor/ # Go - Cloud provider monitoring
├── platform-connectors/ # gRPC event ingestion service
├── fault-quarantine/ # CEL-based event quarantine logic
├── fault-remediation/ # Kubernetes controller for remediation
├── health-events-analyzer/ # Event analysis and correlation
├── health-event-client/ # Event streaming client
├── labeler/ # Node labeling controller
├── node-drainer/ # Graceful workload eviction
├── preflight/ # Admission webhook controller (Go)
├── preflight-checks/ # Init container check images
│ ├── dcgm-diag/ # Python - DCGM GPU diagnostics
│ ├── nccl-loopback/ # Go - Single-node NCCL test
│ └── nccl-allreduce/ # Python - Multi-node NCCL test
├── store-client/ # MongoDB interaction library (tested in CI)
└── log-collector/ # Log aggregation (shell scripts)
sequenceDiagram
participant HM as Health Monitor
participant PC as Platform Connectors
participant DB as MongoDB
participant FM as Fault Module
HM->>PC: gRPC health event
PC->>DB: Store event
DB->>FM: Change stream notification
FM->>DB: Query related events
FM->>K8s: Execute remediation action
Tilt provides the fastest development experience with hot reloading.
# Quick start - create cluster and start Tilt in one command
make dev-env # Creates cluster and starts Tilt
# Manual step-by-step approach
make cluster-create # Creates ctlptl-managed Kind cluster with registry
make tilt-up # Starts Tilt with UI (runs: tilt up -f tilt/Tiltfile)
# Check status
make cluster-status # Check cluster and registry status
# View Tilt UI
# Navigate to http://localhost:10350
# Stop everything when done
make dev-env-clean # Stops Tilt and deletes cluster
# Or stop individually
make tilt-down # Stops Tilt (runs: tilt down -f tilt/Tiltfile)
make cluster-delete # Deletes the cluster

ctlptl Cluster Features:
- Declarative cluster configuration with YAML
- Multi-node Kind cluster (3 control-plane, 2 worker nodes)
- Cluster name: `kind-nvsentinel` (the `kind-` prefix is required)
- Integrated local container registry at `localhost:5001`
- Automatic registry configuration for Tilt
- Simplified cluster lifecycle management
- No external dependencies beyond Docker, ctlptl, and Kind
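These features map onto a declarative ctlptl manifest. A hypothetical sketch of such a manifest (field values inferred from the feature list above, not copied from the repo; the actual cluster also defines the 3 control-plane / 2 worker node topology):

```yaml
# Hypothetical ctlptl manifest (illustrative; see the repo for the real file)
apiVersion: ctlptl.dev/v1alpha1
kind: Cluster
product: kind
name: kind-nvsentinel     # ctlptl requires the kind- prefix for Kind clusters
registry: ctlptl-registry # local registry, exposed e.g. at localhost:5001
```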
For module-specific development without full cluster:
# Set up Go environment
export GOPATH=$(go env GOPATH)
export GO_CACHE_DIR=$(go env GOCACHE)
# Install development dependencies
go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest
go install gotest.tools/gotestsum@latest
go install github.com/boumenot/gocover-cobertura@latest
# For controller modules
go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest

Go module dependencies are handled automatically:
# Dependencies managed via go.mod files with replace directives for local development
# No manual GOPRIVATE configuration needed
# Private repository authentication handled via SSH keys

1. Start Development Session

   git checkout main
   git pull origin main
   git checkout -b feature/your-feature-name
   # Start local development environment
   make dev-env # Creates ctlptl-managed cluster and starts Tilt

2. Develop with Live Reload

   # Edit code - Tilt automatically rebuilds and redeploys
   vim health-monitors/syslog-health-monitor/pkg/monitor/monitor.go
   # View logs in Tilt UI at http://localhost:10350
   # Or use kubectl for specific logs
   # (note: syslog-health-monitor runs as a DaemonSet with -regular and -kata variants)
   kubectl logs -f daemonset/nvsentinel-syslog-health-monitor-regular -n nvsentinel

3. Test Changes

   # Run tests locally (while Tilt is running)
   make health-monitors-lint-test-all # All health monitors
   make health-events-analyzer-lint-test # Specific Go module
   make platform-connectors-lint-test # Another Go module
   # Or run individual module tests directly (using standardized targets)
   make -C health-monitors/syslog-health-monitor lint-test
   make -C platform-connectors lint-test
   make -C health-events-analyzer lint-test
   # Preflight E2E (cluster with preflight enabled, e.g. after make dev-env)
   # cd tests && go test -tags=amd64_group -run TestPreflightEndToEnd -v ./...
   # Test integration with other services via Tilt UI
   # Access services via port-forwards set up by Tilt

4. Validate Before Commit

   # Run full test suite
   make lint-test-all
   # Stop Tilt for final testing if needed
   make tilt-down

5. Commit and Push

   git add .
   git commit -s -m "feat: add new monitoring capability"
   git push origin feature/your-feature-name
   # Clean up development environment
   make dev-env-clean
When modifying .proto files:
# Generate protobuf files
make protos-lint
# This runs:
# - protoc generation for Go modules
# - Python protobuf generation for GPU monitor
# - Import path fixes for Python
# - Git diff check to ensure files are up to date

The project provides a unified Docker build system with consistent patterns across all modules. All builds support multi-platform architecture, build caching, and proper context management.
Set these for production-like builds:
# Docker configuration (standardized across all modules)
export CONTAINER_REGISTRY="ghcr.io"
export CONTAINER_ORG="your-github-username" # Defaults to repository owner
export CI_COMMIT_REF_NAME="feature-branch" # Or your branch name
# These are computed automatically by common.mk:
# SAFE_REF_NAME=$(echo $CI_COMMIT_REF_NAME | sed 's/\//-/g')
# PLATFORMS="linux/arm64,linux/amd64"
# MODULE_NAME=$(basename $(CURDIR))

Build System Overview
The Docker build system uses shared patterns via common.mk for Go modules, with specialized handling for Python and container-only modules. Each module maintains its own Docker configuration.
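One of the computed variables, `SAFE_REF_NAME`, can be reproduced by hand to see what tag a given branch produces (the branch name below is just an example):

```shell
# SAFE_REF_NAME replaces "/" with "-" so branch names are valid Docker tags
CI_COMMIT_REF_NAME="feature/your-feature-name"
SAFE_REF_NAME=$(echo "$CI_COMMIT_REF_NAME" | sed 's/\//-/g')
echo "$SAFE_REF_NAME"   # feature-your-feature-name
```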
Main build targets (delegated to individual modules):
# Local Development (--load) - builds images into local Docker daemon
make docker-all # All images locally (delegates to docker/Makefile)
make docker-health-monitors # All health monitor images locally
make docker-main-modules # All non-health-monitor images locally
# CI/Production (--push) - builds and pushes directly to registry
make docker-publish-all # Build and push all images to registry
make docker-publish-health-monitors # Build and push health monitor images
make docker-publish-main-modules # Build and push main module images
# Individual module targets (via common.mk)
make docker-syslog-health-monitor # Build syslog health monitor locally
make docker-publish-syslog-health-monitor # Build and push to registry
make docker-platform-connectors # Build platform connectors locally
make docker-publish-platform-connectors # Build and push to registry
# Special cases
make docker-gpu-health-monitor # Both DCGM 3.x and 4.x versions locally
make docker-log-collector # Container-only module (shell + Python)

Direct docker/ Makefile usage:
cd docker
# Local development builds (--load)
make build-all # Build all 12 images locally
make build-health-monitors # Build health monitor group locally
make build-syslog-health-monitor # Build specific module locally
# CI/production builds (--push)
make publish-all # Build and push all images to registry
make publish-syslog-health-monitor # Build and push specific image to registry
# Utility commands
make setup-buildx # Setup multi-platform builder
make clean # Remove all nvsentinel images
make list # List built nvsentinel images
make help # Show all available targets

Individual module usage:
# Go modules (common.mk patterns)
make -C health-monitors/syslog-health-monitor docker-build # Local build with remote cache
make -C health-monitors/syslog-health-monitor docker-build-local # Local build, no remote cache (faster)
make -C health-monitors/syslog-health-monitor docker-publish # CI build
make -C platform-connectors docker-build # Local build with remote cache
make -C platform-connectors docker-build-local # Local build, no remote cache (faster)
make -C platform-connectors docker-publish # CI build
make -C health-events-analyzer docker-build-local # Local build, no remote cache (faster)
make -C health-events-analyzer docker-publish # CI build
# Python module (specialized patterns)
make -C health-monitors/gpu-health-monitor docker-build-dcgm3 # DCGM 3.x local
make -C health-monitors/gpu-health-monitor docker-publish-dcgm4 # DCGM 4.x CI
# Container-only module (shell + Python)
make -C log-collector docker-build-log-collector # Local build
make -C log-collector docker-publish-log-collector # CI build

Each module provides Docker targets with common patterns:
# Go modules (common.mk patterns)
make -C health-monitors/syslog-health-monitor docker-build # Local with remote cache
make -C health-monitors/syslog-health-monitor docker-build-local # Local, no remote cache (recommended)
make -C health-monitors/syslog-health-monitor docker-publish # CI/production
make -C platform-connectors docker-build # Local with remote cache
make -C platform-connectors docker-build-local # Local, no remote cache (recommended)
make -C platform-connectors docker-publish # CI/production
# Python module (specialized patterns)
make -C health-monitors/gpu-health-monitor docker-build-dcgm3 # DCGM 3.x local
make -C health-monitors/gpu-health-monitor docker-build-dcgm4 # DCGM 4.x local
make -C health-monitors/gpu-health-monitor docker-publish # Push both versions
# Container-only module (shell + Python)
make -C log-collector docker-build # Both log-collector and file-server-cleanup
make -C log-collector docker-publish # Push both components
# Legacy compatibility (all modules)
make -C [module] image # Calls docker-build
make -C [module] publish # Calls docker-publish

All builds support consistent features:
- Multi-Platform Support: `linux/arm64`, `linux/amd64` via `common.mk` (`docker-build`); `linux/amd64` for `docker-build-local`
- Build Caching: Registry-based build cache for faster builds (`docker-build`); local cache only (`docker-build-local`)
- Repo-Root Context: All builds use a consistent repo-root context
- Dynamic Tagging: Uses the branch/tag name (`${SAFE_REF_NAME}`) for `docker-build`; simple `module:local` for `docker-build-local`
- Registry Integration: NVCR.io registry paths for `docker-build` and `docker-publish`
- Module Auto-Detection: Automatic module name detection via `$(MODULE_NAME)`
For Local Development, use docker-build-local to avoid registry authentication issues and build faster.
Local Development:
# Recommended: Fast local build (single platform, no remote cache)
make -C health-monitors/syslog-health-monitor docker-build-local
# Alternative: Local build with remote cache (multi-platform, slower)
make -C health-monitors/syslog-health-monitor docker-build
# Legacy: Quick local build
make -C health-monitors/syslog-health-monitor image

CI-like Build:
# Set up environment like GitHub Actions
export CONTAINER_REGISTRY="ghcr.io"
export CONTAINER_ORG="your-github-username"
export CI_COMMIT_REF_NAME="main"
# Build all images with full CI features (standardized)
make docker-all
# Images will be tagged like:
# ghcr.io/your-github-username/syslog-health-monitor:main
# ghcr.io/nvidia/nvsentinel/gpu-health-monitor:main-dcgm-3.x
# ghcr.io/nvidia/nvsentinel/gpu-health-monitor:main-dcgm-4.x

Testing Specific Module:
# Recommended: Build and test individual module (fast, local)
make -C platform-connectors docker-build-local
docker run --rm platform-connectors:local --help
# Alternative: Build with full CI features
make docker-platform-connectors
docker run --rm ghcr.io/nvidia/nvsentinel/platform-connectors:fix-make-file-targets --help
# Build private repo module (fast, local)
make -C health-events-analyzer docker-build-local

The build system uses the Docker BuildKit registry cache:
- First build: Downloads and caches layers
- Subsequent builds: Reuses cached layers for 10x+ speed improvement
- Multi-developer: Cache shared across team via registry
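This behavior comes from BuildKit's registry cache exporter. A hedged sketch of the kind of flags involved (the cache ref and image name are illustrative, not the project's actual values; the sketch only prints the command rather than invoking Docker):

```shell
# Illustrative buildx invocation with a registry-backed cache (not executed here).
CACHE_REF="ghcr.io/example-org/nvsentinel-buildcache/my-module"
BUILD_CMD="docker buildx build \
  --cache-from=type=registry,ref=${CACHE_REF} \
  --cache-to=type=registry,ref=${CACHE_REF},mode=max \
  --platform=linux/amd64,linux/arm64 \
  -t my-module:dev ."
echo "$BUILD_CMD"
```

`mode=max` exports cache metadata for all layers, which is what lets teammates reuse intermediate stages, not just the final image layers.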
Build failures:
# Check buildx setup
make -C docker setup-buildx
# Clean and retry
make -C docker clean
docker system prune -f
make docker-syslog-health-monitor

Private repo access:
# Verify SSH key access
git ls-remote git@github.com:dgxcloud/mk8s/some-private-repo.git
# Build with debug output
BUILDKIT_PROGRESS=plain make docker-csp-health-monitor

Registry issues:
# Test registry login
docker login nvcr.io -u '$oauthtoken' -p "$NGC_PASSWORD"
# Check image tags
make -C docker list

Problem: On macOS with Docker Desktop, Unix domain sockets require the /var/run directory to exist inside containers, but this directory is not created by default in minimal container images.
Symptoms:
- Services fail to start with errors like `failed to listen on unix socket /var/run/nvsentinel.sock: no such file or directory`
- Tilt-based tests fail on macOS but pass on Linux
- gRPC Unix socket connections fail
Solution: The project includes a Tilt-specific Helm values file that creates the /var/run directory using an initContainer:
# File: distros/kubernetes/nvsentinel/values-tilt-socket.yaml
#
# This values file is automatically included when running Tilt on macOS/Docker Desktop.
# It adds an initContainer to create /var/run directory for Unix socket communication.
global:
initContainers:
- name: create-run-dir
image: busybox:latest
command: ['sh', '-c', 'mkdir -p /var/run']
volumeMounts:
- name: socket-dir
mountPath: /var/run

How it works:
- The `tilt/Tiltfile` automatically includes `values-tilt-socket.yaml` for local development
- The initContainer runs before each service starts and creates the `/var/run` directory
- Services can then create Unix sockets at `/var/run/nvsentinel.sock`
- The socket directory is shared via an `emptyDir` volume mount
Platform-specific behavior:
- macOS/Docker Desktop: Requires the initContainer workaround (automatically applied in Tilt)
- Linux: The `/var/run` directory typically exists in the container runtime environment
- Production/Kubernetes: Uses standard Helm values without the initContainer (not needed)
Note: This is a development-only workaround for local macOS environments. Production deployments on Linux do not require this configuration.
1. Create Module Structure

   mkdir -p health-monitors/my-monitor/{cmd,pkg,internal}
   cd health-monitors/my-monitor

2. Initialize Go Module

   go mod init github.com/nvidia/nvsentinel/health-monitors/my-monitor

3. Create Module Makefile

   # Copy template from existing health monitor
   cp ../syslog-health-monitor/Makefile ./Makefile
   # Update module-specific settings
   sed -i 's/syslog-health-monitor/my-monitor/g' Makefile
   sed -i 's/Syslog Health Monitor/My Monitor/g' Makefile

4. Implement gRPC Client

   // pkg/monitor/monitor.go
   package monitor

   import (
       "context"

       pb "github.com/nvidia/nvsentinel/platform-connectors/pkg/protos"
   )

   type Monitor struct {
       client pb.PlatformConnectorClient
   }

   func (m *Monitor) SendEvent(ctx context.Context, event *pb.HealthEvent) error {
       _, err := m.client.SendHealthEvent(ctx, event)
       return err
   }

5. Update health-monitors/Makefile

   # Add your module to the health monitors list
   # Edit health-monitors/Makefile:
   # - Add 'my-monitor' to the GO_HEALTH_MONITORS list
   # - Add lint-test, build, and clean delegation targets

6. Test Your Module

   # Test the individual module
   make -C health-monitors/my-monitor lint-test
   # Test via health-monitors coordination
   make -C health-monitors lint-test-my-monitor
   # Test via main Makefile delegation
   make health-monitors-lint-test-all

7. Add to CI Pipeline

   The module will automatically be included in GitHub Actions workflows thanks to the standardized patterns.
1. Follow Kubernetes Controller Pattern

   # Use controller-runtime for Kubernetes controllers
   go get sigs.k8s.io/controller-runtime

2. Implement MongoDB Change Streams

   // Use store-client for MongoDB operations
   import "github.com/nvidia/nvsentinel/store-client/pkg/client"

3. Add Proper RBAC

   Create Kubernetes RBAC manifests in distros/kubernetes/nvsentinel/templates/.
Preflight checks run as init containers injected by the preflight webhook. Each check is an independent container image under preflight-checks/. Checks are language-agnostic — the only requirement is a container that reads configuration from environment variables, runs a diagnostic, and on failure sends a HealthEventOccurredV1 gRPC call to the platform connector Unix socket (PLATFORM_CONNECTOR_SOCKET). Existing checks use Go and Python, but any language with gRPC support works.
The webhook reads the initContainers list from its config (sourced from the Helm chart preflight.initContainers). Every container in that list is injected as an init container into GPU pods in labeled namespaces. The webhook automatically provides:
- `NODE_NAME` (downward API), `PLATFORM_CONNECTOR_SOCKET`, and `PROCESSING_STRATEGY` on all init containers
- `DCGM_DIAG_LEVEL` and `DCGM_HOSTENGINE_ADDR` on the container named `preflight-dcgm-diag`
- `GANG_ID`, `GANG_CONFIG_DIR`, `GANG_TIMEOUT_SECONDS`, and `POD_NAME` on gang-aware containers (when gang coordination is enabled)
Your check reads its configuration from environment variables and exits with code 0 on success, non-zero on failure.
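That contract can be sketched as a trivial skeleton (everything here is hypothetical: `MY_THRESHOLD`, the hard-coded measurement, and the check logic; a real check would also send the `HealthEventOccurredV1` gRPC call over `PLATFORM_CONNECTOR_SOCKET` before failing):

```shell
# Hypothetical preflight-check skeleton: env-var config in, exit code out.
run_check() {
  threshold="${MY_THRESHOLD:-42}"   # configuration via environment variable
  measured=10                       # stand-in for a real diagnostic result
  if [ "$measured" -gt "$threshold" ]; then
    echo "check failed: measured $measured exceeds threshold $threshold"
    return 1                        # a real check would report the event, then exit 1
  fi
  echo "check passed"
  return 0                          # exit 0 tells the webhook the check succeeded
}
run_check
```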
mkdir -p preflight-checks/my-check
cd preflight-checks/my-check

For example, with Go:
go mod init github.com/nvidia/nvsentinel/preflight-checks/my-check

A preflight check follows this pattern:
- Load configuration from environment variables
- Run the diagnostic
- On failure, send a health event to the platform connector via the Unix domain socket (`PLATFORM_CONNECTOR_SOCKET`) using the `HealthEventOccurredV1` gRPC call, then `exit 1`
- On success, optionally send a healthy event, then `exit 0`
Go example — see preflight-checks/nccl-loopback/main.go for the full pattern:
reporter := health.NewReporter(cfg.ConnectorSocket, cfg.NodeName, cfg.ProcessingStrategy)
// ... run your diagnostic ...
if failed {
reporter.SendEvent(ctx, false, true, "my-check failed: ...", "MY_ERROR_CODE")
os.Exit(1)
}
os.Exit(0)

Python example — see preflight-checks/dcgm-diag/dcgm_diag/__main__.py:
reporter = HealthReporter(socket_path=cfg.connector_socket, node_name=cfg.node_name,
processing_strategy=cfg.processing_strategy)
# ... run your diagnostic ...
if failures:
reporter.send_event(gpu_uuid=uuid, is_healthy=False, is_fatal=True, message="...")
sys.exit(1)
sys.exit(0)

For the gRPC protobuf definitions, generate from data-models/protobufs/health_event.proto (see existing Makefiles for protos-generate targets).
Build a container image from the repo root context:
FROM <base-image>
COPY preflight-checks/my-check/ /app/
# ... install deps, set entrypoint ...

Copy from an existing check and update the module name:
cp ../nccl-loopback/Makefile ./Makefile
# Update IMAGE_NAME, MODULE_NAME, and language-specific settings

Key targets: lint-test, docker-build, docker-publish.
Add the container to initContainers in distros/kubernetes/nvsentinel/charts/preflight/values.yaml:
initContainers:
# ... existing checks ...
- name: preflight-my-check
image: ghcr.io/nvidia/nvsentinel/preflight-my-check:latest
env:
- name: MY_THRESHOLD
value: "42"
volumeMounts:
- name: nvsentinel-socket
mountPath: /var/run

The nvsentinel-socket volume mount gives the init container access to the platform connector Unix socket. The webhook creates this volume automatically.
Add entries in four workflow files:
- `.github/workflows/lint-test.yml`: add your component to the `preflight-checks-lint-test` matrix
- `.github/workflows/container-build-test.yml`: add a `docker-build` matrix entry
- `.github/workflows/publish.yml`: add a `docker-publish` matrix entry
- `.github/workflows/cleanup-untagged-images.yml`: add the image name for cleanup
# Lint and unit test
make -C preflight-checks/my-check lint-test
# Build the image
make -C preflight-checks/my-check docker-build
# Run the E2E suite (cluster must be running with preflight enabled)
cd tests && go test -tags=amd64_group -run TestPreflightEndToEnd -v ./...

For Tilt development, override initContainers in distros/kubernetes/nvsentinel/values-tilt.yaml to include or stub your check.
See Preflight configuration for the full operator guide including per-check env var reference.
- Unit Tests: Test individual functions and methods
- Integration Tests: Test module interactions
- End-to-End Tests: Test complete workflows via CI
The unified Makefile structure provides consistent testing across all modules:
# Test all modules (delegates to all sub-Makefiles)
make lint-test-all # Main Makefile - runs everything
# Test by category
make health-monitors-lint-test-all # All health monitors
make go-lint-test-all # All Go modules (common.mk patterns)
# Test individual modules via delegation (main Makefile)
make health-events-analyzer-lint-test # Go module
make platform-connectors-lint-test # Go module
make store-client-lint-test # Go module
make log-collector-lint-test # Container module
# Test individual modules directly (common.mk patterns)
make -C health-monitors/syslog-health-monitor lint-test # Go module
make -C platform-connectors lint-test # Go module
make -C health-events-analyzer lint-test # Go module
make -C health-monitors/gpu-health-monitor lint-test # Python module
# Use individual targets for development (common.mk)
cd health-monitors/syslog-health-monitor
make vet # Just go vet
make lint # Just golangci-lint
make test # Just tests
make coverage # Tests + coverage
make build # Build module
make binary # Build main binary
# Run specific test with verbose output
cd platform-connectors
go test -v ./pkg/connectors/...

Each module must include:
- Unit tests with the `_test.go` suffix
- Coverage reporting via `go test -coverprofile`
- Integration tests where applicable
- Mocks for external dependencies
# Using the module's Makefile (recommended)
make -C health-monitors/gpu-health-monitor lint-test # Full lint-test
make -C health-monitors/gpu-health-monitor setup # Just Poetry setup
make -C health-monitors/gpu-health-monitor lint # Just Black check
make -C health-monitors/gpu-health-monitor test # Just tests
make -C health-monitors/gpu-health-monitor format # Run Black formatter
# Manual Poetry commands
cd health-monitors/gpu-health-monitor
poetry install
poetry run pytest -v
poetry run black --check .
poetry run coverage run --source=gpu_health_monitor -m pytest

- Linting: Use `golangci-lint` with the project configuration
- Formatting: Use `gofmt` (enforced by linting)
- Imports: Group standard, third-party, and local imports
- Error Handling: Always check and handle errors appropriately
- Context: Pass `context.Context` for cancellation and timeouts
- All tests pass
- Code coverage maintained or improved
- No linting violations
- Proper error handling
- Documentation updated
- License headers present
- Signed commits (`git commit -s`)
All source files must include the Apache 2.0 license header:
# Add license headers to new files
addlicense -f .github/headers/LICENSE .
# Check license headers
make license-headers-lint

The project uses GitHub Actions for continuous integration with the following workflows:
- `lint-test.yml`: Code quality and testing
  - Runs `lint-test` on all modules using standardized Makefile targets
  - Includes health monitors, Go modules, Python linting, shell script validation
  - Uses matrix strategy for parallel execution across components
- `container-build-test.yml`: Container build validation
  - Validates that Docker builds for all modules complete successfully
  - Uses the standardized `docker-build` targets from individual modules
  - Runs on pull requests affecting container-related files
- `e2e-test.yml`: End-to-end testing
  - Sets up a Kind cluster with ctlptl for full integration testing
  - Uses Tilt for deployment and testing
- `publish.yml`: Container image publishing
- `release.yml`: Semantic release automation
# Run the same commands as GitHub Actions locally
make lint-test-all # Matches lint-test.yml workflow
# Individual module CI commands (common.mk patterns)
make -C health-monitors/syslog-health-monitor lint-test
make -C health-monitors/gpu-health-monitor lint-test
make -C platform-connectors lint-test # Uses common.mk patterns
make -C log-collector lint-test # Shell + Python linting
# Container builds (matches container-build-test.yml)
make -C health-monitors/syslog-health-monitor docker-build
make -C platform-connectors docker-build
# Or run individual steps for debugging (common.mk targets)
cd health-monitors/syslog-health-monitor
make vet # go vet ./...
make lint # golangci-lint run
make test # gotestsum with coverage
make coverage # generate coverage reports
# Manual commands (what common.mk executes)
go vet ./...
golangci-lint run --config ../.golangci.yml # Output format configured in .golangci.yml v2
gotestsum --junitfile report.xml -- -race $(go list ./...) -coverprofile=coverage.txt -covermode atomic

The CI environment uses:
- Consistent tool versions managed in `.versions.yaml`
- Shared build environment setup via `.github/actions/setup-build-env`
- Artifact uploads for test results and coverage reports
- Private repository access handled via SSH keys
1. Tilt Debugging

   # Start Tilt with Makefile (recommended)
   make tilt-up
   # Navigate to http://localhost:10350
   # Or run Tilt in CI mode (no UI, good for debugging)
   make tilt-ci
   # Stream logs for specific service
   kubectl logs -f deployment/platform-connectors -n nvsentinel
   # Access Tilt logs and resource status
   tilt get all
   tilt logs platform-connectors

2. gRPC Debugging

   # Use grpcurl to test endpoints
   grpcurl -plaintext localhost:50051 list
   grpcurl -plaintext localhost:50051 platformconnector.PlatformConnector/SendHealthEvent

3. Module Dependencies

   # Clean module cache if dependency issues arise
   go clean -modcache
   go mod download

4. Private Repository Access

   # Verify SSH key configuration
   ssh -T git@github.com
   # Test access
   git ls-remote git@github.com:dgxcloud/mk8s/k8s-addons/nvsentinel.git

5. Container Build Issues

   # Clean Docker cache
   docker system prune -f
   # Rebuild without cache
   docker build --no-cache -t platform-connectors platform-connectors/

6. Shellcheck Version Differences (Log Collector)

   # GitHub Actions uses a specific shellcheck version from setup-build-env
   # Local shellcheck version may differ, causing different linting results
   # Use standardized linting (matches GitHub Actions):
   make -C log-collector lint-test # Standardized pattern
   make log-collector-lint # Main Makefile delegation
   # Install shellcheck locally to match CI:
   # macOS: brew install shellcheck
   # Ubuntu: apt-get install shellcheck
   # See: https://github.com/koalaman/shellcheck#installing
The scripts/setup-dev-env.sh script installs all development dependencies. If you encounter issues:
# Run with detailed debugging output
DEBUG=true make dev-env-setup AUTO_MODE=true
# Or run the script directly
DEBUG=true AUTO_MODE=true ./scripts/setup-dev-env.sh

Debug output includes:
- Architecture detection (x86_64 → amd64, aarch64 → arm64)
- Architecture mappings for different tools (GO_ARCH vs PROTOC_ARCH)
- Complete URLs being constructed for downloads
- HTTP response codes from URL verification
- Version information from `.versions.yaml`
1. Download Failures (404 errors)
If you see errors like "HTTP 404 Not Found":
# Enable debug mode to see the exact URL
DEBUG=true ./scripts/setup-dev-env.sh
# Verify the URL manually
curl -I "https://github.com/koalaman/shellcheck/releases/download/v0.11.0/shellcheck-v0.11.0.linux.x86_64.tar.xz"
# Check if the release exists on GitHub
# Visit: https://github.com/<owner>/<repo>/releasesCommon causes:
- The version in `.versions.yaml` doesn't exist in GitHub releases
- The architecture-specific filename doesn't match the release assets
- The tool project changed its release naming convention
2. Architecture Mismatch
Different tools use different architecture naming:
- Go tools (yq, kubectl, helm): `amd64`, `arm64`, `darwin`
- Protocol Buffers: `x86_64`, `aarch_64`, `osx-universal_binary`
- Shellcheck: `x86_64`, `aarch64`, `darwin`
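A mapping of this kind can be sketched in shell (the variable names mirror the debug output; the real script's internals may differ):

```shell
# Illustrative mapping from "uname -m" output to per-tool arch names.
raw_arch="x86_64"   # in the real script: raw_arch="$(uname -m)"
case "$raw_arch" in
  x86_64)  GO_ARCH="amd64"; PROTOC_ARCH="x86_64"   ;;
  aarch64) GO_ARCH="arm64"; PROTOC_ARCH="aarch_64" ;;
esac
echo "Raw ARCH: $raw_arch, GO_ARCH: $GO_ARCH, PROTOC_ARCH: $PROTOC_ARCH"
```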
The script automatically maps these:
# See mappings in debug output
DEBUG=true ./scripts/setup-dev-env.sh
# Look for: "DEBUG: Architecture mappings: Raw ARCH: x86_64, GO_ARCH: amd64, PROTOC_ARCH: x86_64"

3. Permission Issues
# If installation to /usr/local/bin fails
sudo make dev-env-setup AUTO_MODE=true
# Or install to user directory (requires PATH modification)
mkdir -p ~/bin
export PATH="$HOME/bin:$PATH"
# Modify script to use ~/bin instead of /usr/local/bin

4. Network/Proxy Issues
# Test connectivity to GitHub
curl -I https://github.com
# If behind proxy, configure:
export HTTP_PROXY="http://proxy.example.com:8080"
export HTTPS_PROXY="http://proxy.example.com:8080"
export NO_PROXY="localhost,127.0.0.1"
# Then retry
DEBUG=true make dev-env-setup AUTO_MODE=true

5. Version Validation
# Check what versions are configured
cat .versions.yaml
# Verify specific tool version exists
TOOL_VERSION=$(yq eval '.SHELLCHECK_VERSION' .versions.yaml)
echo "Checking shellcheck version: $TOOL_VERSION"
curl -I "https://github.com/koalaman/shellcheck/releases/tag/$TOOL_VERSION"
# Update version if needed
yq eval '.SHELLCHECK_VERSION = "v0.11.0"' -i .versions.yaml

To test URL construction without running the full setup:
# Source the script functions
source scripts/setup-dev-env.sh
# Test specific tool URLs
echo "yq URL: $YQ_URL"
echo "kubectl URL: $KUBECTL_URL"
echo "protoc URL: $PROTOC_URL"
echo "shellcheck URL: $SHELLCHECK_URL"
# Test URL accessibility
curl -I "$SHELLCHECK_URL"

1. Verify the tool exists in releases:

   # Check GitHub releases page
   open "https://github.com/koalaman/shellcheck/releases"
   # Or use the API
   curl -s "https://api.github.com/repos/koalaman/shellcheck/releases/latest" | grep browser_download_url

2. Test the download manually:

   # Download a specific asset
   wget "https://github.com/koalaman/shellcheck/releases/download/v0.11.0/shellcheck-v0.11.0.linux.x86_64.tar.xz"
   # Verify archive integrity (the archive is xz-compressed, so use -J, not -z)
   tar -tJf shellcheck-v0.11.0.linux.x86_64.tar.xz

3. Validate script syntax:

   # Check for syntax errors
   bash -n scripts/setup-dev-env.sh
   # Run shellcheck on the script itself
   shellcheck scripts/setup-dev-env.sh
When reporting setup script issues, include:
1. Debug output:

   DEBUG=true make dev-env-setup AUTO_MODE=true 2>&1 | tee setup-debug.log

2. Environment details:

   echo "OS: $(uname -s)"
   echo "Architecture: $(uname -m)"
   echo "Shell: $SHELL"
   echo "Bash version: $BASH_VERSION"

3. Version file content:

   cat .versions.yaml

4. Failed URL (from debug output). Look for lines like:

   ❌ ERROR: Failed to verify URL: https://...
   HTTP Status: 404 Not Found
This information helps diagnose whether the issue is:
- Version-specific (tool version doesn't exist)
- Architecture-specific (wrong filename for platform)
- Network-related (connectivity or proxy issues)
- Script bug (incorrect URL construction logic)
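As a rough triage aid, the HTTP status in that debug output already narrows things down. The helper below is a hypothetical sketch (it is not part of `scripts/setup-dev-env.sh`) that maps a status code to the most likely of the categories above:

```bash
#!/bin/sh
# Hypothetical triage helper -- not part of the setup script.
# Maps the "HTTP Status" value from the debug output to a likely root cause.
classify_status() {
    case "$1" in
        404)   echo "version/arch: the release asset does not exist" ;;
        000)   echo "network: no response (connectivity or proxy)" ;;
        2*|3*) echo "reachable: suspect the script's install logic instead" ;;
        *)     echo "other: unexpected HTTP status $1" ;;
    esac
}

classify_status 404   # → version/arch: the release asset does not exist
```

The `000` case follows curl's convention of reporting status `000` when no HTTP response was received at all.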
The project uses a unified Makefile structure with shared patterns for consistency:
The root `Makefile` acts as the primary coordinator, delegating to specialized sub-Makefiles:
```bash
make help                            # Show all available targets
make lint-test-all                   # Run full test suite (delegates to all modules)
make health-monitors-lint-test-all   # Delegate to health-monitors/Makefile
make docker-all                      # Delegate to docker/Makefile
make dev-env                         # Delegate to dev/Makefile
make kubernetes-distro-lint          # Delegate to distros/kubernetes/Makefile
```

`common.mk` provides shared build/test/Docker patterns for all Go modules:
```bash
# Included by all Go modules with: include ../common.mk
# Provides consistent targets:
all                                        # Default target: lint-test
lint-test                                  # Full lint and test (matches CI)
vet, lint, test, coverage, build, binary   # Individual steps
docker-build, docker-publish               # Docker targets (if HAS_DOCKER=1)
setup-buildx, clean, help                  # Utility targets
```

The `health-monitors/Makefile` coordinates all health monitoring modules:
```bash
make -C health-monitors help                  # Show health monitor targets
make -C health-monitors lint-test-all         # Test all health monitors
make -C health-monitors go-lint-test-all      # Test Go health monitors
make -C health-monitors python-lint-test-all  # Test Python health monitors
make -C health-monitors build-all             # Build all health monitors
```

Each Go module includes `common.mk` for consistent patterns:
```bash
# All Go modules have an identical interface via common.mk
make -C health-monitors/syslog-health-monitor help       # Help target
make -C health-monitors/syslog-health-monitor lint-test  # Full lint-test
make -C platform-connectors lint-test                    # Same pattern
make -C health-events-analyzer lint-test                 # Same pattern

# Individual development steps (common.mk patterns)
make -C health-monitors/syslog-health-monitor vet       # go vet ./...
make -C health-monitors/syslog-health-monitor lint      # golangci-lint run
make -C health-monitors/syslog-health-monitor test      # gotestsum
make -C health-monitors/syslog-health-monitor coverage  # coverage reports
make -C health-monitors/syslog-health-monitor build     # go build ./...
make -C health-monitors/syslog-health-monitor binary    # go build main binary
```

The `docker/Makefile` provides a delegation-based Docker build system:
```bash
make -C docker help          # Show all Docker targets and configuration

# Main build targets (delegates to individual modules)
make -C docker build-all     # Build all images (delegates to modules)
make -C docker publish-all   # Build and push all images
make -C docker setup-buildx  # Set up Docker buildx builder

# Group targets
make -C docker build-health-monitors  # Build all health monitor images
make -C docker build-main-modules     # Build all non-health-monitor images

# Individual module targets (delegates to module Makefiles)
make -C docker build-syslog-health-monitor     # Calls module's docker-build
make -C docker build-csp-health-monitor        # Calls module's docker-build
make -C docker build-gpu-health-monitor-dcgm3  # Calls module's docker-build-dcgm3
make -C docker build-gpu-health-monitor-dcgm4  # Calls module's docker-build-dcgm4
make -C docker build-platform-connectors       # Calls module's docker-build
make -C docker build-health-events-analyzer    # Calls module's docker-build
make -C docker build-log-collector             # Calls module's docker-build

# Publish targets (delegates to modules)
make -C docker publish-syslog-health-monitor   # Calls module's docker-publish
make -C docker publish-all                     # Calls all modules' docker-publish

# Utility targets
make -C docker clean   # Remove all nvsentinel images
make -C docker list    # List built nvsentinel images
```

Key Features:
- Delegation-based: Each module is the single source of truth for its Docker config
- Multi-platform builds: `linux/arm64`, `linux/amd64` via `common.mk`
- Build caching: Registry-based cache for faster builds
- Consistent patterns: Go modules use `common.mk`, specialized for Python/shell
- Dynamic tagging: Uses `${SAFE_REF_NAME}` from branch/tag names
- Registry integration: Full NVCR.io paths and authentication
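On dynamic tagging: Docker tags cannot contain `/`, so branch names like `feature/my-change` must be sanitized before use as image tags. The snippet below is only a hedged sketch of that idea — the actual derivation lives in `common.mk`, and the `tr`-based rule and `ref_name` variable here are assumptions, not the project's exact logic:

```bash
# Hypothetical sketch: the real SAFE_REF_NAME logic lives in common.mk.
# Replace '/' (invalid in Docker tags) with '-'.
ref_name="feature/add-new-monitor"
SAFE_REF_NAME=$(printf '%s' "$ref_name" | tr '/' '-')
echo "$SAFE_REF_NAME"   # → feature-add-new-monitor
```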
The `dev/Makefile` is focused on the development environment:
```bash
make -C dev help            # Show development targets
make -C dev env-up          # Create cluster + start Tilt
make -C dev env-down        # Stop Tilt + delete cluster
make -C dev cluster-create  # Create Kind cluster
make -C dev tilt-up         # Start Tilt
make -C dev cluster-status  # Check cluster status
```

The `distros/kubernetes/Makefile` handles Helm and Kubernetes operations:
```bash
make -C distros/kubernetes help          # Show Kubernetes targets
make -C distros/kubernetes lint          # Lint Helm charts
make -C distros/kubernetes helm-publish  # Publish Helm chart
```

Common workflow examples:

```bash
# 1. Full development cycle
make dev-env        # Start development environment
make lint-test-all  # Test all modules
make docker-all     # Build containers (delegates to modules)
make dev-env-clean  # Clean up

# 2. Individual module development (common.mk patterns)
make platform-connectors-lint-test        # Test specific Go module (main Makefile)
make -C platform-connectors lint-test     # Test directly (common.mk pattern)
make -C platform-connectors docker-build  # Build container (common.mk)

# 3. Focused development on a specific module (common.mk targets)
cd platform-connectors
make lint-test  # Full module test
make vet        # Quick syntax check
make test       # Run tests only
make build      # Build module
make binary     # Build main binary

# 4. Health monitors (coordination + common.mk patterns)
make health-monitors-lint-test-all                       # All health monitors
make -C health-monitors/syslog-health-monitor lint-test  # Specific health monitor
```

All Go modules use consistent patterns via `common.mk`:
- Consistent targets: `lint-test`, `vet`, `lint`, `test`, `build`, `binary`
- Docker integration: `docker-build`, `docker-publish` (if `HAS_DOCKER=1`)
- Unified configuration: Same environment variables and build flags
- Backwards compatibility: Legacy targets (`image`, `publish`) still work
- For Go modules:

  ```bash
  cd your-module/
  go get github.com/new/dependency@v1.2.3
  go mod tidy
  ```

- For Python modules:

  ```bash
  cd health-monitors/gpu-health-monitor/
  poetry add new-dependency
  ```
- Edit `.proto` files in the `protobufs/` directory
- Regenerate code: `make protos-lint`
- Update affected modules and test

For configuration changes:

- Update Helm values in `distros/kubernetes/nvsentinel/values.yaml`
- Update templates in `distros/kubernetes/nvsentinel/templates/`
- Update module code to read the new configuration
- Test with Tilt or a manual Helm install
```go
// Enable pprof in Go applications
import _ "net/http/pprof"
```

```bash
# Access profiles
go tool pprof http://localhost:6060/debug/pprof/profile
go tool pprof http://localhost:6060/debug/pprof/heap
```

- Never break backward compatibility
- Add fields with default values
- Use MongoDB schema validation if needed
- Test with existing data
- Start Small: Make incremental changes
- Test Early: Write tests alongside code
- Document Changes: Update relevant documentation
- Review Dependencies: Minimize external dependencies
- Monitor Resources: Be aware of CPU/memory usage
- Resource Limits: Always set resource requests/limits
- Health Checks: Implement readiness and liveness probes
- Graceful Shutdown: Handle SIGTERM properly
- Security Context: Run with minimal privileges
- Observability: Emit metrics and structured logs
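On graceful shutdown specifically: Kubernetes sends SIGTERM first and only escalates to SIGKILL after `terminationGracePeriodSeconds`. The entrypoint-style sketch below is illustrative only — the real services handle SIGTERM in their own runtime — but it shows the trap-and-cleanup shape the bullet refers to:

```bash
#!/bin/sh
# Illustrative only: trap SIGTERM so cleanup can run before SIGKILL arrives.
running=1
shutdown() {
    echo "SIGTERM received, cleaning up"
    running=0
}
trap shutdown TERM

# Simulate the kubelet sending SIGTERM after one second.
( sleep 1; kill -TERM $$ ) &

while [ "$running" -eq 1 ]; do
    sleep 1
done
echo "graceful shutdown complete"
```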
- Indexes: Create appropriate indexes for queries
- Connection Pooling: Reuse connections efficiently
- Change Streams: Use resume tokens for reliability
- Error Handling: Handle network partitions gracefully
🎯 Usage Examples:
Local Development Workflow:
```bash
# Build for local testing (loads into local Docker daemon)
make -C docker build-syslog-health-monitor  # Individual module
make -C docker build-all                    # All modules
make -C health-monitors/gpu-health-monitor docker-build-dcgm3  # Specific variant

# Test the built images locally
docker run ghcr.io/your-github-username/nvsentinel-syslog-health-monitor:local
```

CI/Production Workflow:
```bash
# Environment setup (matches GitHub Actions)
export CONTAINER_REGISTRY="ghcr.io"
export CONTAINER_ORG="your-github-username"
export CI_COMMIT_REF_NAME="main"
# Authentication is handled by docker login to ghcr.io

# Build and push directly to the registry (standardized patterns)
make -C docker publish-syslog-health-monitor  # Individual module
make -C docker publish-all                    # All modules
make -C health-monitors/gpu-health-monitor docker-publish  # Both DCGM variants
```

Development vs CI Behavior:
```bash
# Development: Fast local build (recommended)
make -C health-monitors/syslog-health-monitor docker-build-local

# Development: Full-featured build (slower, like CI)
make -C health-monitors/syslog-health-monitor docker-build

# CI/Production: Build and push with --push (standardized)
make -C health-monitors/syslog-health-monitor docker-publish
```

- Internal Documentation: Check module-specific READMEs and `make help` targets
- GitHub Issues: Report bugs and feature requests
- Team Chat: Reach out to the development team
- Code Reviews: Learn from feedback on pull requests
- Makefile Help: Use `make help` in any module for target documentation
- Common Patterns: All Go modules follow `common.mk` patterns for consistency
Happy coding! 🚀
For questions about this guide or the development process, please reach out to the NVSentinel development team.