This repository documents my progress in mastering CUDA programming and High-Performance Computing (HPC). My goal is to understand the hardware architecture deeply and write highly optimized kernels.
- GPU: NVIDIA GeForce RTX 3070 Laptop GPU
- IDE: Visual Studio 2022
- Toolkit: CUDA 13.1
- Profiler: NVIDIA Nsight Compute / Nsight Systems
| # | Project | Key Concepts | Status |
|---|---|---|---|
| 01 | Vector Addition | Grid-Stride Loop, Unified Memory, Profiling | Done |
| 02 | Matrix Multiplication | Shared Memory, Tiling, Vectorized Access (float4) | Done |
| 03 | Parallel Reduction | Warp Divergence, Loop Unrolling, Volatile, Bank Conflicts | Done |
| 04 | N-Body Simulation | Compute vs Memory Bound, Tiling, Thread Coarsening, Occupancy | Done |
| 05 | Spatial Partitioning | Uniform Grid, Atomic Operations | Integrated into Project 06 |
| 06 | Heterogeneous HPC System | Wireless UDP, CUDA-GL Interop, Swarm Logistics | In Progress |
This project builds a comprehensive control pipeline that bridges Low-level Hardware, System Programming, and High-Performance Computing. It now supports both wired (Bare-metal) and wireless (UDP) telemetry.
The system simulates an Edge Computing environment where an external input node (ESP32 or Arduino) controls a massive particle simulation (
graph LR
subgraph Input_Nodes
A[Wireless: ESP32] -- "UDP (Wi-Fi)" --> B
E[Legacy: Arduino] -- "UART (Serial)" --> B
end
subgraph Host_PC
B[IO Thread: Udp/Serial Reader] -- std::atomic --> C[HPC Core: CUDA Kernel]
C -- Zero-Copy Interop --> D[Render: OpenGL]
end
(Text Representation)
[ESP32/Arduino] --(UDP/UART)--> [IO Thread: Receiver] --(Atomic Memory)--> [HPC Core: CUDA Kernel] --(Interop)--> [Render: OpenGL]
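The `std::atomic` hand-off between the IO thread and the compute loop can be sketched roughly as below. This is a minimal illustration, not the repository's actual code: `TelemetrySlot`, `pack_axes`, and the two-axis 16-bit payload are hypothetical names and an assumed packet layout.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed payload: two 16-bit input axes packed into one 32-bit word, so a
// single atomic store publishes both values together (no torn reads).
inline std::uint32_t pack_axes(std::uint16_t x, std::uint16_t y) {
    return (static_cast<std::uint32_t>(x) << 16) | y;
}
inline std::uint16_t axis_x(std::uint32_t packed) {
    return static_cast<std::uint16_t>(packed >> 16);
}
inline std::uint16_t axis_y(std::uint32_t packed) {
    return static_cast<std::uint16_t>(packed & 0xFFFFu);
}

// Hypothetical shared slot: the IO thread calls publish() for every packet,
// and the simulation loop calls snapshot() once per frame before the kernel launch.
struct TelemetrySlot {
    std::atomic<std::uint32_t> latest{0};

    TelemetrySlot() {
        // The hand-off is only wait-free if the platform provides a lock-free 32-bit atomic.
        assert(latest.is_lock_free());
    }

    void publish(std::uint16_t x, std::uint16_t y) {
        latest.store(pack_axes(x, y), std::memory_order_release);
    }

    std::uint32_t snapshot() const {
        return latest.load(std::memory_order_acquire);
    }
};
```

In this sketch the reader only ever sees the most recent sample; older packets are overwritten rather than queued, which suits a control input but would not suit a protocol where every packet matters.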
- Wireless Modernization: Implemented UDP telemetry via an ESP32-S3 (SoftAP) and C++ WinSock2, removing the need for a wired USB connection.
- HPC Core (CUDA & OpenGL): Zero-copy rendering with Spatial Partitioning (Uniform Grid) for real-time performance.
- Embedded Interface (Bare-metal): Direct register manipulation (`ADMUX`, `UBRR0`) replacing standard Arduino libraries for ultra-low latency.
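The uniform-grid idea behind the spatial partitioning can be sketched as a CPU reference. The names here (`GridSpec`, `cell_index`, `count_per_cell`) are hypothetical and the clamping policy is an assumption. On the GPU, the per-cell increment would typically be an `atomicAdd` issued by one thread per particle.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical grid description: square cells laid out row-major.
struct GridSpec {
    float origin_x, origin_y; // world-space corner of cell (0, 0)
    float cell_size;          // edge length of one cell
    int cols, rows;           // grid dimensions in cells
};

// Map a world position to a flattened cell index.
// Positions outside the grid are clamped to the nearest border cell (an assumption).
inline int cell_index(const GridSpec& g, float x, float y) {
    assert(g.cell_size > 0.0f && g.cols > 0 && g.rows > 0);
    int cx = static_cast<int>(std::floor((x - g.origin_x) / g.cell_size));
    int cy = static_cast<int>(std::floor((y - g.origin_y) / g.cell_size));
    cx = cx < 0 ? 0 : (cx >= g.cols ? g.cols - 1 : cx);
    cy = cy < 0 ? 0 : (cy >= g.rows ? g.rows - 1 : cy);
    return cy * g.cols + cx;
}

// Count particles per cell (serial stand-in for the atomic counting pass).
inline std::vector<int> count_per_cell(const GridSpec& g,
                                       const std::vector<float>& xs,
                                       const std::vector<float>& ys) {
    assert(xs.size() == ys.size());
    std::vector<int> counts(static_cast<std::size_t>(g.cols) * static_cast<std::size_t>(g.rows), 0);
    for (std::size_t i = 0; i < xs.size(); ++i) {
        ++counts[static_cast<std::size_t>(cell_index(g, xs[i], ys[i]))];
    }
    return counts;
}
```

With the counts in hand, a neighbor query only needs to visit the particle's own cell and the adjacent cells instead of every other particle.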
Transforming simple boids into a massive Multi-Agent Pathfinding (MAPF) simulation mimicking thousands of AGVs in a warehouse.
- Environmental Physics: Implemented a potential field using constant memory to handle static obstacles parsed from 2D floor plans. (Done)
- HPC Routing: Achieved $O(1)$ path lookup for massive agent counts by transitioning from A* to Vector Flow Fields stored in GPU constant memory. (Done)
- Local Avoidance: Implementing GPU-accelerated collision avoidance to resolve traffic deadlocks in narrow corridors. (Next)
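The reason a flow field gives each agent a constant-time lookup can be shown with a small host-side reference. `FlowField`, `Vec2`, and `step_agent` are hypothetical names, and the row-major layout is an assumption. In the CUDA version the direction table would live in `__constant__` memory, which is limited to 64 KB in total.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

// Clamp a cell coordinate into [0, n - 1].
inline int clamp_cell(int c, int n) { return c < 0 ? 0 : (c >= n ? n - 1 : c); }

// Hypothetical flow field: one precomputed direction per cell, row-major.
struct FlowField {
    int cols, rows;
    float cell_size;
    std::vector<Vec2> dirs;

    // O(1) per agent: one index computation and one table read,
    // independent of map size or the number of agents.
    Vec2 lookup(float x, float y) const {
        assert(cell_size > 0.0f);
        assert(dirs.size() == static_cast<std::size_t>(cols) * static_cast<std::size_t>(rows));
        int cx = clamp_cell(static_cast<int>(x / cell_size), cols);
        int cy = clamp_cell(static_cast<int>(y / cell_size), rows);
        return dirs[static_cast<std::size_t>(cy) * static_cast<std::size_t>(cols)
                    + static_cast<std::size_t>(cx)];
    }
};

// Advance one agent along the field for a time step dt.
inline Vec2 step_agent(const FlowField& f, Vec2 pos, float speed, float dt) {
    Vec2 d = f.lookup(pos.x, pos.y);
    return Vec2{pos.x + d.x * speed * dt, pos.y + d.y * speed * dt};
}
```

The pathfinding cost moves into building `dirs` once per goal. After that, every agent heading to the same goal shares the table, whereas A* would run a separate search per agent.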
Consolidating standalone projects into a single, cohesive engine framework.
- Framework: Integrating Dear ImGui over the GLFW/OpenGL pipeline.
- System Design: Abstracting simulations into a `Scene` management system.
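One possible shape for such a `Scene` abstraction is sketched below. The interface (`update`, `render`, `SceneManager`, `CountingScene`) is purely hypothetical and only illustrates the idea of swapping simulations behind a common base class. The real design may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: each simulation (boids, N-body, ...) would implement it.
class Scene {
public:
    virtual ~Scene() = default;
    virtual std::string name() const = 0;
    virtual void update(float dt) = 0; // e.g. launch the simulation kernels
    virtual void render() = 0;         // e.g. draw the interop buffer
};

// Owns the scenes and forwards the frame loop to the active one.
class SceneManager {
public:
    std::size_t add(std::unique_ptr<Scene> scene) {
        assert(scene != nullptr);
        scenes_.push_back(std::move(scene));
        return scenes_.size() - 1;
    }
    bool activate(std::size_t index) {
        if (index >= scenes_.size()) return false;
        active_ = index;
        return true;
    }
    Scene* active() const {
        return scenes_.empty() ? nullptr : scenes_[active_].get();
    }
    void tick(float dt) {
        if (Scene* s = active()) { s->update(dt); s->render(); }
    }
private:
    std::vector<std::unique_ptr<Scene>> scenes_;
    std::size_t active_ = 0;
};

// Trivial stand-in scene used only to exercise the manager.
class CountingScene : public Scene {
public:
    explicit CountingScene(std::string n) : name_(std::move(n)) {}
    std::string name() const override { return name_; }
    void update(float dt) override { elapsed_ += dt; }
    void render() override { ++frames_; }
    float elapsed() const { return elapsed_; }
    int frames() const { return frames_; }
private:
    std::string name_;
    float elapsed_ = 0.0f;
    int frames_ = 0;
};
```

A UI layer such as Dear ImGui could then list the registered scene names and call `activate` when the user picks one.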
Expanding the system's hardware abstraction by porting the CUDA-based simulation to the AMD ROCm (HIP) ecosystem.
- Objective: Cross-validate the simulation's throughput across different GPU architectures.