This repository implements a multi-modal 3D semantic segmentation and reconstruction system that fuses LiDAR point clouds and RGB images for autonomous scene understanding.
Building on the Stanford 3D segmentation framework [1], this project replaces DINOv2 [3] with DINOv3 [2] as the visual backbone and evaluates the resulting system on the nuScenes dataset [4].
By leveraging DINOv3’s enhanced visual embeddings and zero-shot capabilities, the system aims to achieve more accurate, generalizable, and efficient 3D semantic mapping.
- Fusion of LiDAR (PTv3) [5] and image (DINOv3 or DINOv2) features
- Evaluated on nuScenes-Lidarseg for real-world driving scenes
- Includes PointTransformerV3 integration and a flexible feature-fusion design
- 3D Backbone: PointTransformerV3 (LiDAR feature encoder)
- 2D Backbone: DINOv3 ViT-S/16 or DINOv2 (visual feature extractor)
- Fusion Module: projection-based cross-modal MLP fusion (see the sketch after this list)
- Training: Cross-Entropy + Lovasz-Softmax loss for per-class IoU optimization
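The fusion step can be pictured as: project each LiDAR point into the camera image, bilinearly sample the 2D feature map at that location, concatenate the sampled vector with the point's 3D feature, and pass the result through an MLP classification head. The sketch below illustrates this under several assumptions: points are already expressed in the camera frame, a single camera with pinhole intrinsics `K` is used, and the feature map spans the full image; the names `project_points` and `FusionMLP` are illustrative, not the repository's actual modules.

```python
# Illustrative projection-based cross-modal fusion (names are hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F


def project_points(points_cam, K, img_hw):
    """Project points in the camera frame to grid_sample coords in [-1, 1]."""
    uv = points_cam[:, :2] / points_cam[:, 2:3].clamp(min=1e-6)  # perspective divide
    uv = uv @ K[:2, :2].T + K[:2, 2]                             # pixel coordinates (u, v)
    u = uv[:, 0] / (img_hw[1] - 1) * 2 - 1                       # normalize width  -> x
    v = uv[:, 1] / (img_hw[0] - 1) * 2 - 1                       # normalize height -> y
    return torch.stack([u, v], dim=-1)


class FusionMLP(nn.Module):
    """Concatenate per-point 3D features with sampled 2D features and fuse with an MLP."""

    def __init__(self, dim_3d=64, dim_2d=384, dim_hidden=64, num_classes=16):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(dim_3d + dim_2d, dim_hidden), nn.ReLU(inplace=True),
            nn.Linear(dim_hidden, dim_hidden), nn.ReLU(inplace=True),
        )
        self.head = nn.Linear(dim_hidden, num_classes)

    def forward(self, point_feats, img_feats, points_cam, K, img_hw):
        # point_feats: (N, dim_3d) from the LiDAR backbone
        # img_feats:   (1, dim_2d, Hf, Wf) patch features from the image backbone
        grid = project_points(points_cam, K, img_hw).view(1, -1, 1, 2)
        sampled = F.grid_sample(img_feats, grid, align_corners=True)  # (1, dim_2d, N, 1)
        sampled = sampled.squeeze(-1).squeeze(0).T                    # (N, dim_2d)
        fused = self.fuse(torch.cat([point_feats, sampled], dim=-1))
        return self.head(fused)                                       # per-point class logits
```

During training, per-point logits from such a head would be supervised with the Cross-Entropy + Lovasz-Softmax objective listed above.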
Create the dataset directory and download the nuScenes mini split with its lidarseg labels:

```bash
mkdir <Path/To/dataset/nuscenes>
cd <Path/To/dataset/nuscenes>
wget https://www.nuscenes.org/data/v1.0-mini.tgz
tar -xvzf v1.0-mini.tgz
wget https://www.nuscenes.org/data/nuScenes-lidarseg-mini-v1.0.tar.bz2
tar -xvjf nuScenes-lidarseg-mini-v1.0.tar.bz2
```

For the full dataset, download a trainval blob archive, the trainval metadata, and the full lidarseg labels:

```bash
wget -c "https://motional-nuscenes.s3.amazonaws.com/public/v1.0/v1.0-trainval01_blobs.tgz"
tar -xvzf v1.0-trainval01_blobs.tgz
wget -c "https://motional-nuscenes.s3.amazonaws.com/public/v1.0/v1.0-trainval_meta.tgz"
tar -xvzf v1.0-trainval_meta.tgz
wget -c "https://d36yt3mvayqw5m.cloudfront.net/public/nuscenes-lidarseg-v1.0/nuScenes-lidarseg-all-v1.0.tar.bz2"
tar -xvjf nuScenes-lidarseg-all-v1.0.tar.bz2
```

Use one of the train-val blob archives as the test set (labels are provided), for example:

```bash
wget -c "https://motional-nuscenes.s3.amazonaws.com/public/v1.0/v1.0-trainval02_blobs.tgz"
```
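After extraction, a quick sanity check with the official `nuscenes-devkit` (assumed to be installed, e.g. via pip) confirms that the metadata and LiDAR files are found; swap `v1.0-mini` for `v1.0-trainval` when using the full split:

```python
# Minimal dataset sanity check; assumes the nuscenes-devkit package is installed
# and the archives above were extracted into <Path/To/dataset/nuscenes>.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version='v1.0-mini', dataroot='<Path/To/dataset/nuscenes>', verbose=True)
print(f'{len(nusc.sample)} samples loaded')

# Resolve the LiDAR keyframe of the first sample to its point cloud file on disk.
lidar_token = nusc.sample[0]['data']['LIDAR_TOP']
print(nusc.get_sample_data_path(lidar_token))
```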
Clone this repository:

```bash
git clone --recursive git@github.com:CharlesChang012/3D-Semantic-Segmentation-Fusion.git
```

2.2. Launch an Interactive GPU Session if using UMich HPC [6] (Skip to Step 2.4 if using a local machine)
Start a GPU session (see UMich documentation):
```bash
salloc --account=<account> \
       --partition=spgpu,gpu_mig40 \
       --nodes=1 \
       --ntasks=1 \
       --cpus-per-task=4 \
       --gpus=1 \
       --mem=16G \
       --time=01:00:00
```

Helper commands:
- Check available accounts: `my_accounts`
- Monitor active jobs: `watch squeue --me`
Load the CUDA toolchain modules (HPC only) and verify the compiler:

```bash
module load cuda/12.1.1
module load gcc/11.2.0
nvcc --version
```

Enter the repository, then create and activate the conda environment:

```bash
cd 3D-Semantic-Segmentation-Fusion
conda env create -f environment.yml
conda activate 3DSSF
```
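With the environment active, a quick check (assuming PyTorch with CUDA support is installed by `environment.yml`) confirms the GPU is visible before compiling the CUDA extension in the next step:

```python
# Verify the activated environment sees CUDA (assumes PyTorch from environment.yml).
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
```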
Build the pointops extension:

```bash
cd PointTransformerV3/Pointcept/libs/pointops/
python setup.py install
```

Train, test, and run the sampling script:

```bash
python main_train.py
python main_test.py
python main_sample.py
```
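For reference, the per-class IoU that the Lovasz-Softmax term targets can be computed from flat prediction and label arrays as in the illustrative snippet below; the function name and array conventions are assumptions, not the repository's evaluation code.

```python
# Illustrative per-class IoU / mIoU from predicted and ground-truth label arrays
# (both shape (N,), integer class ids in [0, num_classes)). Not the repo's API.
import numpy as np

def per_class_iou(pred, gt, num_classes):
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)            # conf[i, j]: points with label i predicted as j
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return iou, np.nanmean(iou)               # per-class IoU and mIoU over present classes
```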
[1] Anze Liu, 3D Semantic Segmentation with 3D LiDAR Point Clouds and 2D Camera Images for Autonomous Driving, Stanford
[2] Siméoni et al., DINOv3, Meta AI, 2025
[3] Oquab et al., DINOv2: Learning Robust Visual Features without Supervision, Meta AI, 2023
[4] Caesar et al., nuScenes: A Multimodal Dataset for Autonomous Driving, CVPR 2020
[5] Wu et al., Point Transformer V3: Simpler, Faster, Stronger, CVPR 2024
[6] UMich ARC Documentation
