Jian Zhang1*, Shijie Zhou2*, Bangya Liu3*, Achuta Kadambi2, Zhiwen Fan1
1 Texas A&M University, 2 University of California, Los Angeles, 3 University of Wisconsin-Madison
* Equal contribution.
- 2026-04-10: Added Qwen3.5 training, evaluation, and inference support.
- 2026-02-21: Our paper has been accepted to CVPR 2026. See you in Denver!
SpatialStack progressively aligns vision, geometry, and language representations across model layers, moving beyond single-stage late fusion and improving both local geometric precision and global spatial semantics.
- Systematic Analysis of Fusion Layers. Layer-wise analysis of fusion across vision encoder, geometry encoder, and LLM decoder, revealing a hierarchical geometry-language correspondence.
- SpatialStack Framework. A hierarchical fusion design that progressively aligns multi-level geometric and language features for joint local-global spatial reasoning.
- VLM-SpatialStack Realization. A concrete geometry-aware multimodal LLM with state-of-the-art performance on diverse 3D spatial reasoning benchmarks.
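To make the hierarchical fusion idea concrete, the toy sketch below injects one geometry feature per tapped encoder layer into a matching decoder layer, rather than fusing everything once at the end. It is an illustration under assumptions, not the released implementation: the pairing 11 -> 0, 17 -> 1, 23 -> 2 is read from the model table later in this README, while the additive fusion rule, the `fuse` and `run_decoder` helpers, and the two-element "features" are hypothetical.

```python
# Toy illustration of layer-wise (hierarchical) fusion; not the released code.
# Assumption: geometry-encoder layers 11, 17, 23 feed decoder layers 0, 1, 2.
GEOMETRY_TO_DECODER = {11: 0, 17: 1, 23: 2}


def fuse(hidden, geometry, weight=1.0):
    """Hypothetical fusion rule: element-wise residual addition."""
    if len(hidden) != len(geometry):
        raise ValueError("feature sizes must match")
    return [h + weight * g for h, g in zip(hidden, geometry)]


def run_decoder(hidden, geometry_features, num_layers=4):
    """Walk the decoder layers, injecting geometry where a mapping exists.

    geometry_features maps a geometry-encoder layer index to its feature vector.
    Returns the final hidden state and the decoder layers that received geometry.
    """
    decoder_to_geometry = {d: g for g, d in GEOMETRY_TO_DECODER.items()}
    fused_layers = []
    for layer in range(num_layers):
        source = decoder_to_geometry.get(layer)
        if source is not None and source in geometry_features:
            hidden = fuse(hidden, geometry_features[source])
            fused_layers.append(layer)
        # A real decoder layer (attention + MLP) would transform `hidden` here.
    return hidden, fused_layers


features = {11: [1.0, 0.0], 17: [0.0, 2.0], 23: [0.5, 0.5]}
final_hidden, fused_layers = run_decoder([0.0, 0.0], features)
print(final_hidden, fused_layers)
```

Single-stage late fusion would correspond to a mapping with one entry; the multi-entry mapping is what makes the fusion progressive.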
- Support training, inference, and evaluation based on Qwen3.5.
This public branch documents Qwen3.5 only. Validated with Python 3.12, PyTorch 2.10.0+cu129, flash_attn 2.8.3. Run all commands from the repository root.
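The validated versions above can also be checked programmatically. The following is a minimal sketch that uses only the standard library; the `VALIDATED` table and both helper names are illustrative assumptions, and only the two packages pinned in the sentence above are covered.

```python
from importlib import metadata

# Versions this README reports as validated (distribution name -> version).
VALIDATED = {"torch": "2.10.0", "flash_attn": "2.8.3"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(installed, expected=VALIDATED):
    """Return {name: (installed, expected)} for every mismatch.

    Local build tags such as '+cu129' are ignored, so '2.10.0+cu129'
    is treated as matching '2.10.0'.
    """
    mismatches = {}
    for name, want in expected.items():
        have = installed.get(name)
        base = have.split("+", 1)[0] if have else None
        if base != want:
            mismatches[name] = (have, want)
    return mismatches


if __name__ == "__main__":
    found = {name: installed_version(name) for name in VALIDATED}
    for name, (have, want) in compare_versions(found).items():
        print(f"{name}: installed {have}, validated {want}")
```

No output means both packages match the validated versions; a mismatch does not necessarily mean the code will fail, only that the combination is outside what this README reports as tested.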
conda create -n spatialstack-qwen35 python=3.12 -y
conda activate spatialstack-qwen35

Install Miniconda first (if needed):
curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p "$HOME/miniconda3"
source "$HOME/miniconda3/bin/activate"

pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 \
--index-url https://download.pytorch.org/whl/cu129

Ensure CUDA 12.9 is available in your environment before installing flash_attn.
pip install psutil ninja wheel setuptools packaging
pip install flash_attn==2.8.3 --no-build-isolation

pip install --upgrade transformers==5.3.0 accelerate==1.13.0 qwen_vl_utils==0.0.14 decord
pip install -U git+https://github.com/Dao-AILab/causal-conv1d --no-build-isolation
pip install -U git+https://github.com/fla-org/flash-linear-attention

pip install -e . --no-deps

python - <<'PY'
import torch, transformers, qwen_vl_utils, causal_conv1d, fla, decord
print("torch", torch.__version__, "cuda", torch.version.cuda)
print("transformers", transformers.__version__)
print("qwen_vl_utils", qwen_vl_utils.__version__)
print("causal_conv1d", causal_conv1d.__version__)
print("fla", fla.__version__)
print("decord", decord.__version__)
PY

Additional packages for lmms_eval (optional):
pip install datasets pyarrow evaluate pytablewriter pandas \
loguru jsonlines sqlitedict sacrebleu terminaltables zss tenacity==8.3.0 \
wandb openai tiktoken scipy openpyxl numexpr sympy nltk sentencepiece ftfy \
timm opencv-python-headless av tqdm-multiprocess transformers-stream-generator \
hf_transfer

| Model | Base | Geometry Encoder | Size | Path / HF ID |
|---|---|---|---|---|
| SpatialStack-Qwen3.5-4B | Qwen3.5-4B | VGGT-1B, layers [11,17,23] -> [0,1,2] | 14 GB | Journey9ni/SpatialStack-Qwen3.5-4B |
python scripts/inference/infer.py \
--model-path Journey9ni/SpatialStack-Qwen3.5-4B \
--image assets/sofas.jpg \
--prompt "Describe this scene in a few complete sentences." \
--disable-thinking \
--max-new-tokens 128

Options:
| Flag | Description |
|---|---|
| --model-path | HF model id or local checkpoint path |
| --image / --image-dir / --video | Input visual (mutually exclusive, required) |
| --disable-thinking | Skip reasoning trace, output final answer directly |
| --max-new-tokens | Default 512. Use ~1024 if thinking mode is enabled |
| --no-flash-attn2 | Fall back to non-FlashAttention path |
| --add-frame-index | Insert Frame-i: tokens before each image |
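When inference is driven from Python rather than the shell, the options above can be assembled into an argument list. The sketch below is a hedged example: `build_infer_command` is a hypothetical helper, not part of the repository, and it uses only flags listed in the table. It enforces the rule that exactly one of `--image`, `--image-dir`, and `--video` is given.

```python
import shlex
import sys


def build_infer_command(model_path, prompt, image=None, image_dir=None,
                        video=None, disable_thinking=False, max_new_tokens=None):
    """Assemble argv for scripts/inference/infer.py from the documented flags."""
    inputs = {"--image": image, "--image-dir": image_dir, "--video": video}
    chosen = [(flag, value) for flag, value in inputs.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError("pass exactly one of image, image_dir, or video")
    flag, value = chosen[0]
    cmd = [sys.executable, "scripts/inference/infer.py",
           "--model-path", model_path, flag, str(value), "--prompt", prompt]
    if disable_thinking:
        cmd.append("--disable-thinking")
    if max_new_tokens is not None:
        # When omitted, the script's own default (512) applies.
        cmd += ["--max-new-tokens", str(max_new_tokens)]
    return cmd


if __name__ == "__main__":
    command = build_infer_command(
        "Journey9ni/SpatialStack-Qwen3.5-4B",
        "Describe this scene in a few complete sentences.",
        image="assets/sofas.jpg", disable_thinking=True, max_new_tokens=128)
    # Print a copy-pasteable command instead of launching the model.
    print(" ".join(shlex.quote(part) for part in command))
```

To launch the run, pass the returned list to `subprocess.run(command, check=True)` from the repository root.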
Run with the stock Qwen3.5 base model (no SpatialStack weights)
python scripts/inference/infer.py \
--model-path Qwen/Qwen3.5-4B \
--image assets/sofas.jpg \
--prompt "Describe this scene in a few complete sentences." \
--disable-thinking \
--max-new-tokens 128

MODEL_PATH=Journey9ni/SpatialStack-Qwen3.5-4B \
MODEL_IMPL=qwen3_5 \
MODEL_ARGS_BASE="pretrained=Journey9ni/SpatialStack-Qwen3.5-4B,disable_thinking=true,max_num_frames=32,max_length=12800" \
OUTPUT_ROOT=logs/eval/spatialstack_qwen35_4b \
BENCHMARKS="vsibench" \
bash scripts/evaluation/eval.sh

Available benchmarks: vsibench, cvbench, blink_spatial, sparbench, videomme, mmsibench. To run several, pass them to BENCHMARKS as a comma-separated list.
All eval parameters
| Variable | Description |
|---|---|
| MODEL_PATH | HF model id or local checkpoint path |
| MODEL_IMPL | Model implementation (qwen3_5, spatialstack) |
| OUTPUT_ROOT | Root directory for evaluation outputs |
| BENCHMARKS | Comma-separated benchmark list |
| CUDA_VISIBLE_DEVICES | Select visible GPU ids |
| NUM_MACHINES / PROCESSES_PER_MACHINE / MACHINE_RANK | Distributed launch settings |
| MASTER_ADDR / MASTER_PORT | Multi-node rendezvous settings |
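The variables above can be prepared from Python before calling the evaluation script. The following sketch is an assumption-laden example: `build_eval_env` is a hypothetical helper, the benchmark names come from the list of available benchmarks in this README, and the `MODEL_ARGS_BASE` string is composed in the same `key=value,key=value` form as the example command.

```python
import os

# Benchmark names listed in this README.
AVAILABLE_BENCHMARKS = {"vsibench", "cvbench", "blink_spatial",
                        "sparbench", "videomme", "mmsibench"}


def build_eval_env(model_path, model_impl, benchmarks, output_root,
                   model_args=None, base_env=None):
    """Return an environment dict for scripts/evaluation/eval.sh."""
    if not benchmarks:
        raise ValueError("at least one benchmark is required")
    unknown = [b for b in benchmarks if b not in AVAILABLE_BENCHMARKS]
    if unknown:
        raise ValueError(f"unknown benchmarks: {unknown}")
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "MODEL_PATH": model_path,
        "MODEL_IMPL": model_impl,
        "OUTPUT_ROOT": output_root,
        "BENCHMARKS": ",".join(benchmarks),
    })
    if model_args:
        # Mirrors the README example: pretrained=<path>,key=value,...
        pairs = [f"pretrained={model_path}"]
        pairs += [f"{key}={value}" for key, value in model_args.items()]
        env["MODEL_ARGS_BASE"] = ",".join(pairs)
    return env


if __name__ == "__main__":
    env = build_eval_env(
        "Journey9ni/SpatialStack-Qwen3.5-4B", "qwen3_5", ["vsibench"],
        "logs/eval/spatialstack_qwen35_4b",
        model_args={"disable_thinking": "true", "max_num_frames": 32,
                    "max_length": 12800})
    print(env["BENCHMARKS"], env["MODEL_ARGS_BASE"])
```

The returned dict can be handed to `subprocess.run(["bash", "scripts/evaluation/eval.sh"], env=env, check=True)`; rejecting unknown benchmark names up front avoids starting a long job with a typo.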
Outputs: *_results.json (aggregated metrics), *_samples_<task>.jsonl (per-sample logs).
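A small helper can gather these output files after a run. The sketch below assumes only the file-name patterns stated above; `collect_eval_outputs` is a hypothetical name, and because the JSON schema is not documented here, the results are loaded as-is and the sample logs are only counted line by line.

```python
import json
from pathlib import Path


def collect_eval_outputs(output_root):
    """Find result and sample files under output_root without assuming a schema.

    Returns (results, sample_counts), both keyed by path relative to output_root.
    """
    root = Path(output_root)
    results = {}
    for path in sorted(root.rglob("*_results.json")):
        with path.open(encoding="utf-8") as handle:
            results[str(path.relative_to(root))] = json.load(handle)
    sample_counts = {}
    for path in sorted(root.rglob("*_samples_*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            # One JSON object per non-empty line.
            sample_counts[str(path.relative_to(root))] = sum(
                1 for line in handle if line.strip())
    return results, sample_counts


if __name__ == "__main__":
    results, counts = collect_eval_outputs("logs/eval/spatialstack_qwen35_4b")
    print(f"{len(results)} result file(s), {sum(counts.values())} sample line(s)")
```

Comparing the sample counts against the expected benchmark sizes is a quick way to spot a run that stopped early.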
See TRAINING.md for the full training workflow, including data preparation and launch settings.
Quick start:
MODEL_PATH=Qwen/Qwen3.5-4B \
USE_GEOMETRY_ENCODER=False \
DATA_FLATTEN=False \
OUTPUT_DIR=./output/qwen35_stock_train \
bash scripts/train/train.sh

For multi-node Slurm runs (8 nodes x 8 H200 GPUs):
sbatch scripts/train/slurm/run_qwen35_64gpu_vision.sbatch

Thanks to the following open-source projects: VLM-3R, Spatial-MLLM, VG-LLM, SPAR, Qwen3-VL, Qwen3.5, Cambrian-S, LLaVA-Hound-DPO, VGGT, Thinking in Space
If you find this work useful for your research, please consider citing our paper:
@article{zhang2026spatialstack,
title={SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning},
author={Zhang, Jian and Zhou, Shijie and Liu, Bangya and Kadambi, Achuta and Fan, Zhiwen},
journal={arXiv preprint arXiv:2603.27437},
year={2026}
}
