Error: problem obtaining number of CUDA devices: 205 #852

@crazynds

I ran this command:

mlcr run-mlperf,inference,_find-performance,_full,_r5.1-dev \
   --model=llama3_1-8b \
   --implementation=reference \
   --framework=vllm \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=50 --rerun

I ended up inside the container with this error:

[2026-03-05 18:38:35,933 repo_action.py : 352 INFO ] - Repository mlperf-automations already exists at /home/mlcuser/MLC/repos/mlcommons@mlperf-automations. Checking for local changes...
[2026-03-05 18:38:35,939 repo_action.py : 363 INFO ] - No local changes detected. Pulling latest changes...
Already up to date.
[2026-03-05 18:38:36,375 repo_action.py : 365 INFO ] - Repository successfully pulled.
[2026-03-05 18:38:36,375 repo_action.py : 379 INFO ] - Registering the repo in repos.json
[2026-03-05 18:38:36,583 script_utils.py:  88 INFO ] - * mlcr app,mlperf,inference,generic,_reference,_llama3_1-8b,_vllm,_cuda,_test,_r5.1-dev_default,_offline
[2026-03-05 18:38:36,588 script_utils.py:  88 INFO ] -   * mlcr detect,os
[2026-03-05 18:38:36,594 module.py      :1018 INFO ] -        ! load /home/mlcuser/MLC/repos/local/cache/detect-os_2aa9b536/mlc-cached-state.json
[2026-03-05 18:38:36,597 script_utils.py:  88 INFO ] -   * mlcr get,sys-utils-mlc
[2026-03-05 18:38:36,598 module.py      :1018 INFO ] -        ! load /home/mlcuser/MLC/repos/local/cache/get-sys-utils-mlc_497501b4/mlc-cached-state.json
[2026-03-05 18:38:36,601 script_utils.py:  88 INFO ] -   * mlcr get,python
[2026-03-05 18:38:36,602 module.py      :1018 INFO ] -        ! load /home/mlcuser/MLC/repos/local/cache/get-python3_22454046/mlc-cached-state.json
[2026-03-05 18:38:36,612 script_utils.py:  88 INFO ] -   * mlcr get,mlcommons,inference,src
[2026-03-05 18:38:36,614 module.py      :1018 INFO ] -        ! load /home/mlcuser/MLC/repos/local/cache/get-mlperf-inference-src_8ebdbf7c/mlc-cached-state.json
[2026-03-05 18:38:36,616 script_utils.py:  88 INFO ] -   * mlcr get,mlperf,inference,utils
[2026-03-05 18:38:36,626 script_utils.py:  88 INFO ] -     * mlcr get,mlperf,inference,src
[2026-03-05 18:38:36,627 module.py      :1018 INFO ] -          ! load /home/mlcuser/MLC/repos/local/cache/get-mlperf-inference-src_8ebdbf7c/mlc-cached-state.json
[2026-03-05 18:38:36,638 module.py      :5081 INFO ] -          ! call "postprocess" from /home/mlcuser/MLC/repos/mlcommons@mlperf-automations/script/get-mlperf-inference-utils/customize.py
[2026-03-05 18:38:36,650 script_utils.py:  88 INFO ] -   * mlcr get,cuda-devices
[2026-03-05 18:38:36,669 script_utils.py:  88 INFO ] -     * mlcr get,cuda,_toolkit
[2026-03-05 18:38:36,687 script_utils.py:  88 INFO ] -       * mlcr detect,os
[2026-03-05 18:38:36,688 module.py      :1018 INFO ] -            ! load /home/mlcuser/MLC/repos/local/cache/detect-os_2aa9b536/mlc-cached-state.json
[2026-03-05 18:38:36,701 module.py      :4064 INFO ] -           # Requested paths: /home/mlcuser/venv/mlcflow/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin:/home/mlcuser/.local/bin:/usr/local/cuda/bin:/usr/cuda/bin:/usr/local/cuda-11/bin:/usr/cuda-11/bin:/usr/local/cuda-12/bin:/usr/cuda-12/bin:/usr/local/packages/cuda
[2026-03-05 18:38:36,717 module.py      :3786 INFO ] -           * /usr/local/cuda/bin/nvcc
[2026-03-05 18:38:36,717 module.py      :4936 INFO ] -                  ! cd /home/mlcuser/MLC/repos/local/cache/get-cuda_d20b3288
[2026-03-05 18:38:36,717 module.py      :4937 INFO ] -                  ! call /home/mlcuser/MLC/repos/mlcommons@mlperf-automations/script/get-cuda/run.sh from tmp-run.sh
[2026-03-05 18:38:36,740 module.py      :5081 INFO ] -                  ! call "detect_version" from /home/mlcuser/MLC/repos/mlcommons@mlperf-automations/script/get-cuda/customize.py
[2026-03-05 18:38:36,750 customize.py   : 107 INFO ] -           Detected version: 12.6
[2026-03-05 18:38:36,750 module.py      :4130 INFO ] -           # Found artifact in /usr/local/cuda/bin/nvcc
[2026-03-05 18:38:36,751 module.py      :4936 INFO ] -              ! cd /home/mlcuser/MLC/repos/local/cache/get-cuda_d20b3288
[2026-03-05 18:38:36,751 module.py      :4937 INFO ] -              ! call /home/mlcuser/MLC/repos/mlcommons@mlperf-automations/script/get-cuda/run.sh from tmp-run.sh
[2026-03-05 18:38:36,773 module.py      :5081 INFO ] -              ! call "postprocess" from /home/mlcuser/MLC/repos/mlcommons@mlperf-automations/script/get-cuda/customize.py
[2026-03-05 18:38:36,782 customize.py   : 107 INFO ] -           Detected version: 12.6
[2026-03-05 18:38:36,792 module.py      :1844 INFO ] -         - cache UID: d20b328851f4435e
[2026-03-05 18:38:36,792 module.py      :1911 INFO ] - ENV[CUDA_HOME]: /usr/local/cuda
[2026-03-05 18:38:36,792 module.py      :1911 INFO ] - ENV[MLC_CUDA_PATH_LIB_CUDNN_EXISTS]: no
[2026-03-05 18:38:36,792 module.py      :1911 INFO ] - ENV[MLC_CUDA_VERSION]: 12.6
[2026-03-05 18:38:36,792 module.py      :1911 INFO ] - ENV[MLC_CUDA_VERSION_STRING]: cu126
[2026-03-05 18:38:36,792 module.py      :1911 INFO ] - ENV[MLC_NVCC_BIN_WITH_PATH]: /usr/local/cuda/bin/nvcc
[2026-03-05 18:38:36,803 module.py      :4936 INFO ] -          ! cd /home/mlcuser/MLC/repos/local/cache/get-cuda-devices_103b073c
[2026-03-05 18:38:36,803 module.py      :4937 INFO ] -          ! call /home/mlcuser/MLC/repos/mlcommons@mlperf-automations/script/get-cuda-devices/run.sh from tmp-run.sh
rm: cannot remove 'a.out': No such file or directory

NVCC path: /usr/local/cuda/bin/nvcc


Checking compiler version ...

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:34:21_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0

Compiling program ...


Running program ...

[2026-03-05 18:38:37,438 module.py      :5011 INFO ] - ========================================================
[2026-03-05 18:38:37,439 module.py      :5013 INFO ] - Print file tmp-run.out:
[2026-03-05 18:38:37,439 module.py      :5014 INFO ] - 
[2026-03-05 18:38:37,439 module.py      :5015 INFO ] - Error: problem obtaining number of CUDA devices: 205

[2026-03-05 18:38:37,439 module.py      :5016 INFO ] - 
Traceback (most recent call last):
  File "/home/mlcuser/venv/mlcflow/bin/mlcr", line 8, in <module>
    sys.exit(mlcr())
  File "/home/mlcuser/venv/mlcflow/lib/python3.10/site-packages/mlc/main.py", line 91, in mlcr
    mlc_expand_short("run")
  File "/home/mlcuser/venv/mlcflow/lib/python3.10/site-packages/mlc/main.py", line 88, in mlc_expand_short
    main()
  File "/home/mlcuser/venv/mlcflow/lib/python3.10/site-packages/mlc/main.py", line 380, in main
    res = method(run_args)
  File "/home/mlcuser/venv/mlcflow/lib/python3.10/site-packages/mlc/script_action.py", line 386, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/venv/mlcflow/lib/python3.10/site-packages/mlc/script_action.py", line 282, in call_script_module_function
    raise ScriptExecutionError(f"Script {function_name} execution failed in {module_path}. \nError : {error}")
mlc.script_action.ScriptExecutionError: Script run execution failed in /home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py. 
Error : Native run script failed inside MLC script (name = get-cuda-devices, return code = 256)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please file an issue at https://github.com/mlcommons/mlperf-automations/issues along with the full MLC command being run and the relevant
or full console log.
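
A note for reading the log above: the `return code = 256` reported for `get-cuda-devices` looks like a raw POSIX wait status rather than a plain exit code. That is an assumption, since the log does not say how the value was produced. Under that assumption, a minimal Python sketch for decoding it (the helper name `exit_code_from_wait_status` is hypothetical):

```python
import os


def exit_code_from_wait_status(status: int) -> int:
    """Convert a raw wait status (as returned by os.system on POSIX) to an exit code.

    Assumption: the value in the log is such a wait status. If it is already
    a plain exit code, this conversion does not apply.
    """
    # os.waitstatus_to_exitcode is available in Python 3.9 and later.
    return os.waitstatus_to_exitcode(status)


if __name__ == "__main__":
    print(exit_code_from_wait_status(256))
```

If the assumption holds, 256 corresponds to the native script exiting with status 1, which only says that the device-count program failed; the more specific value is the `205` printed in `tmp-run.out`.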

After that, I ran:

mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only \
   --model=llama3_1-8b \
   --implementation=reference \
   --framework=vllm \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

It failed with the same error:

[2026-03-05 18:42:51,125 script_utils.py:  88 INFO ] - * mlcr run-mlperf,inference,_full,_r5.1-dev,_performance-only
[2026-03-05 18:42:51,127 script_utils.py:  88 INFO ] -   * mlcr detect,os
[2026-03-05 18:42:51,134 module.py      :1018 INFO ] -        ! load /home/mlcuser/MLC/repos/local/cache/detect-os_2aa9b536/mlc-cached-state.json
[2026-03-05 18:42:51,136 script_utils.py:  88 INFO ] -   * mlcr detect,cpu
[2026-03-05 18:42:51,137 module.py      :1018 INFO ] -        ! load /home/mlcuser/MLC/repos/local/cache/detect-cpu_c14c2896/mlc-cached-state.json
[2026-03-05 18:42:51,140 script_utils.py:  88 INFO ] -   * mlcr get,python3
[2026-03-05 18:42:51,142 module.py      :1018 INFO ] -        ! load /home/mlcuser/MLC/repos/local/cache/get-python3_22454046/mlc-cached-state.json
[2026-03-05 18:42:51,152 script_utils.py:  88 INFO ] -   * mlcr get,mlcommons,inference,src
[2026-03-05 18:42:51,153 module.py      :1018 INFO ] -        ! load /home/mlcuser/MLC/repos/local/cache/get-mlperf-inference-src_8ebdbf7c/mlc-cached-state.json
[2026-03-05 18:42:51,157 script_utils.py:  88 INFO ] -   * mlcr get,sut,description
[2026-03-05 18:42:51,162 script_utils.py:  88 INFO ] -     * mlcr detect,os
[2026-03-05 18:42:51,163 module.py      :1018 INFO ] -          ! load /home/mlcuser/MLC/repos/local/cache/detect-os_2aa9b536/mlc-cached-state.json
[2026-03-05 18:42:51,165 script_utils.py:  88 INFO ] -     * mlcr detect,cpu
[2026-03-05 18:42:51,166 module.py      :1018 INFO ] -          ! load /home/mlcuser/MLC/repos/local/cache/detect-cpu_c14c2896/mlc-cached-state.json
[2026-03-05 18:42:51,169 script_utils.py:  88 INFO ] -     * mlcr get,python3
[2026-03-05 18:42:51,171 module.py      :1018 INFO ] -          ! load /home/mlcuser/MLC/repos/local/cache/get-python3_22454046/mlc-cached-state.json
[2026-03-05 18:42:51,186 script_utils.py:  88 INFO ] -     * mlcr get,compiler
[2026-03-05 18:42:51,187 module.py      :1018 INFO ] -          ! load /home/mlcuser/MLC/repos/local/cache/get-llvm_8c45579b/mlc-cached-state.json
[2026-03-05 18:42:51,191 script_utils.py:  88 INFO ] -     * mlcr get,cuda-devices
[2026-03-05 18:42:51,204 script_utils.py:  88 INFO ] -       * mlcr get,cuda,_toolkit
[2026-03-05 18:42:51,205 module.py      :1018 INFO ] -            ! load /home/mlcuser/MLC/repos/local/cache/get-cuda_d20b3288/mlc-cached-state.json
[2026-03-05 18:42:51,206 module.py      :1911 INFO ] - ENV[CUDA_HOME]: /usr/local/cuda
[2026-03-05 18:42:51,206 module.py      :1911 INFO ] - ENV[MLC_CUDA_PATH_LIB_CUDNN_EXISTS]: no
[2026-03-05 18:42:51,206 module.py      :1911 INFO ] - ENV[MLC_CUDA_VERSION]: 12.6
[2026-03-05 18:42:51,206 module.py      :1911 INFO ] - ENV[MLC_CUDA_VERSION_STRING]: cu126
[2026-03-05 18:42:51,206 module.py      :1911 INFO ] - ENV[MLC_NVCC_BIN_WITH_PATH]: /usr/local/cuda/bin/nvcc
[2026-03-05 18:42:51,217 module.py      :4936 INFO ] -            ! cd /home/mlcuser/MLC/repos/local/cache/get-cuda-devices_103b073c
[2026-03-05 18:42:51,217 module.py      :4937 INFO ] -            ! call /home/mlcuser/MLC/repos/mlcommons@mlperf-automations/script/get-cuda-devices/run.sh from tmp-run.sh

NVCC path: /usr/local/cuda/bin/nvcc


Checking compiler version ...

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:34:21_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0

Compiling program ...


Running program ...

[2026-03-05 18:42:51,857 module.py      :5011 INFO ] - ========================================================
[2026-03-05 18:42:51,857 module.py      :5013 INFO ] - Print file tmp-run.out:
[2026-03-05 18:42:51,857 module.py      :5014 INFO ] - 
[2026-03-05 18:42:51,857 module.py      :5015 INFO ] - Error: problem obtaining number of CUDA devices: 205

[2026-03-05 18:42:51,857 module.py      :5016 INFO ] - 
Traceback (most recent call last):
  File "/home/mlcuser/venv/mlcflow/bin/mlcr", line 8, in <module>
    sys.exit(mlcr())
  File "/home/mlcuser/venv/mlcflow/lib/python3.10/site-packages/mlc/main.py", line 91, in mlcr
    mlc_expand_short("run")
  File "/home/mlcuser/venv/mlcflow/lib/python3.10/site-packages/mlc/main.py", line 88, in mlc_expand_short
    main()
  File "/home/mlcuser/venv/mlcflow/lib/python3.10/site-packages/mlc/main.py", line 380, in main
    res = method(run_args)
  File "/home/mlcuser/venv/mlcflow/lib/python3.10/site-packages/mlc/script_action.py", line 386, in run
    return self.call_script_module_function("run", run_args)
  File "/home/mlcuser/venv/mlcflow/lib/python3.10/site-packages/mlc/script_action.py", line 282, in call_script_module_function
    raise ScriptExecutionError(f"Script {function_name} execution failed in {module_path}. \nError : {error}")
mlc.script_action.ScriptExecutionError: Script run execution failed in /home/mlcuser/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py. 
Error : Native run script failed inside MLC script (name = get-cuda-devices, return code = 256)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please file an issue at https://github.com/mlcommons/mlperf-automations/issues along with the full MLC command being run and the relevant
or full console log.
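
Both runs fail at the same step, where the compiled device-count program asks the CUDA runtime how many devices exist. One way to narrow this down is to check whether any GPU is visible from the same shell at all. The sketch below is a hedged example, not part of the MLC tooling: it assumes `nvidia-smi` is the right probe for this environment, and the function names `parse_gpu_list` and `list_visible_gpus` are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def parse_gpu_list(text: str) -> List[str]:
    """Return the lines of `nvidia-smi -L` output that describe a GPU."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip().startswith("GPU ")
    ]


def list_visible_gpus(timeout_s: float = 30.0) -> Optional[List[str]]:
    """List GPUs visible to this process, or None if nvidia-smi cannot be run."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        proc = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if proc.returncode != 0:
        return None
    return parse_gpu_list(proc.stdout)


if __name__ == "__main__":
    gpus = list_visible_gpus()
    if gpus is None:
        print("nvidia-smi is missing or failed; the driver may not be exposed here")
    elif not gpus:
        print("nvidia-smi ran but listed no GPUs")
    else:
        print("\n".join(gpus))
```

Running this inside the container and on the host and comparing the two results would show whether the problem is specific to the container. It does not explain error 205 by itself.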
