[GSD-12341] [Arrow Lake] System Hangs & >4GB Allocation Failures on OEM Device with Fixed 256MB BAR #890

@psilofski

System Info:

CPU: Intel Core Ultra 7 (Arrow Lake 255H / ThinkBook 16 G8)
GPU: Intel Arc iGPU (Xe-LPG / 140T)
OS: Ubuntu 24.04 (kernel 6.17); the issue also reproduces on Windows 11 with the latest drivers
Driver: Intel Compute Runtime (Level Zero / NEO)

Issue Description:
I am encountering two distinct, reproducible failure modes on this Arrow Lake platform under compute workloads (PyTorch/OpenVINO):

  1. Allocation Failure (OOM): The runtime fails to allocate single contiguous memory blocks larger than ~4GB, despite the system having 60GB+ of free RAM. (Allocating the same total amount in smaller chunks succeeds).
  2. System Instability (Kernel Panic): During heavy compute tasks involving high-bandwidth access (e.g., VAE Decode, Large Context LLM), the system suffers hard freezes/kernel panics, likely due to GTT thrashing.
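
To pin down the single-allocation ceiling from item 1 more precisely, a bisection over the requested size can help. The following is a minimal sketch, not a tested reproducer: `largest_single_allocation` and `try_alloc` are hypothetical names, the search assumes that once a size fails every larger size fails too, and the stand-in allocator simply simulates a 4 GiB limit.

```python
from typing import Callable

GiB = 1024 ** 3
MiB = 1024 ** 2


def largest_single_allocation(
    try_alloc: Callable[[int], bool],
    lo: int,
    hi: int,
    step: int = 64 * MiB,
) -> int:
    """Bisect for the largest size in bytes within [lo, hi] that try_alloc accepts.

    Assumes failures are monotonic (if n bytes fails, anything larger fails).
    Returns 0 if even `lo` fails; the result is accurate to within `step` bytes.
    """
    if not try_alloc(lo):
        return 0
    if try_alloc(hi):
        return hi
    # Invariant from here on: `lo` succeeds and `hi` fails.
    while hi - lo > step:
        mid = (lo + hi) // 2
        if try_alloc(mid):
            lo = mid
        else:
            hi = mid
    return lo


def simulated_4gib_limit(n: int) -> bool:
    """Stand-in allocator that mimics a hard 4 GiB per-allocation limit."""
    return n <= 4 * GiB


if __name__ == "__main__":
    limit = largest_single_allocation(simulated_4gib_limit, 1 * GiB, 16 * GiB)
    print(f"largest successful single allocation: {limit / GiB:.2f} GiB")
```

On real hardware, `try_alloc` could wrap something like `torch.empty(n, dtype=torch.uint8, device="xpu")` in a `try/except RuntimeError` that returns `False` on failure. This assumes a PyTorch build with XPU support, and each probe tensor should be released before the next attempt.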

Cross-Validation:
This behavior (Hard Freezes on heavy load, Allocation limits) is observed on both Windows 11 and Linux, strongly suggesting a platform-level firmware constraint rather than an OS-specific driver bug.

Root Cause Investigation:
lspci indicates that the device supports Physical Resizable BAR. However, the OEM firmware (Lenovo) locks the CPU-visible aperture to a legacy 256 MB, with no exposed option to enable or resize it.
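
For anyone who wants to cross-check the aperture size without relying on lspci formatting, the BAR sizes can also be derived from the sysfs `resource` file on Linux, where each line holds a start address, end address, and flags in hex. The sketch below rests on assumptions: the device address `0000:00:02.0` is only the usual location of an Intel iGPU, and `SAMPLE` contains made-up values for illustration, not output from this machine.

```python
from pathlib import Path

MiB = 1024 ** 2

# Illustrative, made-up contents of a sysfs `resource` file (start, end, flags).
SAMPLE = """\
0x0000005010000000 0x0000005010ffffff 0x0000000000140204
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000004000000000 0x000000400fffffff 0x000000000014220c
"""


def bar_sizes(resource_text: str) -> list[int]:
    """Return the size in bytes of each region listed in a sysfs PCI `resource` file.

    Unused regions (start and end both zero) are reported as 0 so that the
    list index still matches the region index.
    """
    sizes = []
    for line in resource_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        start, end, _flags = (int(f, 16) for f in fields)
        if start == 0 and end == 0:
            sizes.append(0)
        else:
            sizes.append(end - start + 1)
    return sizes


def read_bar_sizes(bdf: str = "0000:00:02.0") -> list[int]:
    """Read region sizes for a PCI device; the default address is an assumption."""
    path = Path("/sys/bus/pci/devices") / bdf / "resource"
    return bar_sizes(path.read_text())


if __name__ == "__main__":
    for index, size in enumerate(bar_sizes(SAMPLE)):
        if size:
            print(f"region {index}: {size // MiB} MiB")
```

Calling `read_bar_sizes()` on the affected machine would show whether the aperture region really is 256 MiB, independent of how lspci renders it.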

Context:
My understanding is that Intel Arc iGPUs (Xe-LPG) share the same Arc driver stack, virtual memory model, and BAR-style aperture management as discrete Arc GPUs. Discrete Arc GPUs are documented as requiring ReBAR for optimal performance and stability.

Questions:

  1. Architecture: Does the Arrow Lake Arc iGPU share the architectural requirement for Large/Resizable BAR to ensure stability under heavy compute workloads?
  2. Compliance: Is the Compute Runtime expected to handle >4GB contiguous allocations and heavy thrashing gracefully within a 256 MB aperture, or is this considered an unsupported or out-of-spec firmware configuration for this platform?
  3. Triage: Should these crashes be filed as a memory-management bug in the driver, or is this a platform limitation that must be resolved by the OEM firmware?

Goal:
I am trying to determine whether to open a bug report against the driver's memory manager or to escalate this to the OEM as a firmware defect.

Any clarification on the architectural expectations for BAR sizing on Arrow Lake would be greatly appreciated.

    Labels

    Status: Needs Feedback (waiting for additional information from reporter); Type: Bug (general bug report, unexpected behavior or crash)