Intel GPU Software Stack — Hardware → PyTorch

The path a compute call takes from a framework down to the silicon on an Intel discrete GPU, with the NVIDIA/CUDA counterpart for each component. Reference device: Battlemage G21 (BMG, Xe2) · oneAPI DPC++ 2025.3 · torch +xpu.

Legend: Framework / app · Runtime / libraries · SPIR-V → ISA (JIT) · User-mode driver (UMD) · Kernel-mode driver (KMD) · Silicon / firmware · NVIDIA counterpart

APP

Application / framework

PyTorch + domain code; issues tensor ops on a GPU device

Component · What it is · NVIDIA counterpart

torch (+xpu)

PyTorch built against the Intel XPU backend; tensors/ops dispatch to SYCL.

NVIDIA: torch (+cu12), the CUDA build

torch.xpu

Device API for stream/event/memory management on the GPU.

NVIDIA: torch.cuda

c10 / aten XPU

PyTorch's tensor backend implementation: USM allocations, kernels, streams.

NVIDIA: c10 / aten CUDA

custom kernels

Hand-written device kernels, e.g. the ARK WOQ GEMM, written in SYCL.

NVIDIA: CUDA C++ / CUTLASS kernels
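The torch.xpu / torch.cuda pairing above can be illustrated with a small device-selection sketch. It assumes a PyTorch build recent enough to expose torch.xpu.is_available(); the helper name pick_device is hypothetical, and the code falls back to "cpu" when PyTorch or a GPU backend is absent.

```python
def pick_device() -> str:
    """Return "xpu", "cuda", or "cpu" depending on what this PyTorch build can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed; nothing to dispatch to

    # torch.xpu only exists in builds that carry the Intel XPU backend.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print("selected device:", pick_device())
```

With a device string in hand, the same tensor code (for example `torch.ones(4, 4, device=pick_device())`) runs unchanged on either stack; only the backend underneath differs.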

▼ SYCL host API · (NVIDIA: CUDA Runtime API)

oneAPI DPC++ / SYCL runtime + math libraries

selects the device, manages USM/queues, exposes device-side templates, hands device images to the driver

Library · What it is · NVIDIA counterpart

libsycl.so · DPC++ SYCL runtime

The SYCL runtime: device discovery, queues, USM, kernel launch, and host-side template glue. Most runtime error messages originate here.

NVIDIA: libcudart.so (CUDA Runtime) + libnvrtc

ext::oneapi::matrix · joint_matrix<…>

Header-only SYCL templates for sub-group (warp-level) matrix multiply on the XMX engines; IGC lowers them to DPAS.

NVIDIA: nvcuda::wmma / mma.sync (Tensor Core intrinsics)

oneDNN · libdnnl

Deep-learning primitive library: matmul, convolution, attention, with its own tuned GPU codegen.

NVIDIA: cuDNN

oneMKL · libmkl_sycl

Math kernel library: dense/sparse BLAS, LAPACK, FFT, RNG for the GPU.

NVIDIA: cuBLAS / cuSOLVER / cuFFT / cuRAND

libimf / SVML · device math

Intel math functions (sin, exp, …) linked into device code for accurate/vectorized transcendentals.

NVIDIA: libdevice (NVVM math bitcode)

▼ device image (SPIR-V or AOT ISA) via Level-Zero · (NVIDIA: PTX/cubin via CUDA driver)

⚡ SPIR-V → GPU ISA — JIT compilation (IGC)

Generic device code ships as portable SPIR-V (target spir64) and is JIT-compiled to native GPU ISA at runtime by IGC, which the compute-runtime invokes. AOT builds (spir64_gen -device bmg) skip this step by baking the ISA in at build time. The closest CUDA analogue: PTX is the portable IR, SASS is the final ISA, and the PTX compiler in the driver (the runtime counterpart of ptxas) is the JIT.

Component · What it is · NVIDIA counterpart

IGC · libigc2 (intel-graphics-compiler)

The Intel Graphics Compiler: JIT-compiles SPIR-V into Xe machine ISA at kernel-load time. Decides whether high-level ops (e.g. joint_matrix) lower to DPAS.

NVIDIA: ptxas + NVVM (JIT PTX → SASS, inside libcuda)

intel-ocloc · offline compiler

Ahead-of-time (AOT) compiler: pre-bakes SPIR-V into device ISA at build time so no runtime JIT is needed.

NVIDIA: nvcc / ptxas (AOT to cubin / fatbin)

spir64 image · device binary

The portable intermediate device binary embedded in the host executable; vendor-neutral until JIT'd.

NVIDIA: PTX (portable) / cubin (AOT, arch-specific)
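The JIT-versus-AOT choice described above comes down to which target the device code is compiled for. The sketch below only assembles the two command lines. The source gives just "spir64" and "spir64_gen -device bmg", so the icpx driver name and the -fsycl, -fsycl-targets= and -Xs spellings are assumptions based on common DPC++ usage, and sycl_compile_cmd is a hypothetical helper.

```python
import shlex


def sycl_compile_cmd(source, output, aot_device=None):
    """Build an icpx command line for a JIT (spir64) or AOT (spir64_gen) build.

    Assumption: flag spellings follow common DPC++ usage; check them against
    the installed compiler before relying on them.
    """
    cmd = ["icpx", "-fsycl"]
    if aot_device is None:
        # Portable SPIR-V image; IGC compiles it to ISA at runtime.
        cmd.append("-fsycl-targets=spir64")
    else:
        # Device ISA is produced at build time for the named device.
        cmd += ["-fsycl-targets=spir64_gen", "-Xs", f"-device {aot_device}"]
    cmd += [source, "-o", output]
    return cmd


if __name__ == "__main__":
    print(shlex.join(sycl_compile_cmd("kernel.cpp", "kernel_jit")))
    print(shlex.join(sycl_compile_cmd("kernel.cpp", "kernel_aot", aot_device="bmg")))
```

Returning an argument list rather than a string keeps the "-device bmg" backend option as a single argument, which is how it has to reach the offline compiler.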

▼ Level-Zero / OpenCL dispatch · (NVIDIA: CUDA Driver API)

UMD

User-Mode Driver — Intel compute-runtime (NEO)

implements Level-Zero & OpenCL; embeds IGC; builds and submits GPU command buffers

Library · What it is · NVIDIA counterpart

libze_loader.so · level-zero (libze1)

The vendor-neutral Level-Zero loader: exports the L0 API entry points and dispatches calls to whichever backend driver is installed. Apps link this, not the vendor driver.

NVIDIA: no loader/backend split, since CUDA is single-vendor; the closest match is libcuda's own dispatch

libze_intel_gpu.so · libze-intel-gpu1 (compute-runtime)

Intel's L0 backend driver, the actual GPU implementation the loader routes to. Embeds IGC, builds command buffers, and manages submission. Its installed version gates which JIT lowerings are available.

NVIDIA: libcuda.so (CUDA driver / user-mode)

intel-opencl-icd · OpenCL UMD

The OpenCL implementation for the GPU; oneDNN's GPU backend needs it for device discovery even when compute runs on L0.

NVIDIA: libnvidia-opencl.so (NVIDIA's OpenCL ICD)

libigdgmm12 · Intel GMM

Graphics Memory Management library: tiling/layout and allocation bookkeeping shared by the UMDs.

NVIDIA: internal to libcuda; no separate library
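One quick way to see which of the user-mode pieces above are present on a machine is to ask the dynamic linker's search path. This is a minimal sketch using only the Python standard library; the short library names passed to find_library are assumptions derived from the file names in the table, and a None result only means the library is not on the default search path (oneAPI libraries often live outside it until the oneAPI environment is sourced).

```python
from ctypes.util import find_library

# Short linker names (assumed from the table's file names) -> what they are.
USER_MODE_LIBS = {
    "sycl": "DPC++ SYCL runtime",
    "ze_loader": "Level-Zero loader",
    "ze_intel_gpu": "Intel Level-Zero backend (compute-runtime)",
    "igc": "Intel Graphics Compiler",
    "igdgmm": "Intel GMM",
}


def probe_user_mode_stack():
    """Map each short library name to the file the linker would resolve, or None."""
    return {name: find_library(name) for name in USER_MODE_LIBS}


if __name__ == "__main__":
    for name, found in probe_user_mode_stack().items():
        status = found if found else "not found on default search path"
        print(f"{USER_MODE_LIBS[name]:45s} {status}")
```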

▼ ioctl (DRM) → kernel · (NVIDIA: ioctl → nvidia.ko)

KMD

Kernel-Mode Driver (Linux DRM)

GPU scheduling, VRAM/GTT memory, context & command-buffer submission

Component · What it is · NVIDIA counterpart

xe / i915 · DRM driver

In-tree Linux kernel driver for Intel GPUs (Battlemage uses the newer xe); handles scheduling, memory, contexts via the DRM subsystem.

NVIDIA: nvidia.ko (out-of-tree kernel module)

GuC / HuC · firmware

On-GPU firmware microcontrollers for command submission (GuC) and media (HuC).

NVIDIA: GSP firmware (RM offload)

/dev/dri/renderD* · device node

The userspace ABI: the render node file that UMDs open and issue ioctls against.

NVIDIA: /dev/nvidia* + /dev/nvidiactl
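To check which kernel driver sits behind each render node, the node name can be matched against sysfs. The sketch below assumes the usual Linux layout, where /sys/class/drm/<node>/device/driver is a symlink to the bound driver; the function name render_nodes and its two path parameters are hypothetical and exist only to make the paths overridable.

```python
import glob
import os


def render_nodes(dev_dir="/dev/dri", sysfs_drm="/sys/class/drm"):
    """Return (render node path, bound kernel driver or None) pairs."""
    nodes = []
    for path in sorted(glob.glob(os.path.join(dev_dir, "renderD*"))):
        name = os.path.basename(path)
        # The 'driver' entry is a symlink whose target directory is named
        # after the kernel driver, e.g. "xe" or "i915".
        link = os.path.join(sysfs_drm, name, "device", "driver")
        driver = os.path.basename(os.path.realpath(link)) if os.path.exists(link) else None
        nodes.append((path, driver))
    return nodes


if __name__ == "__main__":
    for path, driver in render_nodes():
        print(path, "->", driver or "unknown driver")
```

On a machine without a GPU, or inside a container that does not pass /dev/dri through, the list is simply empty.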

▼ MMIO / doorbell

Silicon — Battlemage G21 (Xe2)

the execution units and matrix engines that device kernels ultimately target

Unit · What it is · NVIDIA counterpart

Xe2 Xe-cores · 256 EU (0xe223)

The programmable compute cores; each Xe-core groups several vector EUs plus matrix engines.

NVIDIA: SMs (Streaming Multiprocessors)

XMX / DPAS · systolic matrix array

The matrix-multiply accelerators (int8→int32, bf16/f16) that joint_matrix / dpas target.

NVIDIA: Tensor Cores

SIMD16 EUs · sub_group = 16

Vector execution lanes; a SYCL sub-group maps to a hardware SIMD width.

NVIDIA: warp (32 threads / SIMT)

VRAM · device + shared USM

On-board GPU memory plus shared/unified allocations addressable from host and device.

NVIDIA: VRAM + Unified Memory
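The sub-group versus warp row has a practical consequence when porting launch configurations: the same work-group splits into twice as many scheduling units at width 16 as at width 32. A small worked sketch follows, using only the two widths stated in the table; the helper name groups_needed is hypothetical.

```python
INTEL_SUB_GROUP = 16  # from the table: SIMD16 EUs, sub_group = 16
NVIDIA_WARP = 32      # from the table: warp of 32 threads


def groups_needed(work_group_size, width):
    """Number of sub-groups (or warps) needed to cover one work-group."""
    if work_group_size <= 0 or width <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: a partially filled group still occupies a full slot.
    return -(-work_group_size // width)


if __name__ == "__main__":
    for size in (64, 100, 256):
        print(size, groups_needed(size, INTEL_SUB_GROUP), groups_needed(size, NVIDIA_WARP))
```

For a 256-item work-group this gives 16 sub-groups at width 16 and 8 warps at width 32, so any code that hard-codes 32 lanes per group needs revisiting when moved across.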

How to read this: a call descends from the framework through the SYCL runtime, crosses the SPIR-V→ISA boundary (JIT-compiled by IGC unless built AOT), is dispatched by the compute-runtime UMD through the DRM kernel driver, and finally executes on the Xe-cores / XMX array. The NVIDIA entry in each row gives the closest CUDA equivalent at that layer. The biggest structural difference is that Intel's stack is split along vendor-neutral interfaces (a loader plus a backend driver, with portable SPIR-V JIT-compiled on the target), whereas NVIDIA's is single-vendor and monolithic (one libcuda.so with the PTX→SASS JIT built in).