Intel GPU Software Stack — Hardware → PyTorch

The path a compute call takes from a framework down to the silicon on an Intel discrete GPU, with the NVIDIA/CUDA counterpart for each component. Reference device: Battlemage G21 (BMG, Xe2) · oneAPI DPC++ 2025.3 · torch +xpu.
Legend: Framework / app · Runtime / libraries · SPIR-V → ISA (JIT) · User-mode driver (UMD) · Kernel-mode driver (KMD) · Silicon / firmware · NVIDIA counterpart
APP
Application / framework
PyTorch + domain code; issues tensor ops on a GPU device
Component | What it is | NVIDIA counterpart
torch (+xpu) | PyTorch built against the Intel XPU backend; tensors/ops dispatch to SYCL. | torch (+cu12), the CUDA build
torch.xpu | Device API for stream/event/memory management on the GPU. | torch.cuda
c10 / aten XPU | PyTorch's tensor backend implementation: USM allocations, kernels, streams. | c10 / aten CUDA
custom kernels | Hand-written device kernels, e.g. the ARK WOQ GEMM, written in SYCL. | CUDA C++ / CUTLASS kernels
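The torch.xpu / torch.cuda parallel in the table above can be illustrated with a small device-selection sketch. It assumes a PyTorch build recent enough to expose `torch.xpu.is_available()`; the helper name `pick_device` is hypothetical, and the code falls back to "cpu" when PyTorch is not installed at all.

```python
def pick_device() -> str:
    """Return "xpu", "cuda", or "cpu" depending on what this process can see."""
    try:
        import torch  # optional: the sketch still runs without PyTorch
    except ImportError:
        return "cpu"
    # Older builds may not have a torch.xpu module, so look it up defensively.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"   # Intel GPU via the SYCL / Level-Zero stack
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU via the CUDA stack
    return "cpu"


device = pick_device()
print(device)
```

Code written against the returned string (for example `tensor.to(device)`) then stays the same on either vendor's stack.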
SYCL host API · (NVIDIA: CUDA Runtime API)
RT
oneAPI DPC++ / SYCL runtime + math libraries
selects the device, manages USM/queues, exposes device-side templates, hands device images to the driver
Library | What it is | NVIDIA counterpart
libsycl.so · DPC++ SYCL runtime | The SYCL runtime: device discovery, queues, USM, kernel launch, and host-side template glue. Owns most runtime error messages. | libcudart.so (CUDA Runtime) + libnvrtc
ext::oneapi::matrix · joint_matrix<…> | Header-only SYCL templates for warp/sub-group matrix-multiply on the XMX engines; lowered to DPAS by IGC. | nvcuda::wmma / mma.sync (Tensor Core intrinsics)
oneDNN · libdnnl | Deep-learning primitive library: matmul, convolution, attention, with its own tuned GPU codegen. | cuDNN
oneMKL · libmkl_sycl | Math kernel library: dense/sparse BLAS, LAPACK, FFT, RNG for the GPU. | cuBLAS / cuSOLVER / cuFFT / cuRAND
libimf / SVML · device math | Intel math functions (sin, exp, …) linked into device code for accurate/vectorized transcendentals. | libdevice (NVVM math bitcode)
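To see what the SYCL runtime's device discovery reports, one option is the `sycl-ls` tool that ships with oneAPI, restricted to the Level-Zero GPU backend through the `ONEAPI_DEVICE_SELECTOR` environment variable. The sketch below is a minimal wrapper under those assumptions; the function name `list_sycl_devices` is hypothetical, and it returns an empty list when `sycl-ls` is not on the PATH.

```python
import os
import shutil
import subprocess


def list_sycl_devices(selector: str = "level_zero:gpu") -> list:
    """Run sycl-ls with ONEAPI_DEVICE_SELECTOR set and return its output lines."""
    exe = shutil.which("sycl-ls")
    if exe is None:
        return []  # oneAPI environment is not set up on this machine
    # Copy the current environment and add the device filter for the child process.
    env = dict(os.environ, ONEAPI_DEVICE_SELECTOR=selector)
    try:
        result = subprocess.run(
            [exe], capture_output=True, text=True, env=env, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


for line in list_sycl_devices():
    print(line)
```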
device image (SPIR-V or AOT ISA) via Level-Zero · (NVIDIA: PTX/cubin via CUDA driver)
⚡ SPIR-V → GPU ISA — JIT compilation (IGC)
Generic device code ships as portable SPIR-V (spir64) and is JIT-compiled to real GPU ISA at runtime by IGC inside the compute-runtime. AOT builds (spir64_gen -device bmg) skip this step by baking the ISA in at build time. The CUDA analogue is close: PTX is the portable IR, SASS is the final ISA, and the PTX JIT compiler in the driver (the runtime counterpart of the offline ptxas assembler) does the translation.
Component | What it is | NVIDIA counterpart
IGC · libigc2 (intel-graphics-compiler) | The Intel Graphics Compiler: JIT-compiles SPIR-V into Xe machine ISA at kernel-load time. Decides whether high-level ops (e.g. joint_matrix) lower to DPAS. | ptxas + NVVM (JIT PTX → SASS, inside libcuda)
intel-ocloc · offline compiler | Ahead-of-time (AOT) compiler: pre-bakes SPIR-V into device ISA at build time so no runtime JIT is needed. | nvcc / ptxas (AOT to cubin / fatbin)
spir64 image · device binary | The portable intermediate device binary embedded in the host executable; vendor-neutral until JIT'd. | PTX (portable) / cubin (AOT, arch-specific)
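The difference between the JIT and AOT paths shows up as a change in compiler flags. The sketch below builds the two `icpx` command lines as argument lists, assuming the DPC++ driver flags `-fsycl`, `-fsycl-targets=` and `-Xs`; the helper name `sycl_compile_command` and the file names are hypothetical, and nothing is executed.

```python
from typing import List, Optional


def sycl_compile_command(
    source: str, output: str, aot_device: Optional[str] = None
) -> List[str]:
    """Build an icpx argument list for a JIT (spir64) or AOT (spir64_gen) build."""
    cmd = ["icpx", "-fsycl"]
    if aot_device is None:
        # Portable SPIR-V image; IGC compiles it to ISA at kernel-load time.
        cmd.append("-fsycl-targets=spir64")
    else:
        # AOT: the offline compiler bakes ISA for one device into the binary.
        cmd.append("-fsycl-targets=spir64_gen")
        cmd += ["-Xs", f"-device {aot_device}"]
    cmd += [source, "-o", output]
    return cmd


print(sycl_compile_command("kernel.cpp", "kernel_jit"))
print(sycl_compile_command("kernel.cpp", "kernel_aot", aot_device="bmg"))
```

The AOT binary avoids the first-launch JIT cost but only runs on the device it was built for, which mirrors the PTX versus cubin trade-off on the CUDA side.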
Level-Zero / OpenCL dispatch · (NVIDIA: CUDA Driver API)
UMD
User-Mode Driver — Intel compute-runtime (NEO)
implements Level-Zero & OpenCL; embeds IGC; builds and submits GPU command buffers
Library | What it is | NVIDIA counterpart
libze_loader.so · level-zero (libze1) | The vendor-neutral Level-Zero loader: exports the L0 API entry points and dispatches calls to whichever backend driver is installed. Apps link this, not the vendor driver. | (no split: CUDA is single-vendor; closest is the libcuda dispatch itself)
libze_intel_gpu.so · libze-intel-gpu1 (compute-runtime) | Intel's L0 backend driver, the actual GPU implementation the loader routes to. Embeds IGC, builds command buffers, manages submission. Its version is what gates which JIT lowerings are available. | libcuda.so (CUDA driver / user-mode)
intel-opencl-icd · OpenCL UMD | The OpenCL implementation for the GPU; oneDNN's GPU backend needs it for device discovery even when compute runs on L0. | libnvidia-opencl.so (NVIDIA's OpenCL ICD)
libigdgmm12 · Intel GMM | Graphics Memory Management library: tiling/layout and allocation bookkeeping shared by the UMDs. | (internal to libcuda; no separate library)
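A quick way to check which of these user-mode libraries the dynamic linker can see is `ctypes.util.find_library`. The sketch below is only a presence check under the assumption that the libraries are installed on the default linker search path; the mapping `UMD_LIBRARIES` and the function name `find_umd_libraries` are hypothetical, and a value of `None` means "not found by the linker", not necessarily "not installed".

```python
from ctypes.util import find_library

# Short linker names (without the "lib" prefix and ".so" suffix) from the table above.
UMD_LIBRARIES = {
    "ze_loader": "Level-Zero loader",
    "ze_intel_gpu": "Intel Level-Zero backend driver",
    "igdgmm": "Intel GMM",
}


def find_umd_libraries() -> dict:
    """Map each library's short name to the file name the linker resolves, or None."""
    return {name: find_library(name) for name in UMD_LIBRARIES}


for name, path in find_umd_libraries().items():
    status = path if path is not None else "not found"
    print(f"{UMD_LIBRARIES[name]} ({name}): {status}")
```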
ioctl (DRM) → kernel · (NVIDIA: ioctl → nvidia.ko)
KMD
Kernel-Mode Driver (Linux DRM)
GPU scheduling, VRAM/GTT memory, context & command-buffer submission
Component | What it is | NVIDIA counterpart
xe / i915 · DRM driver | In-tree Linux kernel driver for Intel GPUs (Battlemage uses the newer xe); handles scheduling, memory, and contexts via the DRM subsystem. | nvidia.ko (out-of-tree kernel module)
GuC / HuC · firmware | On-GPU firmware microcontrollers for command submission (GuC) and media (HuC). | GSP firmware (RM offload)
/dev/dri/renderD* · device node | The userspace ABI: the render node file that UMDs open and issue ioctls against. | /dev/nvidia* + /dev/nvidiactl
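The render nodes and the kernel driver bound to each one can be inspected from userspace through sysfs. The sketch below assumes the usual Linux layout, where `/sys/class/drm/renderD*/device/vendor` holds the PCI vendor ID and `device/driver` is a symlink to the bound driver; the function name `intel_render_nodes` is hypothetical, and it returns an empty list on machines without an Intel GPU.

```python
import glob
import os

INTEL_VENDOR_ID = "0x8086"  # PCI vendor ID for Intel


def intel_render_nodes() -> list:
    """List Intel render nodes with the kernel driver (e.g. xe or i915) bound to each."""
    nodes = []
    for dev_path in sorted(glob.glob("/dev/dri/renderD*")):
        name = os.path.basename(dev_path)
        sysfs = os.path.join("/sys/class/drm", name, "device")
        try:
            with open(os.path.join(sysfs, "vendor")) as f:
                vendor = f.read().strip()
            # The "driver" entry is a symlink whose last component is the driver name.
            driver = os.path.basename(os.readlink(os.path.join(sysfs, "driver")))
        except OSError:
            continue  # not a PCI device, or sysfs entry missing
        if vendor == INTEL_VENDOR_ID:
            nodes.append({"node": dev_path, "driver": driver})
    return nodes


for entry in intel_render_nodes():
    print(entry["node"], "->", entry["driver"])
```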
MMIO / doorbell
HW
Silicon — Battlemage G21 (Xe2)
the execution units and matrix engines that device kernels ultimately target
Unit | What it is | NVIDIA counterpart
Xe2 Xe-cores · 256 EU (0xe223) | The programmable compute cores; each Xe-core groups several vector EUs plus matrix engines. | SMs (Streaming Multiprocessors)
XMX / DPAS · systolic matrix array | The matrix-multiply accelerators (int8→int32, bf16/f16) that joint_matrix / dpas target. | Tensor Cores
SIMD16 EUs · sub_group = 16 | Vector execution lanes; a SYCL sub-group maps to a hardware SIMD width. | warp (32 threads / SIMT)
VRAM · device + shared USM | On-board GPU memory plus shared/unified allocations addressable from host and device. | VRAM + Unified Memory
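The sub-group versus warp row has a practical consequence when porting launch configurations: the same work-group size splits into twice as many hardware groups at width 16 as at width 32. A small worked example, assuming the sub-group size of 16 listed above and a hypothetical helper name `groups_needed`:

```python
def groups_needed(work_group_size: int, lane_width: int) -> int:
    """Number of sub-groups (or warps) needed to cover a work-group, rounded up."""
    if work_group_size <= 0 or lane_width <= 0:
        raise ValueError("sizes must be positive")
    return -(-work_group_size // lane_width)  # integer ceiling division


SUB_GROUP_WIDTH = 16  # Xe2 sub-group size from the table above
WARP_WIDTH = 32       # NVIDIA warp size

print(groups_needed(256, SUB_GROUP_WIDTH))  # 16 sub-groups
print(groups_needed(256, WARP_WIDTH))       # 8 warps
print(groups_needed(100, SUB_GROUP_WIDTH))  # 7 sub-groups, the last one partly idle
```

Kernels that hard-code a width of 32 in shuffles or reductions therefore need their sub-group logic revisited rather than only their launch sizes.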