ARK XPU joint_matrix — The Debug Journey

How an ARK m>1 WOQ GEMM "no matrix hardware on the target device" crash on a Battlemage B70 was chased through three wrong turns to a single-variable root cause — and then proven on the real production kernel, torch-free.

Node:        root@10.239.98.43 (b70-pc6)
Device:      Intel Graphics 0xe223 bmg_g31
L0 driver:   1.15.37833 (fixed throughout)
Symptom:     m>1 WOQ GEMM throws at inference
Root cause:  libsycl matrix arch table
Span:        2026-06-03 → 06-05

✓ TL;DR

The crash "joint_matrix is not supported" is decided by a single variable: the libsycl (DPC++ runtime) version. With the GPU, kernel driver, and IGC all held identical, swapping libsycl from oneAPI 2025.3.2 (.so.8) to a 2026 build (.so.9) flips the result FAIL → PASS.

The 2025.3 runtime's compiled matrix supported_archs[] table pre-dates the bmg_g31 entry, so it refuses to advertise ext_intel_matrix on this device even though the silicon, driver, and IGC are all capable.

Verified on ARK's actual production kernel (IGemmDQCore via a torch-free harness), not just a probe. The fix: ship/link a DPC++ runtime whose table includes bmg_g31.

The journey
  1. The symptom & what worked
  2. Theory 1: "missing SPIR-V extension / TLA-off build bug" [wrong]
  3. Theory 2: "IGC is too old" [partial]
  4. Theory 3: "the GPU lacks matrix hardware" [wrong]
  5. Root cause: "it's libsycl" [correct]
  6. Route B: proving it on the real kernel [verified]
  7. Why vllm-xpu-kernels never hits this
  8. The fix & escape hatches

The symptom

w4g128 quantize + repack succeed; the crash is at inference, m>1 only.

The ARK backend UT quantizes facebook/opt-125m to w4g128, writes shards, and repacks to XPU — all fine. It dies inside model.generate():

auto_round_kernel/qlinear.py:242  -> ark.woqgemm(...)
auto_round_kernel/__init__.py:246 -> lib.woqgemm(...)
RuntimeError: no matrix hardware on the target device, joint_matrix is not supported

Two early facts shaped everything after:

The hardware is fine. PyTorch's own XMX matmul (bf16/f16, which routes through oneDNN) works on this exact device. So the silicon and L0 driver expose DPAS.

Only m>1 fails. The m==1 decode/GEMV path works — BesTLA's GEMV does not use joint_matrix. The crash is specifically ARK's hand-written IGemmDQCore int8 tile (M8×N16×K32, sub_group=16), reached via woq_s8 → sycl_igemm_s8s8 for m>1 / prefill.

1

"The .so was built without the matrix-MMA SPIR-V extension"

2026-06-03 · initial analysis
✗ DISPROVEN — byte-identical binaries

The reasoning

Inspecting the shipped auto_round_lib 0.13.1 wheel, the device image contained SPV_INTEL_split_barrier but was missing SPV_INTEL_subgroup_matrix_multiply_accumulate and SPV_INTEL_2d_block_io — symptoms of an ARK_SYCL_TLA=OFF build. In CMakeLists.txt the matrix-MMA extension is declared only inside SYCL_TLA_LINK_FLAGS (gated on ARK_XPU AND ARK_SYCL_TLA), yet IGemmDQCore uses joint_matrix unconditionally. Conclusion at the time: a TLA-off build emits joint_matrix ops without declaring the capability → runtime can't lower them → throw.

Why it's wrong

The C++ joint_matrix API already emits +SPV_INTEL_joint_matrix by default. A jm_probe built with the explicit +SPV_INTEL_subgroup_matrix_multiply_accumulate flag is byte-identical to one built without it — the extra flag adds a different, unused extension. Both fail identically. No build flag, and no ARK_SYCL_TLA=ON rebuild, changes the outcome.
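
The "byte-identical" claim is easy to check mechanically rather than by eye. The following is a minimal sketch, assuming the two build outputs are ordinary files on disk; `files_identical` is a hypothetical helper, not part of ARK or DPC++.

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns true only if both files can be opened
// and contain exactly the same sequence of bytes.
bool files_identical(const std::string& path_a, const std::string& path_b) {
    std::ifstream a(path_a, std::ios::binary);
    std::ifstream b(path_b, std::ios::binary);
    if (!a || !b) return false;  // an unreadable file is treated as "not identical"

    std::istreambuf_iterator<char> a_it(a), b_it(b), end;
    // The four-iterator overload of std::equal also compares the lengths,
    // so a file that is a strict prefix of the other does not match.
    return std::equal(a_it, end, b_it, end);
}
```

Calling it on the two probe binaries (one built with the extra extension flag, one without) should return true if the flag really changes nothing in the output.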

Lesson

The string "no matrix hardware…" lives in libsycl.so.8 (the DPC++ runtime) — not in ARK and not in the driver. That was the first clue the real determinant was the runtime, not the build. It took two more turns to act on it.
2

"The Intel Graphics Compiler (IGC) is too old"

2026-06-03 · first correction
◐ A REAL CORRELATION — but a co-varying proxy

The reasoning

The same wheel binary PASSES on one node and FAILS on another. The visible difference was IGC — the JIT compiler that lowers joint_matrix SPIR-V into Battlemage DPAS:

             PASS node            FAIL node
Wheel md5    88c363a4…            88c363a4… (identical)
libsycl      2025.3.2.20260112    2025.3.2.20260112 (identical)
IGC          2.34.4               2.32.7
Device       Arc Pro B60          Graphics 0xe223 (B70)
Result       PASSED               FAILED

The conclusion — "upgrade libigc2 to ≥ 2.34" — was acted on as THE FIX.

Why it's only partial

This compared two different physical machines. Hardware (B60 vs B70) and other parts of the software stack co-varied with IGC, so the correlation was real but not isolated. IGC turned out to be a passenger, not the driver. (The root-cause experiment in section 4 later holds IGC fixed at 2.32.7 and still flips the result.)

Lesson

A version table across two machines proves correlation, not causation. The only way to assign cause is to change one variable on one machine. That discipline is what finally cracked it.
3

"This GPU silicon doesn't advertise the matrix aspect"

2026-06-05 · second correction
✗ DISPROVEN — same driver + newer libsycl makes the aspect appear

The reasoning

To isolate IGC, a single-machine container experiment held the GPU fixed and upgraded only the user-space IGC/runtime (via --device /dev/dri passthrough):

Environment           IGC      runtime / L0         ext_oneapi_matrix   jm_probe
Host (bare metal)     2.32.7   26.14 / 1.15.37833   NO                  FAIL
Container igc-newer   2.34.4   26.18 / 1.15.38308   NO                  FAIL

IGC 2.34.4 — the supposed "PASS" version from Theory 2 — still failed here. Since the device reported ext_oneapi_matrix = NO in sycl-ls --verbose, the new conclusion was: the B70 (bmg_g31) silicon/driver simply does not advertise the aspect, full stop.

Why it's wrong

This correctly killed Theory 2 (IGC isn't sufficient) but drew the wrong new conclusion. The container only swaps user-space IGC + compute-runtime — it does not swap libsycl. The ext_oneapi_matrix query is answered by libsycl, which was 2025.3 in both rows. So this table actually proves "IGC doesn't matter" and accidentally held the true culprit fixed. The "missing aspect" was real — but it was libsycl refusing to report it, not the hardware lacking it.

Lesson

When you isolate variable A and the bug persists, make sure variable B (here libsycl) didn't silently stay fixed too. "The aspect is NO" is an observation; "the hardware can't" is an inference — and the inference was unsupported.
4

Root cause: it's libsycl — its matrix arch table is stale

2026-06-05 · definitive (CORRECTION #3)
✓ PROVEN — one machine, only libsycl changed

The single-variable experiment

Same physical GPU, same kernel driver (1.15.37833), same IGC (2.32.7). The only thing swapped is libsycl, by rebuilding the same arch_resolve / jm_probe sources against each runtime:

libsycl runtime            GPU / driver          IGC      arch_is(bmg_g31)   has(ext_intel_matrix)   jm_probe
oneAPI 2025.3.2 .so.8      0xe223 / 1.15.37833   2.32.7   YES                NO                      FAIL
nightly 2026-06-04 .so.9   0xe223 / 1.15.37833   2.32.7   YES                YES                     OK

Driver and IGC are byte-identical across rows. Only libsycl differs → the result flips.
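
The difference between the Theory 2 comparison and this one can be stated as a check: count how many variables changed between the two runs. The sketch below is illustrative only. The values are transcribed from the two tables in this write-up, and the helper names (`differing_variables`, `is_single_variable`) are hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

using Config = std::map<std::string, std::string>;

// Names of the variables whose values differ between two runs.
// A key present on only one side also counts as a difference.
std::vector<std::string> differing_variables(const Config& a, const Config& b) {
    std::vector<std::string> diff;
    for (const auto& kv : a) {
        auto it = b.find(kv.first);
        if (it == b.end() || it->second != kv.second) diff.push_back(kv.first);
    }
    for (const auto& kv : b) {
        if (a.find(kv.first) == a.end()) diff.push_back(kv.first);
    }
    return diff;
}

// An A/B comparison can assign cause only when exactly one variable changed.
bool is_single_variable(const Config& a, const Config& b) {
    return differing_variables(a, b).size() == 1;
}

// Theory 2: two physical machines (PASS node vs FAIL node).
const Config theory2_pass = {{"wheel_md5", "88c363a4…"}, {"libsycl", "2025.3.2.20260112"},
                             {"igc", "2.34.4"}, {"device", "Arc Pro B60"}};
const Config theory2_fail = {{"wheel_md5", "88c363a4…"}, {"libsycl", "2025.3.2.20260112"},
                             {"igc", "2.32.7"}, {"device", "Graphics 0xe223 (B70)"}};

// Root cause: one machine, only libsycl swapped.
const Config root_fail = {{"libsycl", "oneAPI 2025.3.2 .so.8"}, {"gpu", "0xe223"},
                          {"driver", "1.15.37833"}, {"igc", "2.32.7"}};
const Config root_pass = {{"libsycl", "nightly 2026-06-04 .so.9"}, {"gpu", "0xe223"},
                          {"driver", "1.15.37833"}, {"igc", "2.32.7"}};
```

With these inputs the Theory 2 pair differs in two recorded variables (device and igc), so it cannot isolate a cause, while the root-cause pair differs only in libsycl. The check only covers variables that were actually recorded. Anything left out of the table can still co-vary unnoticed.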

The smoking gun — a self-contradiction inside 2025.3

On 2025.3 the runtime simultaneously reports:

ext_oneapi_architecture_is(intel_gpu_bmg_g31)  -> YES
has(aspect::ext_intel_matrix)                  -> NO   // contradiction!

But upstream device_impl.hpp defines:

// CASE(ext_intel_matrix):
return any_of(supported_archs, architecture_is);  // supported_archs[] INCLUDES bmg_g31

If arch_is(bmg_g31) is YES, then matrix must be YES — unless the compiled supported_archs[] table in that .so predates the bmg_g31 entry. The 2025.3 table does; the nightly's includes it.
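
The contradiction is easier to see in a toy model of the gate. The sketch below is not libsycl code. It is a simplified stand-in, assuming the runtime keeps one list of architectures it can recognise and a separate compiled list of matrix-capable architectures. The enum, the struct, and the `pvc` / `bmg_g21` entries are illustrative placeholders, and only the presence or absence of `bmg_g31` in the matrix table is taken from the observations above.

```cpp
#include <algorithm>
#include <vector>

// Illustrative architecture enum (placeholder names, not the real DPC++ enum).
enum class Arch { pvc, bmg_g21, bmg_g31 };

inline bool contains(const std::vector<Arch>& table, Arch a) {
    return std::find(table.begin(), table.end(), a) != table.end();
}

struct RuntimeModel {
    std::vector<Arch> known_archs;   // what architecture_is() can recognise
    std::vector<Arch> matrix_archs;  // compiled supported_archs[] for the matrix aspect

    // True when the device is the queried architecture and the runtime knows it.
    bool architecture_is(Arch device, Arch query) const {
        return device == query && contains(known_archs, query);
    }

    // Mirrors the shape of: any_of(supported_archs, architecture_is)
    bool has_matrix(Arch device) const {
        return std::any_of(matrix_archs.begin(), matrix_archs.end(),
                           [&](Arch a) { return architecture_is(device, a); });
    }
};

// Assumed model of the 2025.3 runtime: knows bmg_g31, but its matrix table predates it.
RuntimeModel runtime_2025_3() {
    return RuntimeModel{{Arch::pvc, Arch::bmg_g21, Arch::bmg_g31},
                        {Arch::pvc, Arch::bmg_g21}};
}

// Assumed model of the nightly: same known list, matrix table includes bmg_g31.
RuntimeModel runtime_nightly() {
    return RuntimeModel{{Arch::pvc, Arch::bmg_g21, Arch::bmg_g31},
                        {Arch::pvc, Arch::bmg_g21, Arch::bmg_g31}};
}
```

In this model `runtime_2025_3().architecture_is(Arch::bmg_g31, Arch::bmg_g31)` is true while `runtime_2025_3().has_matrix(Arch::bmg_g31)` is false, which reproduces the YES / NO pair seen on the real runtime. The same device under `runtime_nightly()` reports the aspect.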

Why strings | grep can't see it

The bmg_g31 string is present in both .so files — it backs architecture_is. Only the runtime has() query, which walks the compiled matrix table, distinguishes them. You cannot grep your way to this; you have to run the query.

Why the earlier theories were all passengers

  • Theory 3 (silicon): same driver + newer libsycl → the aspect appears and the DPAS tile runs. The hardware was always capable.
  • Theory 2 (IGC): IGC stayed 2.32.7 across the PASS row. IGC 2.34 "fixing" it elsewhere was a co-varying proxy — those machines also had newer libsycl. Once libsycl advertises the aspect, IGC 2.32.7 lowers the tile fine.
  • Theory 1 (build flags): byte-identical device images; the gate is entirely runtime-side.

Route B — proving it on ARK's real kernel, torch-free

2026-06-05 · end-to-end verification
✓ VERIFIED — production IGemmDQCore, not a probe

Why a torch-free harness was needed

The full ARK UT links torch 2.11.0+xpu, which hard-pins intel-sycl-rt==2025.3.2 and links libsycl.so.8. torch and ARK share one process, so the 2026 runtime (.so.9) can't be swapped under the torch UT without also rebuilding torch. The probe proved the mechanism, but skeptics could ask: is ARK's actual kernel really the same path?

The harness

bestla/bestla/sycl/sycl_gemm.h — where IGemmDQCore lives — is header-only SYCL, no oneDNN, no torch. A ~70-line main() calls the exact launch that DnnlWrapper::sycl_igemm_s8s8 instantiates:

using T = float;
xmx::IGemmDQParam param;
param.A_d=A; param.B_d=B; param.C_d=C;
param.m=m; param.n=n; param.k=k;
param.lda=k; param.ldb=k; param.ldc=n;        // same layout as sycl_igemm_s8s8
param.Bias=nullptr; param.scaleA=sA; param.scaleB=sB;

// the real production joint_matrix kernel:
Launcher<xmx::IGemmDQCfg<T>, xmx::IGemmDQCore>::run(&q, param);
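
To go beyond "the kernel launched and completed", a harness like this could compare the device result against a CPU reference. The sketch below is one possible reference, under assumptions that are not confirmed by the source: A is m×k int8 row-major (lda=k), B is stored as n×k int8 (ldb=k), C is m×n float (ldc=n), scaleA holds one scale per row of A, and scaleB holds one scale per output column. The function name `igemm_dq_reference` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for a dequantizing int8 GEMM (assumed semantics, see lead-in):
//   C[i][j] = scaleA[i] * scaleB[j] * sum_k A[i][k] * B[j][k]
std::vector<float> igemm_dq_reference(const std::vector<std::int8_t>& A,
                                      const std::vector<std::int8_t>& B,
                                      const std::vector<float>& scaleA,
                                      const std::vector<float>& scaleB,
                                      int m, int n, int k) {
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.0f);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            std::int32_t acc = 0;  // accumulate in int32, as int8 dot products do
            for (int kk = 0; kk < k; ++kk) {
                acc += static_cast<std::int32_t>(A[static_cast<std::size_t>(i) * k + kk]) *
                       static_cast<std::int32_t>(B[static_cast<std::size_t>(j) * k + kk]);
            }
            C[static_cast<std::size_t>(i) * n + j] =
                static_cast<float>(acc) * scaleA[i] * scaleB[j];
        }
    }
    return C;
}
```

If the real kernel applies its scales differently (for example per-tensor rather than per-row and per-column), the reference would need the same change before any comparison is meaningful.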

Same source, same GPU/driver/IGC — only libsycl differs

Harness build                              libsycl runtime       Result
icpx 2025.3.3, spir64 → links .so.8        2025.3.2 venv .so.8   FAIL: "no matrix hardware … joint_matrix is not supported"
nightly clang++ 2026-06-04 → links .so.9   2026.0.0 .so.9        PASS: kernel launched & completed

Device 0xe223 (bmg_g31) and L0 driver 1.15.37833 identical across both rows. This is the production IGemmDQCore joint_matrix kernel, not a proxy.

Build flags that matter (learned the hard way)

  • -std=c++20 — the header uses C++20 dependent-type syntax (using Param = CFG::Param;).
  • -DBTLA_SYCL — gates the entire sycl_gemm.h body (it's #ifdef'd out otherwise).
  • -I<kernel>/bestla, include bestla/sycl/sycl_wrapper.h — pulls sycl_utils.h for nd_item_helper (including sycl_gemm.h alone misses it).

The one remaining blocker

True in-process e2e (Route A) needs a 2026-based torch + ARK rebuilt against .so.9, because the SONAME bump (.so.8 → .so.9) means no LD_LIBRARY_PATH drop-in works. The kernel itself is proven fixed; only the torch packaging pin stands between here and a green UT.
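
Why the SONAME bump rules out a drop-in can be shown with a small model of the lookup. A binary records the exact name it was linked against (for example libsycl.so.8), and the dynamic loader searches its directories for a file with that name. The sketch below is a simplified model of that search, not the loader itself. `Dir`, `resolve_needed`, and the directory path are hypothetical.

```cpp
#include <optional>
#include <string>
#include <vector>

// One directory on the library search path and the file names it contains.
struct Dir {
    std::string path;
    std::vector<std::string> files;
};

// Simplified model: a needed library is satisfied only by a file whose name
// equals the recorded SONAME exactly. A newer major version does not match.
std::optional<std::string> resolve_needed(const std::string& needed_soname,
                                          const std::vector<Dir>& search_path) {
    for (const auto& dir : search_path) {
        for (const auto& file : dir.files) {
            if (file == needed_soname) return dir.path + "/" + file;
        }
    }
    return std::nullopt;
}
```

With a search path that contains only libsycl.so.9, a consumer that needs libsycl.so.8 gets no match, however LD_LIBRARY_PATH is ordered. Pointing the path at the new runtime changes which directories are searched, not which name is requested.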

Why vllm-xpu-kernels never hits this

Same DPAS hardware, two different compiler paths; only one goes through the broken stage.

A recurring confusion: "vllm also uses XMX/DPAS on the same hardware — why does it work?" There are two ways to reach the same Battlemage DPAS unit, through different compiler stages:

✗ ARK IGemmDQCore (the failing path)
  1. C++ joint_matrix + joint_matrix_mad (high-level matrix abstraction)
  2. libsycl emits SPIR-V OpJointMatrixMad
  3. *** runtime libsycl gate ***: the 2025.3 table lacks bmg_g31 → refuses to advertise matrix → THROWS before any IGC lowering
  4. dpas on XMX (only on a runtime whose table has bmg_g31)

✓ vllm sycl-tla XE_DPAS_TT (immune)
  1. C++ XE_DPAS_TT<M,s8,s8,d>::fma()
  2. lowers to INLINE ASSEMBLY: asm("dpas.s8.s8.8 …"); the dpas is hand-written in the header
  3. no joint_matrix op, no capability, nothing for the runtime to gate
  4. IGC just assembles the given dpas (any runtime / any IGC works)

Two independent reasons vllm-xpu-kernels is unaffected — either alone is sufficient, and it has both:

  1. No joint_matrix anywhere. grep -rI joint_matrix over the whole tree returns zero hits. Its matrix math is CuTe XE_DPAS_TT (inline dpas asm) or oneDNN dnnl::matmul — neither emits the gated op. (Same reason torch bf16/f16 matmul, which is oneDNN, already passes on the FAIL node.)
  2. AOT compilation. vllm builds with -fsycl-targets=spir64_gen -device bmg — device code is lowered to real Battlemage ISA at build time. ARK ships generic SPIR-V (spir64) and JIT-lowers on the deploy node — which is exactly why the deploy node's runtime became the deciding factor.

The fix & escape hatches

PRIMARY

Upgrade the runtime

Ship/link a DPC++ libsycl whose matrix supported_archs[] includes bmg_g31 (intel/llvm ≥ 2026-06-04, or a oneAPI release carrying that entry). No driver, IGC, or hardware change needed.

Caveat: the SONAME bumps .so.8 → .so.9; the consumer (torch-xpu / ARK .so) must be built against it, so there is no LD_LIBRARY_PATH drop-in.

ESCAPE

Route m>1 through oneDNN

Send w4 m>1 through the oneDNN fp GEMM instead of the fused int8 joint_matrix kernel. oneDNN does its own DPAS codegen and never touches the gated op.
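
One way this escape hatch could be wired in is a dispatch decision made once per call, based on what the runtime reports. The sketch below is hypothetical and is not ARK's actual dispatch code. `GemmPath` and `select_woq_gemm_path` are invented names, and the boolean is assumed to come from a device query such as `has(aspect::ext_intel_matrix)`.

```cpp
// Hypothetical kernel choices for a WOQ GEMM call.
enum class GemmPath {
    Gemv,             // m == 1 decode path, no joint_matrix involved
    JointMatrixInt8,  // fused int8 joint_matrix kernel (IGemmDQCore)
    OneDnnFallback    // oneDNN fp GEMM, does its own DPAS codegen
};

// Picks a path without ever launching the joint_matrix kernel on a runtime
// that does not advertise the matrix aspect.
GemmPath select_woq_gemm_path(int m, bool runtime_reports_matrix) {
    if (m == 1) return GemmPath::Gemv;
    if (runtime_reports_matrix) return GemmPath::JointMatrixInt8;
    return GemmPath::OneDnnFallback;
}
```

The design choice here is to test the aspect up front instead of catching the exception, so that a stale runtime degrades to the slower path rather than failing at inference. Whether the fallback is acceptable depends on the performance gap between the two kernels, which this write-up does not measure.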

ESCAPE

Inline-dpas kernel

Replace ARK's joint_matrix int8 kernel with hand-written dpas asm (à la XE_DPAS_TT), or switch ARK to AOT (spir64_gen -device bmg). Keeps fused int8 perf, removes all runtime sensitivity.

Reproduce

# Root-cause probe (single machine, only libsycl changes):
# 1) baseline 2025.3 (FAIL): matrix aspect NO
LD_LIBRARY_PATH=/root/torch-xpu-setup/.venv/lib ONEAPI_DEVICE_SELECTOR=level_zero:gpu /root/arch_resolve
# 2) rebuild against 2026 nightly, re-run (PASS): matrix aspect YES
N=/root/sycl_nightly
$N/bin/clang++ -fsycl -O2 /root/arch_resolve.cpp -o /root/arch_resolve_nightly
ONEAPI_DEVICE_SELECTOR=level_zero:gpu /root/arch_resolve_nightly   # -> has(ext_intel_matrix)=YES

# Route B — the real ARK kernel, torch-free:
bash /root/build_harness_so8.sh && bash /root/run_harness_default.sh   # .so.8 on 2025.3.2 -> FAIL
bash /root/build_harness_so9.sh && bash /root/run_harness_2026.sh      # .so.9 on 2026.0.0 -> PASS