joint_matrix: The Debug Journey

How an ARK m>1 WOQ GEMM "no matrix hardware on the target device" crash on a Battlemage B70 was chased through three wrong turns to a single-variable root cause, and then proven on the real production kernel, torch-free.

The crash "joint_matrix is not supported" is decided by one variable only: the
libsycl (DPC++ runtime) version. Holding the GPU, kernel driver, and IGC all
byte-identical, swapping libsycl from oneAPI 2025.3.2 (.so.8)
to a 2026 build (.so.9) flips the result FAIL → PASS.
The 2025.3 runtime's compiled matrix supported_archs[] table pre-dates the
bmg_g31 entry, so it refuses to advertise ext_intel_matrix on this device
even though the silicon, driver, and IGC are all capable.
Verified on ARK's actual production kernel (IGemmDQCore via a torch-free
harness), not just a probe. The fix: ship/link a DPC++ runtime whose table includes
bmg_g31.
The ARK backend UT quantizes facebook/opt-125m to w4g128, writes shards, and repacks
to XPU — all fine. It dies inside model.generate():
auto_round_kernel/qlinear.py:242 -> ark.woqgemm(...)
auto_round_kernel/__init__.py:246 -> lib.woqgemm(...)
RuntimeError: no matrix hardware on the target device, joint_matrix is not supported
Two early facts shaped everything after:
- torch matmul (bf16/f16, which routes through oneDNN) works on this exact device. So the silicon and L0 driver expose DPAS.
- The failure is specific to joint_matrix. The crash is specifically ARK's hand-written IGemmDQCore int8 tile (M8×N16×K32, sub_group=16), reached via woq_s8 → sycl_igemm_s8s8 for m>1 / prefill.

Theory 1: "the .so was built without the matrix-MMA SPIR-V extension"

Inspecting the shipped auto_round_lib 0.13.1 wheel, the device image contained
SPV_INTEL_split_barrier but was missing
SPV_INTEL_subgroup_matrix_multiply_accumulate and SPV_INTEL_2d_block_io —
symptoms of an ARK_SYCL_TLA=OFF build. In CMakeLists.txt the matrix-MMA
extension is declared only inside SYCL_TLA_LINK_FLAGS (gated on
ARK_XPU AND ARK_SYCL_TLA), yet IGemmDQCore uses joint_matrix
unconditionally. Conclusion at the time: a TLA-off build emits joint_matrix ops without
declaring the capability → runtime can't lower them → throw.

That conclusion did not survive a direct test. The C++ joint_matrix API already emits +SPV_INTEL_joint_matrix by
default. A jm_probe built with the explicit
+SPV_INTEL_subgroup_matrix_multiply_accumulate flag is byte-identical to one built
without it — the extra flag adds a different, unused extension. Both fail identically. No build flag,
and no ARK_SYCL_TLA=ON rebuild, changes the outcome.

The error string "no matrix hardware…" lives in
libsycl.so.8 (the DPC++ runtime), not in ARK and not in the driver. That was the first
clue the real determinant was the runtime, not the build. It took two more turns to act on it.

Theory 2: the IGC version

The same wheel binary PASSES on one node and FAILS
on another. The visible difference was IGC — the JIT compiler that lowers joint_matrix
SPIR-V into Battlemage DPAS:
| | PASS node | FAIL node |
|---|---|---|
| Wheel md5 | 88c363a4… | 88c363a4… (identical) |
| libsycl | 2025.3.2.20260112 | 2025.3.2.20260112 (identical) |
| IGC | 2.34.4 | 2.32.7 |
| Device | Arc Pro B60 | Graphics 0xe223 (B70) |
| Result | PASSED | FAILED |
The conclusion — "upgrade libigc2 to ≥ 2.34" — was acted on as THE FIX.
This compared two different physical machines. Hardware (B60 vs B70) and the rest of the software stack co-varied with IGC. The correlation was real but not isolated — the B60 node also had a newer everything. IGC turned out to be a passenger, not the driver. (Theory 4 later holds IGC fixed at 2.32.7 and still flips the result.)
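The flaw in this comparison can be checked mechanically. The sketch below is an illustration, not part of the original investigation: it lists every variable that differs between the two nodes, using the values from the table (the wheel md5 is truncated in the source, so only its visible prefix is used). The names `Env`, `differing_variables`, `pass_node`, and `fail_node` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One observed environment: variable name -> value.
using Env = std::map<std::string, std::string>;

// Returns the names of all variables whose values differ between two
// environments. A comparison isolates a cause only if exactly one differs.
std::vector<std::string> differing_variables(const Env& a, const Env& b) {
    std::vector<std::string> diff;
    for (const auto& [name, value] : a) {
        auto it = b.find(name);
        if (it == b.end() || it->second != value) diff.push_back(name);
    }
    for (const auto& [name, value] : b) {
        if (a.find(name) == a.end()) diff.push_back(name);
    }
    return diff;
}

// The two nodes from the table above (md5 prefix only, as in the source).
inline Env pass_node() {
    return {{"wheel_md5", "88c363a4"},
            {"libsycl", "2025.3.2.20260112"},
            {"igc", "2.34.4"},
            {"device", "Arc Pro B60"}};
}

inline Env fail_node() {
    return {{"wheel_md5", "88c363a4"},
            {"libsycl", "2025.3.2.20260112"},
            {"igc", "2.32.7"},
            {"device", "Graphics 0xe223 (B70)"}};
}
```

For these two rows the function returns `device` and `igc`, so the table alone cannot attribute the flip to IGC.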
Theory 3: the hardware does not advertise the aspect

To isolate IGC, a single-machine container experiment held the GPU fixed and upgraded only
the user-space IGC/runtime (via --device /dev/dri passthrough):
| Environment | IGC | runtime / L0 | ext_oneapi_matrix | jm_probe |
|---|---|---|---|---|
| Host (bare metal) | 2.32.7 | 26.14 / 1.15.37833 | NO | FAIL |
| Container igc-newer | 2.34.4 | 26.18 / 1.15.38308 | NO | FAIL |
IGC 2.34.4 — the supposed "PASS" version from Theory 2 — still failed here. Since the device
reported ext_oneapi_matrix = NO in sycl-ls --verbose, the new conclusion was:
the B70 (bmg_g31) silicon/driver simply does not advertise the aspect, full stop.
This correctly killed Theory 2 (IGC isn't sufficient) but drew the wrong new conclusion. The
container only swaps user-space IGC + compute-runtime — it does not swap
libsycl. The ext_oneapi_matrix query is answered by libsycl,
which was 2025.3 in both rows. So this table actually proves "IGC doesn't matter" and
accidentally held the true culprit fixed. The "missing aspect" was real — but it was
libsycl refusing to report it, not the hardware lacking it.
Lesson: an isolation experiment must confirm that another candidate (here libsycl) didn't silently stay fixed too. "The aspect is NO" is an
observation; "the hardware can't" is an inference, and the inference was unsupported.

Theory 4: libsycl's matrix arch table is stale

Same physical GPU, same kernel driver (1.15.37833), same IGC (2.32.7). The
only thing swapped is libsycl, by rebuilding the same arch_resolve /
jm_probe sources against each runtime:
| libsycl runtime | GPU / driver | IGC | arch_is(bmg_g31) | has(ext_intel_matrix) | jm_probe |
|---|---|---|---|---|---|
| oneAPI 2025.3.2 .so.8 | 0xe223 / 1.15.37833 | 2.32.7 | YES | NO | FAIL |
| nightly 2026-06-04 .so.9 | 0xe223 / 1.15.37833 | 2.32.7 | YES | YES | OK |
Driver and IGC are byte-identical across rows. Only libsycl differs → the result flips.
On 2025.3 the runtime simultaneously reports:
ext_oneapi_architecture_is(intel_gpu_bmg_g31) -> YES
has(aspect::ext_intel_matrix) -> NO // contradiction!
But upstream device_impl.hpp defines:
// CASE(ext_intel_matrix):
return any_of(supported_archs, architecture_is); // supported_archs[] INCLUDES bmg_g31
If arch_is(bmg_g31) is YES, then matrix must be YES — unless the compiled
supported_archs[] table in that .so predates the bmg_g31
entry. The 2025.3 table does; the nightly's includes it.
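The contradiction is easier to see in a toy model. The following is a minimal sketch in plain C++, not the actual libsycl implementation: `RuntimeModel`, `stale_runtime`, and `updated_runtime` are hypothetical names, and the table entries other than `bmg_g31` are placeholders chosen for illustration. It only mirrors the `any_of(supported_archs, architecture_is)` shape quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy model of one runtime build talking to one device.
struct RuntimeModel {
    std::string device_arch;                // what the device actually is
    std::vector<std::string> matrix_archs;  // stands in for supported_archs[]

    // Architecture query: answered independently of the matrix table.
    bool architecture_is(const std::string& arch) const {
        return arch == device_arch;
    }

    // Matrix aspect query: true only if the device's architecture
    // appears in the table compiled into this runtime build.
    bool has_matrix() const {
        return std::any_of(matrix_archs.begin(), matrix_archs.end(),
                           [this](const std::string& a) { return architecture_is(a); });
    }
};

// Table that predates the bmg_g31 entry (other entries are placeholders).
inline RuntimeModel stale_runtime() {
    return {"bmg_g31", {"pvc", "bmg_g21"}};
}

// Same device, table that includes bmg_g31.
inline RuntimeModel updated_runtime() {
    return {"bmg_g31", {"pvc", "bmg_g21", "bmg_g31"}};
}
```

In this model `stale_runtime()` answers the architecture query with yes and the matrix query with no, while `updated_runtime()` answers yes to both, which is the same pattern as the two table rows.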
strings | grep can't see it

The bmg_g31
string is present in both .so files; it backs architecture_is.
Only the runtime has() query, which walks the compiled matrix table, distinguishes them.
You cannot grep your way to this; you have to run the query.

The full ARK UT links torch 2.11.0+xpu, which hard-pins
intel-sycl-rt==2025.3.2 and links libsycl.so.8. torch and ARK share one
process, so the 2026 runtime (.so.9) can't be swapped under the torch UT without also
rebuilding torch. The probe proved the mechanism, but skeptics could ask: is ARK's actual kernel
really the same path?
bestla/bestla/sycl/sycl_gemm.h — where IGemmDQCore lives — is
header-only SYCL, no oneDNN, no torch. A ~70-line main() calls the exact
launch that DnnlWrapper::sycl_igemm_s8s8 instantiates:
using T = float;
xmx::IGemmDQParam param;
param.A_d=A; param.B_d=B; param.C_d=C;
param.m=m; param.n=n; param.k=k;
param.lda=k; param.ldb=k; param.ldc=n; // same layout as sycl_igemm_s8s8
param.Bias=nullptr; param.scaleA=sA; param.scaleB=sB;
// the real production joint_matrix kernel:
Launcher<xmx::IGemmDQCfg<T>, xmx::IGemmDQCore>::run(&q, param);
| Harness build | libsycl runtime | Result |
|---|---|---|
| icpx 2025.3.3, spir64 → links .so.8 | 2025.3.2 venv .so.8 | FAIL: "no matrix hardware … joint_matrix is not supported" |
| nightly clang++ 2026-06-04 → links .so.9 | 2026.0.0 .so.9 | PASS: kernel launched & completed |
Device 0xe223 (bmg_g31) and L0 driver 1.15.37833 identical across both
rows. This is the production IGemmDQCore joint_matrix kernel, not a proxy.
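For readers unfamiliar with what such a kernel computes, here is a scalar reference sketch of an int8 GEMM followed by dequantization. It is not ARK's code. The memory layout follows the leading dimensions in the harness (lda = k, ldb = k, ldc = n), read here as A being m×k and B being n×k, both row-major. The per-row `scaleA` and per-column `scaleB` layout is an assumption, and the function name `igemm_dq_reference` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference: C[i][j] = scaleA[i] * scaleB[j] * sum_k A[i][k] * B[j][k]
// A: m x k int8 (row stride k), B: n x k int8 (row stride k), C: m x n float.
// ASSUMPTION: one scale per row of A and one per row of B; the source does
// not state the scale layout.
std::vector<float> igemm_dq_reference(const std::vector<std::int8_t>& A,
                                      const std::vector<std::int8_t>& B,
                                      const std::vector<float>& scaleA,
                                      const std::vector<float>& scaleB,
                                      std::size_t m, std::size_t n, std::size_t k) {
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            std::int32_t acc = 0;  // int8 products accumulate in int32
            for (std::size_t kk = 0; kk < k; ++kk) {
                acc += static_cast<std::int32_t>(A[i * k + kk]) *
                       static_cast<std::int32_t>(B[j * k + kk]);
            }
            // Dequantize the integer accumulator back to float.
            C[i * n + j] = scaleA[i] * scaleB[j] * static_cast<float>(acc);
        }
    }
    return C;
}
```

A reference like this is only useful for checking numerical output on small shapes. It says nothing about the tiling (M8×N16×K32) or sub-group mapping of the production kernel.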
Build notes for the harness:
- -std=c++20: the header uses C++20 dependent-type syntax (using Param = CFG::Param;).
- -DBTLA_SYCL: gates the entire sycl_gemm.h body (it's #ifdef'd out otherwise).
- -I<kernel>/bestla, include bestla/sycl/sycl_wrapper.h: pulls sycl_utils.h for nd_item_helper (including sycl_gemm.h alone misses it).

The full torch UT still needs torch itself rebuilt against .so.9, because the SONAME bump
(.so.8→.so.9) means no LD_LIBRARY_PATH drop-in works. The kernel
itself is proven fixed; only the torch packaging pin stands between here and a green UT.

A recurring confusion: "vllm also uses XMX/DPAS on the same hardware, so why does it work?" There are two ways to reach the same Battlemage DPAS unit, through different compiler stages:
- ARK: joint_matrix + joint_matrix_mad → OpJointMatrixMad
- vllm-xpu-kernels: XE_DPAS_TT<M,s8,s8,d>::fma() → asm("dpas.s8.s8.8 …")

Two independent reasons vllm-xpu-kernels is unaffected (either alone is sufficient, and it has both):
- It does not use joint_matrix anywhere. grep -rI joint_matrix over the whole tree
returns zero hits. Its matrix math is CuTe XE_DPAS_TT (inline dpas asm) or
oneDNN dnnl::matmul; neither emits the gated op. (Same reason torch bf16/f16 matmul, which
is oneDNN, already passes on the FAIL node.)
- It is compiled ahead of time with -fsycl-targets=spir64_gen -device bmg: device
code is lowered to real Battlemage ISA at build time. ARK ships generic SPIR-V
(spir64) and JIT-lowers on the deploy node, which is exactly why the deploy node's runtime
became the deciding factor.

Ship/link a DPC++ libsycl whose matrix supported_archs[] includes
bmg_g31 (intel/llvm ≥ 2026-06-04, or a oneAPI release carrying that entry). No driver,
IGC, or hardware change needed.
Caveat: SONAME bumps .so.8→.so.9; the consumer (torch-xpu / ARK
.so) must be built against it — no LD_LIBRARY_PATH drop-in.
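Why the SONAME bump rules out a drop-in can be shown with a simplified model of library lookup. This is a sketch, not the real dynamic loader: it assumes only that a binary records the exact file name it needs (for example `libsycl.so.8`) and that the search path is scanned for a file with that exact name. `Dir`, `resolve`, and the directory paths are hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

// A directory on the library search path: its path plus the file names in it.
struct Dir {
    std::string path;
    std::set<std::string> files;
};

// Simplified lookup for one needed library name: scan the directories in
// order and return the first exact file-name match, or nothing.
std::optional<std::string> resolve(const std::string& needed,
                                   const std::vector<Dir>& search_path) {
    for (const auto& d : search_path) {
        if (d.files.count(needed) != 0) return d.path + "/" + needed;
    }
    return std::nullopt;
}

// Hypothetical search path with the newer runtime listed first.
inline std::vector<Dir> example_search_path() {
    return {{"/opt/nightly/lib", {"libsycl.so.9"}},
            {"/venv/lib", {"libsycl.so.8"}}};
}
```

With this path, a consumer that needs `libsycl.so.8` still resolves to the `.so.8` file even though the newer directory comes first, and it finds nothing at all if only the `.so.9` directory is present. The consumer has to be relinked so that it asks for `.so.9`.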

Alternatively, send w4 m>1 through the oneDNN fp GEMM instead of the fused int8 joint_matrix kernel.
oneDNN does its own DPAS codegen and never touches the gated op.

A third option: replace ARK's joint_matrix int8 kernel with hand-written dpas asm (à la
XE_DPAS_TT), or switch ARK to AOT (spir64_gen -device bmg). This keeps fused int8
perf and removes all runtime sensitivity.
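The oneDNN fallback could be expressed as a small dispatch rule. The sketch below is one possible policy, not ARK's code: it falls back only when the runtime does not report the matrix aspect, which is a variation on the unconditional rerouting described above. `GemmPath`, `choose_woq_path`, and the `ExistingM1Path` placeholder for whatever ARK already does at m = 1 are all hypothetical.

```cpp
#include <cassert>

// Hypothetical set of code paths for a w4 weight-only-quantized GEMM.
enum class GemmPath {
    ExistingM1Path,        // placeholder: whatever is used today for m <= 1
    FusedInt8JointMatrix,  // the joint_matrix int8 kernel (m > 1 / prefill)
    OneDnnFpGemm           // fallback: oneDNN floating-point GEMM
};

// Picks a path from the batch dimension m and the result of the runtime's
// matrix aspect query. ASSUMPTION: the caller has already queried the aspect.
constexpr GemmPath choose_woq_path(int m, bool has_matrix_aspect) {
    if (m <= 1) return GemmPath::ExistingM1Path;
    if (!has_matrix_aspect) return GemmPath::OneDnnFpGemm;
    return GemmPath::FusedInt8JointMatrix;
}

// Compile-time check of the three cases.
static_assert(choose_woq_path(1, false) == GemmPath::ExistingM1Path, "m=1 unchanged");
static_assert(choose_woq_path(8, false) == GemmPath::OneDnnFpGemm, "fallback");
static_assert(choose_woq_path(8, true) == GemmPath::FusedInt8JointMatrix, "fused path");
```

Gating on the aspect keeps the fused int8 path on runtimes that do report it, at the cost of two code paths to maintain.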
# Root-cause probe (single machine, only libsycl changes):
# 1) baseline 2025.3 (FAIL): matrix aspect NO
LD_LIBRARY_PATH=/root/torch-xpu-setup/.venv/lib ONEAPI_DEVICE_SELECTOR=level_zero:gpu /root/arch_resolve
# 2) rebuild against 2026 nightly, re-run (PASS): matrix aspect YES
N=/root/sycl_nightly
$N/bin/clang++ -fsycl -O2 /root/arch_resolve.cpp -o /root/arch_resolve_nightly
ONEAPI_DEVICE_SELECTOR=level_zero:gpu /root/arch_resolve_nightly # -> has(ext_intel_matrix)=YES
# Route B — the real ARK kernel, torch-free:
bash /root/build_harness_so8.sh && bash /root/run_harness_default.sh # .so.8 on 2025.3.2 -> FAIL
bash /root/build_harness_so9.sh && bash /root/run_harness_2026.sh # .so.9 on 2026.0.0 -> PASS