xLLM Ascend TileLang Kernel Development Guide
This document explains how to add or modify an Ascend TileLang kernel in xLLM. The examples use the current rope kernel throughout.
Relevant directories:
- Python kernel definitions:
xllm/xllm/compiler/tilelang/targets/ascend/kernels - NPU runtime wrappers:
xllm/xllm/core/kernels/npu/tilelang
Builds and tests should be run inside the NPU container.
1. First Decide What Kind of Change You Are Making
Section titled “1. First Decide What Kind of Change You Are Making”- Add a
specialization- Add one more compiled parameter combination to an existing kernel
- Reuse the same wrapper, the same runtime dispatch fields, and the same C ABI
- Typical changes are updates to
DISPATCH_SCHEMAorSPECIALIZATIONS
- Add a
kernel- Add a new logical operator
- Typical changes are a new Python kernel file, a new wrapper C++ file, and one CMake registration
For rope:
- Adding one more item like
{"variant_key": "...", "head_dim": ..., "rope_dim": ..., "dtype": ...}toSPECIALIZATIONSmeans adding a newspecialization - Adding a new external interface such as
xxx_wrapper.cppmeans adding a newkernel
2. Development Order
Section titled “2. Development Order”The recommended order is:
- Implement the TileLang kernel in a Python file such as
rope.py - Implement
generate_source(...)to lower the kernel into Ascend-C source - Declare
DISPATCH_SCHEMAandSPECIALIZATIONS - Generate
registry.inconce and inspect it - Then write or update the runtime specialization construction logic in the wrapper
- Wire it into CMake and run tests
The key idea behind this order is:
- implement the kernel itself first
- then fix the runtime dispatch schema
- then write the wrapper against the generated
registry.inc
3. Write the Python Kernel
Section titled “3. Write the Python Kernel”Using rope.py as the example, the Python side can be understood in three layers:
build_rope_kernel(...): kernel implementationgenerate_source(...): AOT exportRopeKernel: kernel registration plus dispatch schema and compiled instance declaration
3.1 Implement build_rope_kernel(...)
Section titled “3.1 Implement build_rope_kernel(...)”build_rope_kernel(...) is the actual TileLang kernel implementation. This is where you write:
@T.prim_func- input and output tensor shapes
- parallel task organization under
with T.Kernel(...) - UB allocation and the actual compute logic
The simplified structure in rope.py looks like this:
def build_rope_kernel( head_dim: int, rope_dim: int, vec_core_num: int, ub_buffer_bytes: int,): task_num = vec_core_num m_num = vec_core_num // 2
@T.prim_func def rope_in_place_kernel(...): with T.Kernel(m_num, is_npu=True) as (cid, vid): task_id = cid * 2 + vid ...
return rope_in_place_kernelHere, head_dim and rope_dim are the compile-time parameters that this implementation actually depends on.
Vector kernels such as rope must also follow the fixed-task convention used by the current AOT path. In the current AOT flow, the kernel launch block_num is fixed at compile time, which means:
- runtime input shapes do not change the kernel launch
block_num - runtime input shapes only change workload splitting across the fixed tasks
The convention in the current rope.py is:
task_num = vec_core_numm_num = vec_core_num // 2
with T.Kernel(m_num, is_npu=True) as (cid, vid): task_id = cid * 2 + vidThis means:
cidranges over[0, vec_core_num // 2)vidranges over[0, 2)- the total task count is fixed as
task_num = vec_core_num
As a result, rope.py also derives the compile-time token count for one specialization using the fixed task count:
max_rows_num_in_ub = _derive_max_rows_num_in_ub(...)compile_num_tokens = task_num * max_rows_num_in_ub3.2 Implement generate_source(...)
Section titled “3.2 Implement generate_source(...)”generate_source(...) lowers the TileLang kernel above into the final source code. The export layer takes one specialization and turns it into compilable Ascend-C source.
For rope, the core logic is:
@staticmethoddef generate_source(head_dim: int, rope_dim: int, dtype: str) -> str: vec_core_num = detect_vec_core_num() tilelang_kernel = build_rope_kernel( head_dim=head_dim, rope_dim=rope_dim, vec_core_num=vec_core_num, ub_buffer_bytes=FIXED_UB_BUFFER_BYTES, ) with tilelang.tvm.transform.PassContext(...): kernel = tilelang.engine.lower(tilelang_kernel) return kernel.kernel_sourceThe rules here are:
- the inputs to
generate_source(...)come from the currentSPECIALIZATIONSentry generate_source(...)callsbuild_rope_kernel(...)- the return value is the lowered source string
3.3 Declare DISPATCH_SCHEMA and SPECIALIZATIONS
Section titled “3.3 Declare DISPATCH_SCHEMA and SPECIALIZATIONS”After the kernel implementation and export layer are done, use an @register_kernel class to attach the kernel to the framework.
The current minimal template in rope.py is:
from ....common.spec import DispatchField, TilelangKernel, register_kernel
@register_kernelclass RopeKernel(TilelangKernel): DISPATCH_SCHEMA = [ DispatchField("head_dim", "int32"), DispatchField("rope_dim", "int32"), DispatchField("dtype", "dtype"), ] SPECIALIZATIONS = [ { "variant_key": "hd128_rd128_bf16", "head_dim": 128, "rope_dim": 128, "dtype": "bf16", }, { "variant_key": "hd576_rd64_bf16", "head_dim": 576, "rope_dim": 64, "dtype": "bf16", }, ]
@staticmethod def generate_source(head_dim: int, rope_dim: int, dtype: str) -> str: ...There are two concepts to distinguish here:
DISPATCH_SCHEMA- defines the field names, order, and types of the runtime specialization
- is the single source of truth for the C++ specialization struct, builder, and lookup interface
SPECIALIZATIONS- represents the set of instances that will actually be compiled
- each item corresponds to one variant
The rules are:
- every field in
DISPATCH_SCHEMAmust appear in everySPECIALIZATIONSitem SPECIALIZATIONSmay contain extra fields; those fields are passed intogenerate_source(...), but do not enter the runtime dispatch schemavariant_keyis the unique identifier for that specializationDISPATCH_SCHEMAandSPECIALIZATIONSmust match the runtime specialization one-to-one
For rope, the runtime dispatch dimensions are:
head_dimrope_dimdtype
So these three fields must appear in both:
DISPATCH_SCHEMA- every
SPECIALIZATIONSitem
At build time, Ascend build resolves the actual bisheng_arch from the --device a2|a3 value passed by the main build path.
3.4 Inspect the Generated Ascend-C Source
Section titled “3.4 Inspect the Generated Ascend-C Source”When debugging the implementation details in build_rope_kernel(...), or comparing how different kernel styles affect the final code generation, use the common compile-kernels entry to regenerate artifacts and inspect the Ascend-C source for the specialization you care about.
For rope, you can fix:
head_dim=576rope_dim=64dtype=bf16
Then regenerate the rope artifacts:
python xllm/compiler/tilelang_launcher.py compile-kernels \ --target ascend \ --device a3 \ --output-root /tmp/tilelang_debug \ --kernels rope \ --forceIt is recommended to keep --force so the source and object files are regenerated from the current code instead of reusing an old cache hit.
This command uses an isolated debug output directory, /tmp/tilelang_debug, so only the debug artifacts for rope are generated there and they do not get mixed with artifacts from other kernels in the main build directory.
After that, you can directly inspect the generated source for the specialization, including the entry function, UB allocation, and vector compute logic:
sed -n '1,200p' \ /tmp/tilelang_debug/targets/ascend/rope/hd576_rd64_bf16/rope_hd576_rd64_bf16_kernel.cpp
rg -n 'extern "C"|__global__|alloc_ub|alloc_shared|g_tilingKey' \ /tmp/tilelang_debug/targets/ascend/rope/hd576_rd64_bf16/rope_hd576_rd64_bf16_kernel.cppTo compare two kernel implementations, keep the specialization fixed, run compile-kernels --force before and after the change, then diff the generated .cpp file:
cp /tmp/tilelang_debug/targets/ascend/rope/hd576_rd64_bf16/rope_hd576_rd64_bf16_kernel.cpp \ /tmp/rope_before.cpp
diff -u /tmp/rope_before.cpp \ /tmp/tilelang_debug/targets/ascend/rope/hd576_rd64_bf16/rope_hd576_rd64_bf16_kernel.cppThis helps isolate specialization changes from kernel implementation changes.
After generation, the main files to inspect are:
/tmp/tilelang_debug/targets/ascend/rope/hd576_rd64_bf16/rope_hd576_rd64_bf16_kernel.cpp/tmp/tilelang_debug/targets/ascend/rope/registry.inc/tmp/tilelang_debug/targets/ascend/rope/manifest.json
These correspond to:
- the final Ascend-C source for one specialization
- the runtime dispatch interface directly included by the wrapper
- the full compiled artifact record for the current kernel
The recommended debugging sequence is:
- run
compile-kernels --forceto regenerate the current kernel artifacts - inspect the
.cppfor the specialization and analyze the code generation result - inspect
registry.incandmanifest.jsonto confirm they match expectations - finally run
rope_wrapper_testto check end-to-end behavior and performance
4. Update the Wrapper
Section titled “4. Update the Wrapper”When adding a new kernel, you need a new wrapper. When adding a new specialization, the wrapper only needs an update if the runtime specialization semantics change.
For rope_wrapper.cpp, the manually written parts should remain:
- tensor shape, dtype, and layout validation
- reshaping inputs into
x_rows / sin_rows / cos_rows - constructing the runtime specialization from tensors
- assembling launch arguments and calling
entry->fn(...)
4.1 What registry.inc Generates Automatically
Section titled “4.1 What registry.inc Generates Automatically”registry.inc is generated automatically from the Python-side DISPATCH_SCHEMA, SPECIALIZATIONS, and the exported Ascend-C ABI.
For rope, the generated content includes:
RopeSpecializationRopeHeadDimRopeRopeDimRopeDTypeRopeKernelFnmake_rope_specialization(...)find_rope_kernel_entry(...)available_rope_variant_keys()
For rope_wrapper.cpp, registry.inc directly provides dispatch-related definitions such as RopeSpecialization, operator==(...), and RopeKernelFn. Dtype conversion uses the shared helper to_tilelang_dtype(...).
4.2 What the Wrapper Actually Needs to Write
Section titled “4.2 What the Wrapper Actually Needs to Write”The most important handwritten logic in rope_wrapper.cpp is constructing the runtime specialization from the tensors. The current code looks like this:
RopeSpecialization build_runtime_specialization(const torch::Tensor& x_rows) { return make_rope_specialization( RopeHeadDim{static_cast<int32_t>(x_rows.stride(0))}, RopeRopeDim{static_cast<int32_t>(x_rows.size(1))}, RopeDType{to_tilelang_dtype(x_rows.scalar_type())});}For rope:
head_dimmaps tox_rows.stride(0), which is thex_strideused by the kernelrope_dimmaps tox_rows.size(1)dtypemaps tox_rows.scalar_type()
The runtime path is:
- the wrapper reshapes the inputs into
x_rows / sin_rows / cos_rows build_runtime_specialization(...)constructs a specialization fromx_rowsfind_rope_kernel_entry(...)performs an exact match in the static registry- after a match,
entry->fn(...)calls the actual compiled symbol
The current lookup strategy is a linear scan with exact matching. If any of head_dim, rope_dim, or dtype differs, the lookup will miss.
So when you add a new specialization, the main things to cross-check are:
- the field semantics in Python-side
DISPATCH_SCHEMA - the field values in Python-side
SPECIALIZATIONS - the field values constructed by
build_runtime_specialization(...)in the wrapper
All three must match exactly.
4.3 Generate and Inspect registry.inc First
Section titled “4.3 Generate and Inspect registry.inc First”Before writing or modifying wrapper code, generate registry.inc once and inspect it. Focus on:
- whether the generated field order in
RopeSpecializationmatches expectations - whether the generated wrapped field type names match expectations
- the parameter order of
make_rope_specialization(...) - the generated entry symbol names
registry.inc is the direct contract for the wrapper. Inspect it first, then write the wrapper against it.
5. Update CMake
Section titled “5. Update CMake”When adding a new kernel, register it in xllm/xllm/core/kernels/npu/tilelang/CMakeLists.txt.
CMake registration is unified through the high-level helper:
tilelang_register_runtime_kernel(NAME <kernel> WRAPPER_SRCS <srcs...>)
Using rope as the example, the minimal template is:
tilelang_register_runtime_kernel( NAME rope WRAPPER_SRCS rope_wrapper.cpp)This helper will:
- derive the manifest path as
TILELANG_GENERATED_ROOT/targets/ascend/<kernel>/manifest.json - import the manifest
- add the wrapper source and compiled objects into
tilelang_kernels - append the
XLLM_TL_<KERNEL>_REGISTRY_INC=...compile definition automatically
So when adding a new runtime kernel, the CMake-side work mainly consists of two things:
- make sure the Python side can already generate the manifest for that kernel
- add one
tilelang_register_runtime_kernel(...)entry in the TileLang CMakeLists
For day-to-day kernel additions, add one tilelang_register_runtime_kernel(...) line directly in CMake. tilelang_import_kernel_manifest(...) stays underneath as the implementation base for that higher-level helper.
6. Validate
Section titled “6. Validate”The recommended validation order is:
- compile the TileLang kernel and inspect the generated
registry.inc - then run the full wrapper test
Common commands:
python xllm/compiler/tilelang_launcher.py compile-kernels \ --target ascend \ --device a3 \ --output-root build/cmake.linux-aarch64-cpython-311/xllm/compiler/tilelang \ --kernels rope
python setup.py test --test-name rope_wrapper_test --device a3The first command generates manifest.json, registry.inc, and the object files. The second command validates the full integration path.