A tiny, hard CUDA kernel laboratory for AI hot-path operators.
KernelLab provides hand-written CUDA C++ kernels for the most critical operators in LLM inference and training. All kernels expose a pure C ABI, making them callable from Rust, Python, C, or any language with FFI support.
The focus is operator execution authority: control over kernel selection, memory access patterns, and numerical precision.

| Kernel | Description |
|---|---|
| `rmsnorm` | Root-mean-square normalization (LLM default) |
| `rope` | Rotary position embedding |
| `softmax` | Online safe softmax with optional mask |
| `silu` | SiLU / SwiGLU activation |
| `vec_add` | Element-wise vector addition (residual) |
| `kv_copy` | KV cache write |
| `quant_dequant` | INT8/FP8 quantization utilities |

C API:

    int ak_rmsnorm_f16(void* out, const void* x, const void* weight,
                       int B, int T, int D, float eps, void* stream);
    int ak_rope_f16(void* out, const void* x, const void* cos, const void* sin,
                    int B, int T, int D, void* stream);
    int ak_softmax_f16(void* out, const void* x, const void* mask,
                       int B, int H, int T, int D, void* stream);
    int ak_silu_f16(void* out, const void* x, int N, void* stream);
    int ak_kv_copy_f16(void* out, const void* x,
                       int B, int H, int T, int D, void* stream);

Build:

    mkdir build && cd build
    cmake .. -G Ninja
    ninja

Test:

    cd build && ctest

| Layer | Choice |
|---|---|
| Language | C11 + CUDA C++ |
| Build | CMake + Ninja |
| Test | CTest + cuda-memcheck |
| Benchmark | CUDA events |

Repository layout:

    apeinx-kernels/
    ├── include/akernel.h   # Public C ABI
    ├── src/                # CPU reference implementations
    ├── cuda/               # CUDA GPU kernel implementations
    ├── bench/              # Micro-benchmarks
    ├── tests/              # Correctness tests (CPU vs CUDA)
    └── CMakeLists.txt

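
The `tests/` directory compares CPU and CUDA results. Since an f16 kernel and a CPU reference rarely agree bit for bit, such a comparison typically needs a tolerance. The helper below is a sketch of one common approach, a combined absolute and relative bound. The name `first_mismatch` and its parameters are hypothetical and not taken from the repository.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical test helper, not taken from the repository.
 * Returns the index of the first element where `got` deviates from `ref`
 * by more than atol + rtol * |ref[i]|, or -1 if all n elements match. */
static long first_mismatch(const float *ref, const float *got, size_t n,
                           float atol, float rtol)
{
    for (size_t i = 0; i < n; ++i) {
        float diff = fabsf(got[i] - ref[i]);
        float tol = atol + rtol * fabsf(ref[i]);
        /* The negated comparison also flags NaN, since NaN <= tol is false. */
        if (!(diff <= tol))
            return (long)i;
    }
    return -1;
}
```

Half precision has a machine epsilon of about 9.8e-4, so tolerances around `1e-3` are a plausible starting point for f16 outputs converted to float. The right values depend on the operator and the reduction length.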

Position in the stack:

    KernelLab (C ABI / libakernel.so)
            ↑
    ApeinxRT-Core (Rust, calls via FFI)
            ↑
    Apeinx-IR (compiles .apxir → plan.json consumed by RT-Core)
            ↑
    ApexTrain-Core (consumes trace.jsonl from RT-Core)

TBD