Stars
AI agents that automatically run research on single-GPU nanochat training
A lightweight alternative to OpenClaw that runs in containers for security. Connects to WhatsApp, Telegram, Slack, Discord, Gmail, and other messaging apps, has memory, scheduled jobs, and runs dir…
Solutions to Reinforcement Learning: An Introduction
🤖FFPA: Extends FlashAttention-2 with Split-D and ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
LongAttn: Selecting Long-context Training Data via Token-level Attention
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Flash Attention in 300-500 lines of CUDA/C++
Minimal yet performant LLM examples in pure JAX
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.
Code release for paper "Test-Time Training Done Right"
An efficient implementation of the NSA (Native Sparse Attention) kernel
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
[ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight)
The official implementation for [NeurIPS 2025 Oral] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training
Code for ICLR 2025 Paper "What is Wrong with Perplexity for Long-context Language Modeling?"
Awesome LLM pre-training resources, including data, frameworks, and methods.