This repository describes and implements a self-contained Large Language Model (LLM) in Rust, consisting of a compact Transformer architecture, a Byte Pair Encoding (BPE) tokenizer, a simple training loop for pretraining and instruction tuning, and a robust checkpoint format that consistently saves and loads both the tokenizer and model parameters.
The project addresses the practical necessity of developing a complete LLM as a closed system in which data pipeline, tokenization, model, optimization, inference, and persistence are integrated into a reproducible workflow. It deliberately prioritizes a transparent implementation suitable for expert analysis, debugging, and incremental extension.
Implemented features:
- AdamW
- Temperature, top-k, top-p
- Multi-head attention
- RMSNorm
- residual dropout
- Project overview
- Architecture and components
- Tokenizer (BPE) and determinism
- Training (pretraining and instruction tuning)
- Inference (greedy decoding)
- Checkpoints (save and load with rebuild)
- Data formats
- Build and run
- Security and robustness
- Roadmap toward a complete LLM
- License and contact
The repository implements a compact Transformer pipeline comprising embeddings, multiple Transformer blocks (self-attention, feed-forward, layer normalization), and an output projection onto the token vocabulary. Tokenization is performed via a BPE tokenizer that is persisted in the checkpoint to ensure consistent vocabulary sizes and, consequently, consistent parameter shapes for the output projection.
A central feature is checkpoint loading with rebuild: when loading, the model is rebuilt based on the tokenizer vocabulary stored in the checkpoint in order to avoid shape mismatches that would otherwise arise as soon as the vocabulary size differs between training and inference.
The implementation is consolidated into a small number of modules and emphasizes self-contained executability.
main.rs- CLI with menu loop
- Train, Save, Load, Ask
- Initial tokenizer training for immediate usability
layer.rs- Core of the model (Layer trait, Embeddings, Self Attention, Feed Forward, LayerNorm, TransformerBlock, OutputProjection)
- Optimizer (Adam)
- Llm with Train, Predict, Save, and Load
- Checkpoint structure
LlmCheckpoint
tokenizer.rs- BPE training, encoding, decoding
- Deterministic training logic and reproducible configuration
- Tokenizer checkpoint structure
BpeTokenizerCheckpoint
train.rs- Dataset loader for JSON and CSV
utils.rs- ASCII normalization and utility functions
- JSON serialization and atomic file writing
math.rs- Softmax, cross-entropy, gradient computation, gradient clipping
The default configuration (as of the current state) uses the following hyperparameters:
MAX_SEQ_LEN = 80EMBEDDING_DIM = 128HIDDEN_DIM = 256- 3 Transformer blocks
- Output projection dimension:
vocab_sizefrom the tokenizer vocabulary
The BPE tokenizer is trained from the corpus and uses an ASCII-focused preprocessing pipeline that segments whitespace and conservatively separates punctuation to obtain a robust token structure for simple training data.
Reproducibility is supported by a configuration object BpeTokenizerConfig, which is persisted in the checkpoint. This enables subsequent analysis of the tokenizer state and its training parameters—an important requirement for evaluation, debugging, and A/B comparisons (cf. Goodfellow, Bengio, & Courville, 2016).
The project distinguishes two training phases:
- Pretraining on generic texts
- Instruction tuning on chat or dialogue data
Both phases use the same training loop: the model is trained autoregressively for next-token prediction by deriving input tokens and target tokens from a token sequence shifted by one position. The loss is computed via cross-entropy, and gradients propagate backward through the layers via backpropagation.
Gradient clipping is applied to reduce numerical instability, particularly for small models and non-optimized initializations; in practice, this measure often contributes to stabilization (Pascanu, Mikolov, & Bengio, 2013).
Inference uses greedy decoding by computing softmax probabilities for the last token in the sequence and selecting the maximum until an EOS token is produced or MAX_SEQ_LEN is reached.
This strategy is intentionally simple to ensure system completeness and traceability, although in production scenarios sampling methods such as top-k or nucleus sampling and temperature scaling typically yield better text quality (Holtzman et al., 2020).
Because the output projection matrix has shape [embedding_dim, vocab_size], vocab_size directly depends on the tokenizer vocabulary size. Consequently, a different tokenizer at load time inevitably leads to parameter mismatches.
Therefore, the following procedure is used during loading via Llm::load_checkpoint_rebuild:
- Load and validate the checkpoint JSON
- Reconstruct the tokenizer from the checkpoint
- Rebuild the model (embeddings and output projection with
vocab_sizefrom the checkpoint) - Apply the parameter vector to the newly created layers
When saving, an atomic write strategy is used (temporary file, then rename) to prevent inconsistent checkpoints in the event of process termination or system issues—particularly relevant for long training runs and repeated saves.
The current dataset logic expects, for JSON, a list of strings.
Example pretraining_data.json:
json [ "Some training text.", "Another example sentence." ]
Example chat_training_data.json:
json [ "User: Hello Assistant: Hello, how can I help?", "User: Explain transformers. Assistant: ..." ]
CSV is read without a header, and each line is concatenated into a string. This format should be understood more as a simple adapter than as a semantically structured chat representation.
- Rust stable toolchain
- Cargo
bash cargo build --release
bash cargo run --release
The menu currently provides the following commands:
tTrain (pretraining and instruction tuning)sSave checkpointlLoad checkpoint (rebuild)aAsk (enter a prompt, generate an answer)eExit
The project implements validations and defensive defaults in several places, including:
- Checkpoint validation via magic value and version
- Parameter length checks when loading
- Learning rate validation
- Sequence length limiting
- Atomic checkpoint saving
- ASCII normalization to reduce Unicode edge cases
At the same time, it should be noted that the system is intended as a research and development foundation rather than a production-ready LLM runtime, especially since sampling, efficient batching logic, mixed precision, KV cache, and formal test suites are not yet fully implemented.
A complete LLM in terms of scalability, robust evaluation paths, and production-grade inference typically requires several technical extensions, which may be considered next steps for this repository:
- Tokenizer extensions
- Support for Unicode normalization and controlled byte fallbacks
- Consistent special tokens and clear prompt templates for chat
- Training infrastructure
- Mini-batches, shuffling, gradient accumulation
- Validation split, early stopping, and metrics
- Mixed precision and more stable initializations
- Inference quality
- Repetition penalty and stopping criteria
- KV cache for efficient autoregression
- Model architecture
- Positional embeddings or RoPE
- Attention mask handling for padding and batches
- Persistence and compatibility
- Explicit compatibility rules between checkpoint versions
- Optional sharding of large parameter vectors
- Testability and verification
- Unit tests for tokenizer, softmax stability, checkpoint roundtrip
- Golden tests for deterministic runs