Fine-tune LLMs to enhance coding capabilities using Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO). Includes a blazing-fast Python sandbox for safely running model-generated code.
A model trained from this repository using only 1,000 examples from the OpenCoder dataset achieved a 49.1% improvement in coding performance on the MBPP benchmark while maintaining general capabilities.
✨ Try out the trained model, explore the metrics during training, or analyze the training artifacts.
Small language models (SLMs) are the key to fast, local coding agents, but they often struggle with complex programming tasks. Liquid AI's LFM2.5-1.2B-Instruct is exceptionally fast and efficient, but not optimized for coding out of the box.
LFM-Coder bridges this gap using RLVR. By training lightweight LoRA adapters (~22M parameters) with Hugging Face TRL, we provide the model with a high-fidelity execution environment to learn from real-time, verifiable feedback. This approach significantly enhances coding performance while maintaining the model's tiny footprint and general capabilities.
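To make the "group relative" part of GRPO concrete, the sketch below takes the rewards of several completions sampled for one prompt and normalizes each against the group's mean and standard deviation. It is a minimal illustration, not this repository's training code: the function name `group_relative_advantages`, the `eps` value, and the choice of population standard deviation are assumptions, and TRL's implementation may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Completions that beat the group average get a positive advantage,
    those below it get a negative one. `eps` avoids division by zero
    when every completion in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, each scored by the fraction of tests passed.
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])  # [1.41, 0.0, -1.41, 0.0]
```

Because the baseline comes from the group itself, no separate value model is needed, which is part of what keeps this style of training feasible on small hardware.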
This repository goes beyond basic fine-tuning by implementing a production-grade RLVR environment and training pipeline:
- Dual-Engine Architecture: Automatically routes code between a blazing-fast Rust-based Python interpreter (Monty) and full-featured Docker/Podman containers.
- Massive Concurrency: Threaded execution across all CPU cores for both engines, enabling high-throughput reward computation essential for GRPO.
- Smart Dependency Management: Packages are installed dynamically based on code requirements. Local caching ensures subsequent runs load instantaneously and can run without network access.
- Enterprise-Grade Isolation: Configurable resource guards (CPU/memory), execution timeouts, and network isolation to ensure secure execution of model-generated code.
- Asynchronous Pipelining: Overlaps GPU completion generation with CPU-based code verification to maximize hardware utilization and minimize idle time.
- Optimized RLVR Pipeline: Leverages QLoRA (4-bit) and Liger kernels to enable advanced GRPO training on consumer hardware (8GB VRAM).
- Fault-Tolerant Workflows: Robust state management with automatic resumption for both training and evaluation cycles.
- Benchmark Sanitization: Identifies and repairs incorrect test cases in standard benchmarks (HumanEvalPlus/MBPPPlus) to ensure rigorous evaluation.
- Automated Validation: Verifies all training examples against provided solutions to guarantee data quality before RLVR begins.
- Granular Metrics: Heuristic-driven extraction that calculates per-test-case pass rates and provides detailed logs for model weakness analysis.
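The per-test-case pass rate and threaded reward computation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: `run_test`, `pass_rate`, and `score_batch` are hypothetical names, tests are assumed to be plain `assert` statements, and a bare subprocess with a timeout stands in for the real sandbox (it provides no CPU, memory, or network isolation).

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_test(code: str, test: str, timeout_s: float = 5.0) -> bool:
    """Run one assert-style test against candidate code in a fresh interpreter."""
    program = f"{code}\n{test}\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", program],  # -I: isolated mode
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hang as a failed test
    return proc.returncode == 0


def pass_rate(code: str, tests: list[str], timeout_s: float = 5.0) -> float:
    """Fraction of tests that pass; usable directly as a verifiable reward."""
    if not tests:
        return 0.0
    passed = sum(run_test(code, t, timeout_s) for t in tests)
    return passed / len(tests)


def score_batch(completions: list[str], tests: list[str]) -> list[float]:
    """Score many completions concurrently; threads suffice because the
    work happens in child processes, not under the GIL."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: pass_rate(c, tests), completions))


completions = [
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a - b",  # passes only the 0 + 0 case
    "def add(a, b) return a + b",        # syntax error
]
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
rewards = score_batch(completions, tests)
print(rewards)  # [1.0, 0.5, 0.0]
```

A fractional reward like this gives partially correct completions some credit, so completions within a group are easier to tell apart than with an all-or-nothing score.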
- Hardware: Single GPU with 8GB VRAM (e.g., RTX 4060).
- Tooling: uv installed.
git clone https://github.com/rparkr/lfm-coder.git && cd lfm-coder
export HF_TOKEN="your-hf-token"
Update training_config.toml with your model_id and output_dir.
# Dry run to verify configuration
uv run lfm-coder --dry-run
# Start full training
uv run lfm-coder
You can use the high-performance sandbox in your own projects for safe execution of LLM-generated code.
uv add lfm-coder # or pip install lfm-coder
The Sandbox class automatically routes code between Monty (fast) and Docker (full support).
from lfm_coder.sandbox import Sandbox
sandbox = Sandbox()
# Batch execution (parallel)
results = sandbox.run(["1+1", "import math; math.sqrt(16)", "print('Hello')"])
for r in results:
    print(f"Stdout: {r.stdout} | Result: {r.result}")
code = """
import httpx # Requires Docker fallback
r = httpx.get('https://example.com')
print(r.status_code)
"""
result = sandbox.run(code)
- Dual Sandboxes: MontySandbox + DockerSandbox with auto-routing.
- Data Pipeline: Automated sampling, verification, and repair of benchmarks.
- RLVR Training: GRPO integration with TRL and GPU optimizations.
- Evaluation: Scoring module with GPU/CPU pipelining.
- Ollama support: Fix chat template in fine-tuned GGUF model for multi-turn chat.
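One way the auto-routing mentioned in the list above could work is to inspect a snippet's imports before running it. The sketch below is a guess at the general idea, not the actual routing logic: `FAST_PATH_MODULES`, `imported_modules`, and `choose_engine` are hypothetical names, and the allowlist does not reflect what Monty really supports.

```python
import ast

# Hypothetical allowlist of modules the fast interpreter is assumed to handle.
FAST_PATH_MODULES = frozenset({"math", "json", "re", "itertools", "collections"})


def imported_modules(code: str) -> set[str]:
    """Collect the top-level names of all absolute imports in a snippet."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return modules


def choose_engine(code: str) -> str:
    """Return "monty" for the fast path or "docker" for the container fallback."""
    try:
        modules = imported_modules(code)
    except SyntaxError:
        # Assumption: invalid code fails on either engine, so fail cheaply.
        return "monty"
    return "monty" if modules <= FAST_PATH_MODULES else "docker"


print(choose_engine("import math; math.sqrt(16)"))          # monty
print(choose_engine("import httpx\nhttpx.get('https://example.com')"))  # docker
```

Static inspection like this misses dynamic imports (for example via `importlib`), so a real router would likely also fall back to the container when the fast engine reports an unsupported feature at runtime.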
| Metric | Monty Sandbox (Rust) | Docker Sandbox (Container) |
|---|---|---|
| Execution Count | 18,556 (77.3%) | 5,444 (22.7%) |
| Avg. Speed | 1.01 ms | 2,577 ms |
| Median Speed | 0.4 ms | 2,240 ms |
| Success Rate | 69.8% | 35.8% |
| Throughput | ~1,000 exec/sec | ~0.4 exec/sec |
Monty execution is roughly 2,500x (by mean latency) to 5,600x (by median latency) faster than the Docker fallback, providing the massive throughput required for efficient RLVR training.
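The speedup can be checked directly from the table. The short calculation below uses only the figures reported there; the "blended" number is a back-of-the-envelope serial average that ignores concurrency, added here for illustration.

```python
# Figures copied from the benchmark table above.
monty_count, docker_count = 18_556, 5_444
monty_mean_ms, docker_mean_ms = 1.01, 2_577.0
monty_median_ms, docker_median_ms = 0.4, 2_240.0

mean_speedup = docker_mean_ms / monty_mean_ms        # roughly 2,551x
median_speedup = docker_median_ms / monty_median_ms  # roughly 5,600x

# Average latency per execution if the runs were done one at a time.
total = monty_count + docker_count
monty_total_ms = monty_count * monty_mean_ms
docker_total_ms = docker_count * docker_mean_ms
blended_ms = (monty_total_ms + docker_total_ms) / total
docker_time_share = docker_total_ms / (monty_total_ms + docker_total_ms)

print(f"Mean speedup:   {mean_speedup:,.0f}x")
print(f"Median speedup: {median_speedup:,.0f}x")
print(f"Blended latency: {blended_ms:,.0f} ms per execution")
print(f"Share of serial time spent in Docker: {docker_time_share:.1%}")
```

Even though only about 23% of executions fall back to Docker, those runs account for nearly all of the serial execution time, which is why keeping as much code as possible on the fast path matters for reward throughput.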
- pydantic-monty for the lightning-fast Python sandbox.
- TRL and trackio for the RL framework and monitoring.
- Evalplus for the benchmark datasets.
- OpenCoder-LLM for training data.
- Liquid AI for the LFM2.5 model and GRPO guidance.
Code: MIT license.
Model Weights: LFM license (Commercial restriction for >$10M revenue orgs).