Tiny GPT-like character-level model in pure Go, inspired by the microgpt/makemore style: training and sampling GPTs in pure, dependency-free Go. Original author/inspiration: Andrej Karpathy - https://github.com/karpathy/makemore
- pkg.go.dev: https://pkg.go.dev/github.com/eSlider/go-microgpt
- Trains a small decoder-only model on names.
- Uses a lightweight autograd engine implemented in Go.
- Generates new ("hallucinated") names after training.
- Supports native CLI execution and browser execution through WASM.
```sh
go run .
```

If `input.txt` is missing in native mode, it is downloaded automatically from the names dataset URL in `main.go`.
Build WASM artifacts:

```sh
scripts/build_wasm.sh
```

Serve static files:

```sh
cd web && python3 -m http.server 8080
```

Open:
http://localhost:8080
Notes:
- `web/index.html` is the loader UI.
- Browser mode uses a built-in fallback mini dataset (WASM cannot read local files directly the way native Go can).
You can generate profiles using env vars:
```sh
MICROGPT_CPU_PROFILE=cpu.pprof MICROGPT_MEM_PROFILE=mem.pprof go run .
```

Inspect:

```sh
go tool pprof -top ./go_microgpt cpu.pprof
go tool pprof -top -alloc_space ./go_microgpt mem.pprof
```

The codebase was optimized in multiple focused passes, keeping behavior the same while reducing allocations and runtime overhead.
Reference: https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95
- The original gist emphasizes algorithmic clarity first ("Everything else is just efficiency"), and this project follows that same core algorithmic structure.
- The Go version then focuses on efficiency engineering while keeping the algorithm simple and dependency-free.
- Same tiny-GPT teaching model style as the gist, but with production-oriented runtime engineering.
- Native CLI and browser/WASM execution paths in one codebase.
- Profiling-driven optimization workflow (`pprof`) built into normal runs.
- End-to-end speed improvement of about 7.9x on the same machine (23.40s -> 2.97s), i.e. about 87.31% less runtime.
- Allocation profile reduction from about 11.3GB to about 1.28GB total alloc space in profiled runs.
- Measured against the original Python gist on this machine, the current optimized Go runtime is about 108.96x faster (5:23.61 vs 2.97s) for a full run:
  - about 99.08% less runtime, or
  - about 10,795.96% higher throughput-equivalent speed.
- Python memory (measured with `/usr/bin/time -v`): peak RSS 60,220 KB (~58.8 MB) for the original gist run (5:29.06).
- Metric note: the Go memory figures above come from `pprof` alloc-space and are not directly the same metric as peak RSS.
- Main lesson from this Go implementation: most gains came from graph/memory/layout optimizations (fused ops, pooling, scratch reuse), not from adding more goroutines alone.
- Adaptive concurrency in GPT forward
  - Parallelized Q/K/V and attention heads only when the work is large enough.
  - Avoided goroutine overhead on tiny workloads.
- Reduced attention allocations
  - Removed temporary per-head key/value slice construction.
  - Indexed directly into cached layer vectors.
- Precomputed layer parameter keys
  - Removed repeated `fmt.Sprintf` calls on hot token loops.
- Inference-only numeric fast path
  - Added no-autograd inference (`float64`) for sampling.
  - Avoided building autograd graphs during generation.
- Autograd graph memory optimization
  - Added `sync.Pool` for temporary `Value` nodes.
  - Replaced the per-backward visited map with mark-based traversal.
  - Released graph nodes after each step.
- Tensor-style cross-entropy head during training
  - Replaced the autograd `Softmax -> Log -> Neg` chain with numeric CE and direct logits gradients (`probs - onehot`).
- Fused autograd primitives
  - Added `Dot(...)` and `WeightedSum(...)` ops to collapse many `Add`/`Mul` nodes into single nodes.
- Pooled buffers for fused ops
  - Reused children/local-grad slices for common small sizes (8/16/32).
- Step-level scratch reuse
  - Reused keys/values/loss buffers/logit scratch across train and inference loops.
- Fused RMSNorm autograd
  - Implemented analytic RMSNorm local gradients in a single custom op path.
- Range-over-int cleanup
  - Updated counted loops to `for i := range n` style on Go 1.25.
Measurements below are from repeated end-to-end `go run .` runs on the same machine during optimization work.
They are approximate and include run-to-run noise, but show the trend clearly.
| Stage | Elapsed |
|---|---|
| Karpathy original Python gist (this machine) | 5:23.61 |
| Early baseline | 23.40s |
| Adaptive concurrency + key/allocation cleanup | 22.27s |
| Numeric inference path | 21.71s |
| Graph pooling + mark traversal | 15.87s |
| Numeric CE training head | 14.98s |
| Fused Dot/WeightedSum ops | 3.17s |
| Buffer pooling + scratch reuse + fused RMSNorm | 2.97s |
Overall speedup from the recorded baseline: ~7.9x (23.40s -> 2.97s), i.e. about 87.31% runtime reduction.
- Tune `GOGC` (200/300) for this allocation profile.
- Add microbenchmarks (`go test -bench`) for key kernels.
- For bigger models: move to a contiguous tensor core and/or a BLAS/GPU backend.