Tiny GPT-like character-level model in pure Go, inspired by the microgpt/makemore style: training and sampling GPTs in pure, dependency-free Go. Original author/inspiration: Andrej Karpathy - https://github.com/karpathy/makemore
- pkg.go.dev: https://pkg.go.dev/github.com/eSlider/go-microgpt
- Trains a small decoder-only model on names.
- Uses a lightweight autograd engine implemented in Go.
- Generates new ("hallucinated") names after training.
- Supports native CLI execution and browser execution through WASM.
```sh
go run .
```

If `input.txt` is missing in native mode, it is downloaded automatically from the names dataset URL in `main.go`.
Build WASM artifacts:

```sh
scripts/build_wasm.sh
```

Serve static files:

```sh
cd web && python3 -m http.server 8080
```

Open:
http://localhost:8080
Notes:
- `web/index.html` is the loader UI.
- Browser mode uses a built-in fallback mini dataset (WASM cannot read local files directly the way native Go can).
You can generate profiles using env vars:
```sh
MICROGPT_CPU_PROFILE=cpu.pprof MICROGPT_MEM_PROFILE=mem.pprof go run .
```

Inspect:

```sh
go tool pprof -top ./go_microgpt cpu.pprof
go tool pprof -top -alloc_space ./go_microgpt mem.pprof
```

The codebase was optimized in multiple focused passes, keeping behavior the same while reducing allocations and runtime overhead.
Reference: https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95
- The original gist emphasizes algorithmic clarity first ("Everything else is just efficiency"), and this project follows that same core algorithmic structure.
- The Go version then focuses on efficiency engineering while keeping the algorithm simple and dependency-free.
- Same tiny-GPT teaching model style as the gist, but with production-oriented runtime engineering.
- Native CLI and browser/WASM execution paths in one codebase.
- Profiling-driven optimization workflow (`pprof`) built into normal runs.
- End-to-end speed improvement of about 7.9x on the same machine (23.40s -> 2.97s), i.e. about 87.31% less runtime.
- Allocation profile reduction from about 11.3GB to about 1.28GB total alloc space in profiled runs.
- Measured against the original Python gist on this machine, the current optimized Go runtime is about 108.96x faster (5:23.61 vs 2.97s) for a full run:
  - about 99.08% less runtime, or
  - about 10,795.96% higher throughput-equivalent speed.
- Python memory (measured with `/usr/bin/time -v`): peak RSS 60,220 KB (~58.8 MB) for the original gist run (5:29.06).
- Metric note: the Go memory figures above come from `pprof` alloc-space and are not directly the same metric as peak RSS.
- Main lesson from this Go implementation: most gains came from graph/memory/layout optimizations (fused ops, pooling, scratch reuse), not from adding more goroutines alone.
- Adaptive concurrency in GPT forward
  - Parallelized Q/K/V and attention heads only when the work is large enough.
  - Avoided goroutine overhead on tiny workloads.
- Reduced attention allocations
  - Removed temporary per-head key/value slice construction.
  - Indexed directly into cached layer vectors.
- Precomputed layer parameter keys
  - Removed repeated `fmt.Sprintf` calls on hot token loops.
- Inference-only numeric fast path
  - Added no-autograd inference (`float64`) for sampling.
  - Avoided building autograd graphs during generation.
- Autograd graph memory optimization
  - Added `sync.Pool` for temporary `Value` nodes.
  - Replaced the per-backward visited map with mark-based traversal.
  - Released graph nodes after each step.
- Tensor-style cross-entropy head during training
  - Replaced the autograd `Softmax -> Log -> Neg` chain with numeric CE and direct logits gradients (`probs - onehot`).
- Fused autograd primitives
  - Added `Dot(...)` and `WeightedSum(...)` ops to collapse many `Add`/`Mul` nodes into single nodes.
- Pooled buffers for fused ops
  - Reused children/local-grad slices for common small sizes (8/16/32).
- Step-level scratch reuse
  - Reused keys/values/loss buffers/logit scratch across train and inference loops.
- Fused RMSNorm autograd
  - Implemented analytic RMSNorm local gradients in a single custom op path.
- Range-over-int cleanup
  - Updated counted loops to `for i := range n` style on Go 1.25.
Measurements below are from repeated end-to-end `go run .` runs on the same machine during optimization work.
They are approximate and include run-to-run noise, but show the trend clearly.
| Stage | Elapsed |
|---|---|
| Karpathy original Python gist (this machine) | 5:23.61 |
| Early baseline | 23.40s |
| Adaptive concurrency + key/allocation cleanup | 22.27s |
| Numeric inference path | 21.71s |
| Graph pooling + mark traversal | 15.87s |
| Numeric CE training head | 14.98s |
| Fused Dot/WeightedSum ops | 3.17s |
| Buffer pooling + scratch reuse + fused RMSNorm | 2.97s |
Overall speedup from the recorded baseline: ~7.9x (23.40s -> 2.97s), i.e. about 87.31% runtime reduction.
- Tune `GOGC` (200/300) for this allocation profile.
- Add microbenchmarks (`go test -bench`) for key kernels.
- For bigger models: move to a contiguous tensor core and/or a BLAS/GPU backend.