Go-MicroGPT

A tiny GPT-like character-level model in pure Go, inspired by the microgpt/makemore style: train and sample from small GPTs using pure, dependency-free Go. Original author/inspiration: Andrej Karpathy - https://github.com/karpathy/makemore

Package Documentation

  • pkg.go.dev: https://pkg.go.dev/github.com/eSlider/go-microgpt

What this project does

  • Trains a small decoder-only model on names.
  • Uses a lightweight autograd engine implemented in Go.
  • Generates new ("hallucinated") names after training.
  • Supports native CLI execution and browser execution through WASM.
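A scalar autograd engine of the kind mentioned above can be sketched in a few dozen lines of pure Go. The Value type and method names below are an illustrative stand-in, not the repository's actual API:

```go
package main

import "fmt"

// Value is a scalar autograd node: it stores data, an accumulated
// gradient, its children, and a closure that propagates gradients.
type Value struct {
	Data     float64
	Grad     float64
	children []*Value
	backward func()
}

func NewValue(x float64) *Value { return &Value{Data: x} }

// Mul builds a product node and records its local gradients.
func (a *Value) Mul(b *Value) *Value {
	out := &Value{Data: a.Data * b.Data, children: []*Value{a, b}}
	out.backward = func() {
		a.Grad += b.Data * out.Grad // d(ab)/da = b
		b.Grad += a.Data * out.Grad // d(ab)/db = a
	}
	return out
}

// Add builds a sum node; gradients pass through unchanged.
func (a *Value) Add(b *Value) *Value {
	out := &Value{Data: a.Data + b.Data, children: []*Value{a, b}}
	out.backward = func() {
		a.Grad += out.Grad
		b.Grad += out.Grad
	}
	return out
}

// Backward runs reverse-mode autodiff in topological order.
func (v *Value) Backward() {
	var topo []*Value
	visited := map[*Value]bool{}
	var build func(n *Value)
	build = func(n *Value) {
		if visited[n] {
			return
		}
		visited[n] = true
		for _, c := range n.children {
			build(c)
		}
		topo = append(topo, n)
	}
	build(v)
	v.Grad = 1
	for i := len(topo) - 1; i >= 0; i-- {
		if topo[i].backward != nil {
			topo[i].backward()
		}
	}
}

func main() {
	x, w := NewValue(3), NewValue(2)
	y := x.Mul(w).Add(NewValue(1)) // y = x*w + 1 = 7
	y.Backward()
	fmt.Println(y.Data, x.Grad, w.Grad) // 7 2 3
}
```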

Run (native)

go run .

If input.txt is missing in native mode, it is downloaded automatically from the names dataset URL in main.go.

Run in browser (WASM)

Build wasm artifacts:

scripts/build_wasm.sh

Serve static files:

cd web && python3 -m http.server 8080

Open:

http://localhost:8080

Notes:

  • web/index.html is the loader UI.
  • Browser mode uses a built-in fallback mini dataset (WASM cannot read local files directly like native Go).

Profiling support

You can generate profiles using env vars:

MICROGPT_CPU_PROFILE=cpu.pprof MICROGPT_MEM_PROFILE=mem.pprof go run .

Inspect:

go tool pprof -top ./go_microgpt cpu.pprof
go tool pprof -top -alloc_space ./go_microgpt mem.pprof

Optimization summary

The codebase was optimized in multiple focused passes, keeping behavior the same while reducing allocations and runtime overhead.

Comparison to Karpathy gist

Reference: https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95

  • The original gist emphasizes algorithmic clarity first ("Everything else is just efficiency"), and this project follows that same core algorithmic structure.
  • The Go version then focuses on efficiency engineering while keeping the algorithm simple and dependency-free.

What is achieved now with Golang

  • Same tiny-GPT teaching model style as the gist, but with production-oriented runtime engineering.
  • Native CLI and browser/WASM execution paths in one codebase.
  • Profiling-driven optimization workflow (pprof) built into normal runs.
  • End-to-end speed improvement of about 7.9x on the same machine (23.40s -> 2.97s), i.e. about 87.31% less runtime.
  • Allocation profile reduction from about 11.3GB to about 1.28GB total alloc space in profiled runs.
  • Measured against the original Python gist on this machine, current optimized Go runtime is about 108.96x faster (5:23.61 vs 2.97s) for a full run:
    • about 99.08% less runtime, or
    • about 10,795.96% higher throughput-equivalent speed.
  • Python memory (measured with /usr/bin/time -v): peak RSS 60,220 KB (~58.8 MB) for the original gist run (5:29.06).
  • Metric note: Go memory figures above come from pprof alloc-space and are not directly the same metric as peak RSS.
  • Main lesson from this Go implementation: most gains came from graph/memory/layout optimizations (fused ops, pooling, scratch reuse), not from adding more goroutines alone.

Major optimizations applied

  1. Adaptive concurrency in GPT forward

    • Parallelized Q/K/V and attention heads only when work is large enough.
    • Avoided goroutine overhead on tiny workloads.
  2. Reduced attention allocations

    • Removed temporary per-head key/value slice construction.
    • Indexed directly into cached layer vectors.
  3. Precomputed layer parameter keys

    • Removed repeated fmt.Sprintf calls on hot token loops.
  4. Inference-only numeric fast path

    • Added no-autograd inference (float64) for sampling.
    • Avoided building autograd graphs during generation.
  5. Autograd graph memory optimization

    • Added sync.Pool for temporary Value nodes.
    • Replaced per-backward visited map with mark-based traversal.
    • Released graph nodes after each step.
  6. Tensor-style cross-entropy head during training

    • Replaced autograd Softmax->Log->Neg chain with numeric CE and direct logits gradients (probs - onehot).
  7. Fused autograd primitives

    • Added Dot(...) and WeightedSum(...) ops to collapse many Add/Mul nodes into single nodes.
  8. Pooled buffers for fused ops

    • Reused children/local-grad slices for common small sizes (8/16/32).
  9. Step-level scratch reuse

    • Reused keys/values/loss buffers/logit scratch across train and inference loops.
  10. Fused RMSNorm autograd

    • Implemented analytic RMSNorm local gradients in a single custom op path.
  11. Range-over-int cleanup

    • Updated counted loops to for i := range n style on Go 1.25.
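The sync.Pool node recycling described in optimization 5 can be sketched as follows. The Value type and helper names are illustrative stand-ins, not the repository's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// Value stands in for an autograd graph node.
type Value struct {
	Data, Grad float64
	children   []*Value
}

// valuePool recycles Value nodes between training steps so the garbage
// collector does not have to reclaim millions of short-lived graph nodes.
var valuePool = sync.Pool{
	New: func() any { return new(Value) },
}

// newValue gets a node from the pool and resets any state left over
// from a previous use.
func newValue(x float64) *Value {
	v := valuePool.Get().(*Value)
	v.Data, v.Grad = x, 0
	v.children = v.children[:0] // keep capacity, drop old edges
	return v
}

// release returns a whole graph's nodes to the pool after a step.
func release(nodes []*Value) {
	for _, v := range nodes {
		valuePool.Put(v)
	}
}

func main() {
	v := newValue(1.5)
	fmt.Println(v.Data)
	release([]*Value{v})
}
```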
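The numeric cross-entropy head from optimization 6, with its direct probs - onehot logits gradient, can be sketched as below. The function name is hypothetical:

```go
package main

import (
	"fmt"
	"math"
)

// crossEntropyGrad computes the loss -log(softmax(logits)[target]) and the
// gradient of that loss w.r.t. the logits, which is simply probs - onehot.
func crossEntropyGrad(logits []float64, target int) (loss float64, grad []float64) {
	// Numerically stable softmax: subtract the max logit first.
	maxL := logits[0]
	for _, l := range logits[1:] {
		if l > maxL {
			maxL = l
		}
	}
	grad = make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		grad[i] = math.Exp(l - maxL)
		sum += grad[i]
	}
	for i := range grad {
		grad[i] /= sum // grad now holds the softmax probabilities
	}
	loss = -math.Log(grad[target])
	grad[target] -= 1 // probs - onehot
	return loss, grad
}

func main() {
	loss, grad := crossEntropyGrad([]float64{2, 1, 0}, 0)
	fmt.Printf("%.4f %.4f\n", loss, grad[0])
}
```

Replacing the autograd Softmax->Log->Neg chain with this closed form avoids building and walking many intermediate graph nodes per token.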
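The fused Dot op from optimization 7 can be sketched against a toy scalar node. Instead of n Mul nodes plus n-1 Add nodes, a single node's backward closure applies all 2n local gradients at once (the Value type here is a minimal stand-in):

```go
package main

import "fmt"

// Value is a minimal autograd node for this sketch.
type Value struct {
	Data     float64
	Grad     float64
	backward func()
}

// Dot collapses a whole dot product into one graph node.
func Dot(a, b []*Value) *Value {
	out := &Value{}
	for i := range a {
		out.Data += a[i].Data * b[i].Data
	}
	out.backward = func() {
		// One closure distributes gradients to every input.
		for i := range a {
			a[i].Grad += b[i].Data * out.Grad
			b[i].Grad += a[i].Data * out.Grad
		}
	}
	return out
}

func main() {
	a := []*Value{{Data: 1}, {Data: 2}}
	b := []*Value{{Data: 3}, {Data: 4}}
	out := Dot(a, b) // 1*3 + 2*4 = 11
	out.Grad = 1
	out.backward()
	fmt.Println(out.Data, a[0].Grad, b[1].Grad) // 11 3 2
}
```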

Measured runtime comparison

Measurements below are from repeated end-to-end go run . runs on the same machine during optimization work. They are approximate and include run-to-run noise, but show the trend clearly.

Stage                                            Elapsed
Karpathy original Python gist (this machine)     5:23.61
Early baseline                                   23.40s
Adaptive concurrency + key/allocation cleanup    22.27s
Numeric inference path                           21.71s
Graph pooling + mark traversal                   15.87s
Numeric CE training head                         14.98s
Fused Dot/WeightedSum ops                        3.17s
Buffer pooling + scratch reuse + fused RMSNorm   2.97s

Overall speedup from the recorded baseline: ~7.9x (23.40s -> 2.97s), i.e. about 87.31% runtime reduction.

Remaining improvement ideas

  • Tune GOGC (200/300) for this allocation profile.
  • Add microbenchmarks (go test -bench) for key kernels.
  • For bigger models: move to contiguous tensor core and/or BLAS/GPU backend.

About

MicroGPT rewritten and optimized in Go by eSlider
