00 welcome
Get anneal running, train a small transformer, sample from GPT-2.
This walkthrough takes about ten minutes of reading and a few minutes of waiting on a download. By the end you will have installed anneal, trained a tiny char-level transformer on Shakespeare end to end on your GPU, and produced a sample from GPT-2-small running entirely as Go on WebGPU. Along the way you'll see how these models are put together (the embedding table, attention heads, the autoregressive sampling loop) and how anneal compiles them into fused WGSL kernels. Each command surfaces one layer of either the model or the compiler.
Why these three demos in this order?
Training nanoGPT walks the full pipeline end to end: a tiny transformer learns to write Shakespeare-ish text from scratch, and you see how the forward graph, the gradient pass, and the scheduler all run through one immutable IR. Running GPT-2-small forward shows the same compiler handling a real production checkpoint from HuggingFace, with next-token logits that agree with the reference to within f32 noise. Install and sanity check come before both because the WebGPU adapter probe is the one thing worth confirming early.
01 prereqs
A Go toolchain, a WebGPU adapter, and roughly 600 MB of disk.
anneal is zero-CGO and ships as a single static Go binary. You need Go 1.22 or newer (verify with go version), a WebGPU adapter on your platform (matrix below), about 600 MB of disk for the GPT-2 weights cache, and a network connection on first run so anneal can fetch tokenizer and weight assets.
Why WebGPU and not CUDA?
WebGPU is the cross-vendor compute substrate that runs on Metal, Vulkan, and DirectX 12 from a single shader language (WGSL). It lets the entire stack stay zero-CGO and lets the same WGSL the CLI emits also drive the in-browser visualizer.
02 install
One command. One binary.
Install the CLI with go install. The binary lands in $GOPATH/bin (or $GOBIN if you set it); make sure that directory is on your $PATH.
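If you are unsure whether the install directory is on your $PATH, a small Go check like the one below can tell you. It is a sketch: it follows the documented go install rule (GOBIN, then the first GOPATH entry's bin, then $HOME/go/bin) but reads only environment variables, so it will not see values set with go env -w. The helper names installDir and onPath are made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// installDir mirrors the rule go install follows: GOBIN if set, otherwise
// bin under the first GOPATH entry, otherwise $HOME/go/bin.
func installDir(gobin, gopath, home string) string {
	if gobin != "" {
		return gobin
	}
	if gopath != "" {
		return filepath.Join(filepath.SplitList(gopath)[0], "bin")
	}
	return filepath.Join(home, "go", "bin")
}

// onPath reports whether dir appears as an entry in a PATH-style list.
func onPath(dir, pathEnv string) bool {
	want := filepath.Clean(dir)
	for _, entry := range filepath.SplitList(pathEnv) {
		if entry != "" && filepath.Clean(entry) == want {
			return true
		}
	}
	return false
}

func main() {
	home, _ := os.UserHomeDir()
	dir := installDir(os.Getenv("GOBIN"), os.Getenv("GOPATH"), home)
	if onPath(dir, os.Getenv("PATH")) {
		fmt.Println("ok:", dir, "is on PATH")
	} else {
		fmt.Println("add to PATH:", dir)
	}
}
```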
Why go install and not a release binary?
go install with @latest resolves to the newest tagged release at install time and builds it from source, with the module contents verified against the Go checksum database. For a reproducible build, pin an explicit version tag in place of @latest. The result is a single statically linked binary with no runtime dependency on the Go toolchain.
03 sanity check
Confirm a WebGPU adapter is reachable.
anneal doctor probes for a WebGPU adapter on your platform and prints the device info it finds. Run this before anything else; it tells you in two seconds whether the rest of the tutorial can succeed.
What does doctor actually check?
It opens a real WebGPU instance, requests an adapter and device, and reports back the adapter name, backend type (Metal, Vulkan, or DX12), device features (notably shader-f16, which gates the f16 path), and the platform's max buffer size. No kernel runs; this is a connectivity probe.
04 train nanogpt
Train a char-level transformer on Shakespeare, end to end, on your GPU.
anneal train nanogpt downloads the Tiny Shakespeare corpus (about 1 MB), constructs a small char-level transformer (vocab ~65, a few attention blocks, GELU MLPs), trains it for the requested number of steps, then prints a Shakespeare-flavored sample from the trained weights. The training loop runs through the standard tensor.Backward() + tensor.Realize() path, so what you see is the rangeify scheduler fusing the forward and backward passes into the same WGSL kernels.
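To make "char-level" concrete: the vocabulary is just the set of distinct characters in the corpus, which is why Tiny Shakespeare lands at roughly 65 tokens. The sketch below builds such a vocabulary in plain Go. CharVocab and its methods are illustrative names, not anneal's tokenizer API.

```go
package main

import (
	"fmt"
	"sort"
)

// CharVocab maps each distinct character in a corpus to a token id and back.
type CharVocab struct {
	toID   map[rune]int
	toRune []rune
}

// NewCharVocab assigns ids in sorted character order so the mapping is stable.
func NewCharVocab(corpus string) *CharVocab {
	seen := map[rune]bool{}
	for _, r := range corpus {
		seen[r] = true
	}
	runes := make([]rune, 0, len(seen))
	for r := range seen {
		runes = append(runes, r)
	}
	sort.Slice(runes, func(i, j int) bool { return runes[i] < runes[j] })
	v := &CharVocab{toID: make(map[rune]int, len(runes)), toRune: runes}
	for id, r := range runes {
		v.toID[r] = id
	}
	return v
}

func (v *CharVocab) Size() int { return len(v.toRune) }

// Encode turns text into token ids; characters outside the corpus are an error.
func (v *CharVocab) Encode(s string) ([]int, error) {
	ids := make([]int, 0, len(s))
	for _, r := range s {
		id, ok := v.toID[r]
		if !ok {
			return nil, fmt.Errorf("character %q is not in the vocabulary", r)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

// Decode turns token ids back into text.
func (v *CharVocab) Decode(ids []int) string {
	out := make([]rune, len(ids))
	for i, id := range ids {
		out[i] = v.toRune[id]
	}
	return string(out)
}

func main() {
	v := NewCharVocab("to be, or not to be")
	ids, err := v.Encode("not to be")
	if err != nil {
		panic(err)
	}
	fmt.Println(v.Size(), ids, v.Decode(ids))
}
```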
nn.Embedding goes through one-hot × W in tinygrad; anneal uses a real Gather op, which keeps the backward cheaper.
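The difference is easy to see on plain slices. The sketch below is ordinary Go, not anneal's Gather op: it computes the same embedding lookup both ways, then shows that the gather backward pass is a scatter-add that touches only the rows that were looked up. All function names here are made up for the example.

```go
package main

import "fmt"

// embedOneHot computes onehot(ids) x W densely: every token scans every
// vocab row and multiplies it by 0 or 1.
func embedOneHot(W [][]float32, ids []int) [][]float32 {
	out := make([][]float32, len(ids))
	for t, id := range ids {
		out[t] = make([]float32, len(W[0]))
		for v := range W {
			var hot float32
			if v == id {
				hot = 1
			}
			for d := range W[v] {
				out[t][d] += hot * W[v][d]
			}
		}
	}
	return out
}

// embedGather copies row ids[t] of W directly.
func embedGather(W [][]float32, ids []int) [][]float32 {
	out := make([][]float32, len(ids))
	for t, id := range ids {
		out[t] = append([]float32(nil), W[id]...)
	}
	return out
}

// gatherBackward scatter-adds the upstream gradient into only the rows that
// were looked up; every other row of dW stays zero.
func gatherBackward(vocab, dim int, ids []int, dOut [][]float32) [][]float32 {
	dW := make([][]float32, vocab)
	for v := range dW {
		dW[v] = make([]float32, dim)
	}
	for t, id := range ids {
		for d, g := range dOut[t] {
			dW[id][d] += g
		}
	}
	return dW
}

// same reports whether two matrices are elementwise equal.
func same(a, b [][]float32) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if len(a[i]) != len(b[i]) {
			return false
		}
		for j := range a[i] {
			if a[i][j] != b[i][j] {
				return false
			}
		}
	}
	return true
}

func main() {
	W := [][]float32{{1, 2}, {3, 4}, {5, 6}} // vocab = 3, dim = 2
	ids := []int{2, 0, 2}
	fmt.Println(same(embedOneHot(W, ids), embedGather(W, ids)))
	dOut := [][]float32{{1, 1}, {2, 2}, {3, 3}}
	fmt.Println(gatherBackward(len(W), 2, ids, dOut))
}
```

The one-hot path does work proportional to vocab × dim for every token; the gather path does work proportional to dim.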
Why does forward + backward fuse into the same kernel?
In anneal, gradients are graph-rewrite output, not closures over Python objects. The backward UOps live in the same DAG as the forward UOps. The rangeify scheduler indexes both passes by output range, and when the forward producer and the backward consumer share that range, the scheduler collapses them into one WGSL kernel: no intermediate buffer materializes, and the seam between the two passes vanishes.
A tape-based autograd cannot do this: it sees forward and backward as separate programs by construction. Because anneal's gradients live in the same graph, the boundary between the two passes is something the scheduler can see and decide to collapse.
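As a loose analogy in plain Go loops (not anneal's scheduler or its WGSL output): take y = relu(x) with a squared-error loss against a target. The forward value, the loss gradient, and the input gradient are all indexed by the same range i, so three passes with two intermediate buffers can collapse into one loop with none. The function names are invented for this sketch.

```go
package main

import "fmt"

func relu(v float32) float32 {
	if v > 0 {
		return v
	}
	return 0
}

// gradUnfused runs forward and backward as separate passes and materializes
// the intermediates y and dy between them.
func gradUnfused(x, target []float32) []float32 {
	y := make([]float32, len(x)) // forward activation buffer
	for i := range x {
		y[i] = relu(x[i])
	}
	dy := make([]float32, len(x)) // gradient of 0.5*(y-target)^2 w.r.t. y
	for i := range x {
		dy[i] = y[i] - target[i]
	}
	dx := make([]float32, len(x))
	for i := range x {
		if x[i] > 0 { // relu passes gradient only where the input was positive
			dx[i] = dy[i]
		}
	}
	return dx
}

// gradFused computes the same dx in one loop over the shared range; y and dy
// live only in registers and never become buffers.
func gradFused(x, target []float32) []float32 {
	dx := make([]float32, len(x))
	for i := range x {
		y := relu(x[i])
		if x[i] > 0 {
			dx[i] = y - target[i]
		}
	}
	return dx
}

func main() {
	x := []float32{-1, 2, 3}
	target := []float32{0, 1, 5}
	fmt.Println(gradUnfused(x, target))
	fmt.Println(gradFused(x, target))
}
```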
Curious how it compiles? Open the visualizer to step through the same kind of forward / backward / fused-kernel pipeline.
05 what just happened
A tour through the compiler that made the kernels.
Now that you've seen the model run, let's look at how anneal built the kernels. anneal explain shows you the rewrite rules that fire for a single op; anneal kernels shows you the final WGSL with fusion boundaries annotated; and the visualizer steps through every stage of the pipeline that produced them.
There is no torch.compile backend layer and no extra trip through Triton. The CLI is already inside the compiler.
UOps, rangeify, and the .upat DSL in one sentence each
UOps: a single immutable, interned, arena-allocated IR node that represents every operation, forward and backward (uop/).
Rangeify: movement ops (reshape, permute, expand, pad, shrink, flip) become index arithmetic, never copies; the scheduler indexes every kernel by its output range (schedule/).
.upat DSL: per-op pattern files compiled at build time into match functions, so the rewrite hot path is reflection-free and inspectable (rewrite/).
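The "index arithmetic, never copies" idea behind rangeify can be sketched with a strided view. This is a minimal illustration on a 2-D buffer, assuming a row-major layout; the View type and its methods are hypothetical and are not anneal's IR. Permute and flip change only the offset and strides, so every view reads the same backing buffer.

```go
package main

import "fmt"

// View is a 2-D window onto a flat buffer. Element (i, j) lives at
// data[offset + i*strides[0] + j*strides[1]].
type View struct {
	data    []float32
	offset  int
	shape   [2]int
	strides [2]int
}

// NewView wraps a row-major rows x cols buffer.
func NewView(data []float32, rows, cols int) View {
	return View{data: data, shape: [2]int{rows, cols}, strides: [2]int{cols, 1}}
}

func (v View) At(i, j int) float32 {
	return v.data[v.offset+i*v.strides[0]+j*v.strides[1]]
}

// Permute swaps the two axes by swapping shape and strides; no element moves.
func (v View) Permute() View {
	v.shape[0], v.shape[1] = v.shape[1], v.shape[0]
	v.strides[0], v.strides[1] = v.strides[1], v.strides[0]
	return v
}

// FlipRows reverses axis 0 by starting at the last row and negating the
// row stride.
func (v View) FlipRows() View {
	v.offset += (v.shape[0] - 1) * v.strides[0]
	v.strides[0] = -v.strides[0]
	return v
}

func main() {
	buf := []float32{0, 1, 2, 3, 4, 5} // 2x3, row-major
	v := NewView(buf, 2, 3)
	p := v.Permute()
	f := v.FlipRows()
	fmt.Println(v.At(1, 2), p.At(2, 1), f.At(0, 2))
	buf[5] = 50 // every view sees the write, because none of them copied
	fmt.Println(v.At(1, 2), p.At(2, 1), f.At(0, 2))
}
```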
06 run gpt-2-small
Sample from the real GPT-2-small, end to end in Go.
anneal gpt2 sample fetches the HuggingFace GPT-2-small weights (model.safetensors, about 548 MB), the vocab (vocab.json, about 1 MB), and the merges (merges.txt, about 500 KB), all SHA-pinned and atomically downloaded into $ANNEAL_CACHE_DIR. It then runs forward inference, BPE-decodes the output, and prints a sample.
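A common way to get "SHA-pinned and atomic" behavior is to stream into a temporary file in the destination directory, hash while writing, and rename into place only if the digest matches. The sketch below shows that pattern with the standard library; it is an assumption about the general technique, not anneal's actual downloader, and writeVerified, sha256Hex, and fetchToTempDir are made-up names. An in-memory reader stands in for the network.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// writeVerified streams src into a temp file next to dst, hashing as it goes,
// and renames it into place only if the SHA-256 matches wantHex.
func writeVerified(dst string, src io.Reader, wantHex string) error {
	tmp, err := os.CreateTemp(filepath.Dir(dst), ".partial-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename

	h := sha256.New()
	if _, err := io.Copy(io.MultiWriter(tmp, h), src); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantHex {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantHex)
	}
	return os.Rename(tmp.Name(), dst)
}

// sha256Hex returns the hex SHA-256 digest of s.
func sha256Hex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// fetchToTempDir is a demo helper: it writes payload into a fresh temp
// directory through writeVerified and returns what landed on disk.
func fetchToTempDir(payload, wantHex string) (string, error) {
	dir, err := os.MkdirTemp("", "cache-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	dst := filepath.Join(dir, "asset.bin")
	if err := writeVerified(dst, strings.NewReader(payload), wantHex); err != nil {
		return "", err
	}
	b, err := os.ReadFile(dst)
	return string(b), err
}

func main() {
	got, err := fetchToTempDir("hello", sha256Hex("hello"))
	fmt.Println(got, err)
	_, err = fetchToTempDir("hello", sha256Hex("tampered"))
	fmt.Println(err)
}
```

Creating the temp file in the same directory as the destination keeps the rename on one filesystem, which is what makes it atomic on POSIX systems.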
Why forward-only, and where does this fit?
Training GPT-2 from scratch needs OpenWebText (about 40 GB), a multi-day budget, and a tuned schedule; that is out of scope for v1. Forward inference gives you a clean side-by-side against a reference implementation: the same weights drive HuggingFace and anneal, and the outputs should agree (to f32 noise).
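If you want to run that side-by-side yourself, one reasonable check is the largest elementwise difference between the two logit vectors, plus agreement on the argmax token. The sketch below assumes you have already dumped both vectors as []float32. The tolerance of 1e-4 used in main is an illustrative guess, not a figure from anneal's test suite, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// maxAbsDiff returns the largest elementwise |ref[i] - got[i]|.
// A NaN anywhere is reported as +Inf so it can never pass a tolerance check.
func maxAbsDiff(ref, got []float32) (float64, error) {
	if len(ref) != len(got) {
		return 0, fmt.Errorf("length mismatch: %d vs %d", len(ref), len(got))
	}
	worst := 0.0
	for i := range ref {
		d := math.Abs(float64(ref[i]) - float64(got[i]))
		if math.IsNaN(d) {
			return math.Inf(1), nil
		}
		if d > worst {
			worst = d
		}
	}
	return worst, nil
}

// argmax returns the index of the largest element, or -1 for an empty slice.
func argmax(x []float32) int {
	best := -1
	for i, v := range x {
		if best == -1 || v > x[best] {
			best = i
		}
	}
	return best
}

// logitsAgree reports whether two logit vectors match within tol and pick
// the same next token.
func logitsAgree(ref, got []float32, tol float64) bool {
	d, err := maxAbsDiff(ref, got)
	if err != nil {
		return false
	}
	return d <= tol && argmax(ref) == argmax(got)
}

func main() {
	ref := []float32{1, 2, 3}
	got := []float32{1.00001, 2, 3}
	d, _ := maxAbsDiff(ref, got)
	fmt.Println(d, logitsAgree(ref, got, 1e-4))
}
```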
The BPE encoder is pure Go (no tiktoken dependency). The safetensors loader is pure Go (no Python shim). The embedding lookup at the front of the model is the same Gather op from the nanoGPT demo. Nothing changes at the compiler level: this is the same pipeline, with bigger weights.
07 going further
Explore the compiler interactively.
From here the most useful surfaces are the visualizer, the graph dump, and the source.
For a static walkthrough of an example pipeline without spinning up anneal viz, see the visualizer demo. To contribute, read CONTRIBUTING.md and the architecture spec at SPEC.md. The source lives at github.com/georgebuilds/anneal.
08 troubleshooting
Common failure modes, and what to do about them.
No WebGPU adapter detected
On macOS Apple Silicon the WebGPU adapter is built in; if anneal doctor reports no adapter, your Go build likely didn't link the Metal path. Reinstall with go install on the same machine you intend to run on.
On Linux, install a Vulkan-capable driver for your GPU and re-run anneal doctor. On Windows, ensure DX12 is available.
Asset downloads keep retrying / network is sketchy
Set ANNEAL_OFFLINE=1 to fail closed when the cache is empty, so you don't waste a retry budget. Pre-populate $ANNEAL_CACHE_DIR (defaults to $XDG_CACHE_HOME/anneal) from a machine that has network. The cache directory layout is stable.
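The lookup order and the fail-closed behavior described here can be sketched as follows. The precedence of ANNEAL_CACHE_DIR over $XDG_CACHE_HOME/anneal comes from the text above. The final fallback to $HOME/.cache is the XDG default and an assumption about anneal, as is placing model.safetensors directly under the cache directory. cacheDir and ensureAsset are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// cacheDir resolves the cache location: an explicit ANNEAL_CACHE_DIR wins,
// then $XDG_CACHE_HOME/anneal. The $HOME/.cache fallback is the XDG default
// and an assumption in this sketch.
func cacheDir(annealCacheDir, xdgCacheHome, home string) string {
	if annealCacheDir != "" {
		return annealCacheDir
	}
	if xdgCacheHome != "" {
		return filepath.Join(xdgCacheHome, "anneal")
	}
	return filepath.Join(home, ".cache", "anneal")
}

var errOfflineMiss = errors.New("asset missing from cache and ANNEAL_OFFLINE=1")

// ensureAsset returns nil if path already exists, fails closed when offline,
// and otherwise hands off to fetch.
func ensureAsset(path string, offline bool, fetch func(dst string) error) error {
	if _, err := os.Stat(path); err == nil {
		return nil
	}
	if offline {
		return fmt.Errorf("%s: %w", path, errOfflineMiss)
	}
	return fetch(path)
}

func main() {
	dir := cacheDir(os.Getenv("ANNEAL_CACHE_DIR"), os.Getenv("XDG_CACHE_HOME"), os.Getenv("HOME"))
	offline := os.Getenv("ANNEAL_OFFLINE") == "1"
	asset := filepath.Join(dir, "model.safetensors") // hypothetical cache layout
	err := ensureAsset(asset, offline, func(dst string) error {
		return errors.New("network fetch is not implemented in this sketch")
	})
	fmt.Println(asset, err)
}
```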
SHA mismatch on a downloaded asset
This is fail-closed by design. The likely cause is a corrupted partial download (network was cut mid-stream). Delete the offending file in $ANNEAL_CACHE_DIR and re-run; anneal does not silently accept mismatched assets.
Training is much slower than the docs imply
First-run compile time is dominated by WGSL pipeline creation and (optionally) BEAM autotuning. Subsequent runs hit the disk cache and start in well under a second. On Linux + wgpu, expect ~2x the latency of an M3 Mac because the Vulkan path is less tuned and the driver layer is thicker.
Linux + wgpu setup notes
You need an up-to-date Mesa or a vendor Vulkan driver. vulkaninfo should list at least one device; anneal doctor will then find it. If vulkaninfo works but anneal reports no adapter, file an issue with the full anneal doctor output.
qa questions
Frequently asked.
Grouped by topic.
install & environment
What if I have no GPU?
anneal run will fail at device probe.
What if I'm on Windows?
anneal doctor output attached.
Why Go, not Python?
nanogpt
Why char-level instead of BPE?
How long does training take?
Can I change the model size?
--steps, --batch, and --seed. Architectural knobs (layers, heads, dmodel) live in the example source, not on the command line, because the goal here is a stable reference, not a hyperparam playground.
gpt-2
Why no GPT-2 fine-tuning or training?
Where do the GPT-2 weights come from?
gpt2 model repository: model.safetensors, vocab.json, merges.txt. Each asset is SHA-pinned in code and atomically downloaded into the cache.
Why 548 MB? Can I get smaller?
anneal doctor); when it lands, the same checkpoint will load at half the size.