georgebuilds anneal tutorial
contents (jump to step)
  1. 00. welcome
  2. 01. prereqs
  3. 02. install
  4. 03. sanity check
  5. 04. train nanoGPT
  6. 05. compiler tour
  7. 06. run GPT-2-small
  8. 07. going further
  9. 08. troubleshooting
  10. qa

00welcome

Get anneal running, train a small transformer, sample from GPT-2.

This walkthrough takes about ten minutes of reading and a few minutes of waiting on a download. By the end you will have installed anneal, trained a tiny char-level transformer on Shakespeare end to end on your GPU, and produced a sample from GPT-2-small running entirely as Go on WebGPU. Along the way you'll see how these models are put together (the embedding table, attention heads, the autoregressive sampling loop) and how anneal compiles them into fused WGSL kernels. Each command surfaces one layer of either the model or the compiler.

Why these three demos in this order?

Training nanoGPT walks the full pipeline end to end: a tiny transformer learns to write Shakespeare-ish text from scratch, and you see how the forward graph, the gradient pass, and the scheduler all run through one immutable IR. Running GPT-2-small forward shows the same compiler handling a real production checkpoint from HuggingFace, with byte-level agreement on the next-token logits. Install and sanity check sit in front of both because the WebGPU adapter probe is the one thing worth confirming early.

01prereqs

A Go toolchain, a WebGPU adapter, and roughly 600 MB of disk.

anneal is zero-CGO and ships as a single static Go binary. You need Go 1.22 or newer (verify with go version), a WebGPU adapter on your platform (matrix below), about 600 MB of disk for the GPT-2 weights cache, and a network connection on first run so anneal can fetch tokenizer and weight assets.

primary
macOS (M-series)
Apple Silicon (M1, M2, M3, M4). WebGPU routes through Metal. The most exercised path; fastest first-run.
supported
Linux + wgpu
Vulkan-capable GPU via wgpu-native. Works; less exercised than macOS. Driver setup is on you.
experimental
Windows + wgpu
DirectX 12 path via wgpu-native. Builds; report issues on GitHub if you hit them.
vs PyTorch No CUDA install, no driver-matched wheels, no virtualenv. The Go toolchain plus your platform's WebGPU adapter is the whole prereq list.
Why WebGPU and not CUDA?

WebGPU is the cross-vendor compute substrate that runs on Metal, Vulkan, and DirectX 12 from a single shader language (WGSL). It lets the entire stack stay zero-CGO and lets the same WGSL the CLI emits also drive the in-browser visualizer.

02install

One command. One binary.

Install the CLI with go install. The binary lands in $GOPATH/bin (or $GOBIN if you set it); make sure that directory is on your $PATH.

$ go install github.com/georgebuilds/anneal/cmd/anneal@latest
go: downloading github.com/georgebuilds/anneal v0.1.0
$ anneal --version
anneal v0.1.0 (commit e66de45, go1.22)
vs pip install torch + CUDA setup anneal is a single static Go binary. No driver-matched wheels, no virtualenv to keep alive, no Python on the runtime side.
Why go install and not a release binary?

Pinning to @latest with go install gives you a reproducible build of a specific module version, signed by the Go module proxy. The result is a single statically linked binary; no runtime dependency on the Go toolchain after that.

03sanity check

Confirm a WebGPU adapter is reachable.

anneal doctor probes for a WebGPU adapter on your platform and prints the device info it finds. Run this before anything else; it tells you in two seconds whether the rest of the tutorial can succeed.

$ anneal doctor
webgpu adapter: Apple M3 Max (metal) backend type: Metal device features: shader-f16, timestamp-query max buffer size: 2147483648 (2 GiB) status: ok
What does doctor actually check?

It opens a real WebGPU instance, requests an adapter and device, and reports back the adapter name, backend type (Metal, Vulkan, or DX12), device features (notably shader-f16, which gates the f16 path), and the platform's max buffer size. No kernel runs; this is a connectivity probe.

04train nanogpt

Train a char-level transformer on Shakespeare, end to end, on your GPU.

anneal train nanogpt downloads the Tiny Shakespeare corpus (about 1 MB), constructs a small char-level transformer (vocab ~65, a few attention blocks, GELU MLPs), trains it for the requested number of steps, then prints a Shakespeare-flavored sample from the trained weights. The training loop runs through the standard tensor.Backward() + tensor.Realize() path, so what you see is the rangeify scheduler fusing the forward and backward passes into the same WGSL kernels.

$ anneal train nanogpt --steps=2000
fetch shakespeare.txt (~1 MiB) cached model nanogpt-char vocab=65 layers=4 heads=4 dmodel=128 device webgpu · apple m3 max step 0001 loss 4.17 step 0500 loss 2.38 step 2000 loss 1.52 forward 62 uops backward 49 uopsfused 7 kernels sample: ROMEO: but soft, what light through yonder window breaks? it is the east, and Juliet is the sun.
FUSED WGSL KERNEL x matmul gelu loss dL dgelu dW seam
Teal forward, ember backward, gold marks the fused region. The scheduler collapses both chains into one WGSL kernel where the indexing model agrees.
vs PyTorch nanoGPT About 150 lines vs about 300, because the optimizer and autograd are graph passes, not a separate framework around the model.
vs tinygrad nn.Embedding goes through one-hot × W in tinygrad; anneal uses a real Gather op, which keeps the backward cheaper.
Why does forward + backward fuse into the same kernel?

In anneal, gradients are graph-rewrite output, not closures over Python objects. The backward UOps live in the same DAG as the forward UOps. The rangeify scheduler indexes both passes by output range, and when the forward producer and the backward consumer share that range, the scheduler collapses them into one WGSL kernel: no intermediate buffer materializes, and the seam between the two passes vanishes.

A tape-based autograd cannot do this: it sees forward and backward as separate programs by construction. Because anneal's gradients live in the same graph, the boundary between the two passes is something the scheduler can see and decide to collapse.

Curious how it compiles? Open the visualizer to step through the same kind of forward / backward / fused-kernel pipeline.

05what just happened

A tour through the compiler that made the kernels.

Now that you've seen the model run, let's look at how anneal built the kernels. anneal explain shows you the rewrite rules that fire for a single op; anneal kernels shows you the final WGSL with fusion boundaries annotated; and the visualizer steps through every stage of the pipeline that produced them.

$ anneal explain gather
forward: indexed load over an embedding table. backward: deterministic sort + segment-sum into the index axis. rules fired: gather.fold-identity, gather.lift-permute
$ anneal kernels nanogpt
kernel 0/7 attn.qkv // fused: matmul + add + reshape kernel 3/7 mlp.gelu // fused: matmul + gelu + dropout kernel 6/7 loss.ce // fused: log_softmax + nll + grad seed
vs PyTorch There is no FX graph capture, no torch.compile backend layer, no extra trip through Triton. The CLI is already inside the compiler.
UOps, rangeify, and the .upat DSL in one sentence each

UOps: a single immutable, interned, arena-allocated IR node that represents every operation, forward and backward (uop/).

Rangeify: movement ops (reshape, permute, expand, pad, shrink, flip) become index arithmetic, never copies; the scheduler indexes every kernel by its output range (schedule/).

.upat DSL: per-op pattern files compiled at build time into match functions, so the rewrite hot path is reflection-free and inspectable (rewrite/).

06run gpt-2-small

Sample from the real GPT-2-small, end to end in Go.

anneal gpt2 sample fetches the HuggingFace GPT-2-small weights (model.safetensors, about 548 MB), the vocab (vocab.json, about 1 MB), and the merges (merges.txt, about 500 KB), all SHA-pinned and atomically downloaded into $ANNEAL_CACHE_DIR. It then runs forward inference, BPE-decodes the output, and prints a sample.

$ anneal gpt2 sample "Hello, world" --max-tokens=10 --greedy
fetch model.safetensors (548 MiB) sha ok fetch vocab.json (1 MiB) sha ok fetch merges.txt (445 KiB) sha ok prompt: Hello, world mode: greedy · max-tokens: 10 · temperature: 1.00 · top-k: 40 --- Hello, world. I'm sorry, but I'm
vs transformers.AutoModel.from_pretrained('gpt2') Same model class, no Python, no CGO, no shim. The weights load through the pure-Go safetensors reader; the BPE tokenizer is pure Go too.
Why forward-only, and where does this fit?

Training GPT-2 from scratch needs OpenWebText (about 40 GB), a multi-day budget, and a tuned schedule; that is out of scope for v1. Forward inference gives you a clean side-by-side against a reference implementation: the same weights drive HuggingFace and anneal, and the outputs should agree (to f32 noise).

The BPE encoder is pure Go (no tiktoken dependency). The safetensors loader is pure Go (no Python shim). The embedding lookup at the front of the model is the same Gather op from the nanoGPT demo. Nothing changes at the compiler level: this is the same pipeline, with bigger weights.

07going further

Explore the compiler interactively.

From here the most useful surfaces are the visualizer, the graph dump, and the source.

$ anneal viz # live visualizer in your browser (WASM)
$ anneal graph mlp # dump the UOp DAG for a model
$ anneal explain add # trace rewrite rules for one op

For a static walkthrough of an example pipeline without spinning up anneal viz, see the visualizer demo. To contribute, read CONTRIBUTING.md and the architecture spec at SPEC.md. The source lives at github.com/georgebuilds/anneal.

08troubleshooting

Common failure modes, and what to do about them.

No WebGPU adapter detected

On macOS Apple Silicon the WebGPU adapter is built in; if anneal doctor reports no adapter, your Go build likely didn't link the Metal path. Reinstall with go install on the same machine you intend to run on.

On Linux, install a Vulkan-capable driver for your GPU and re-run anneal doctor. On Windows, ensure DX12 is available.

Asset downloads keep retrying / network is sketchy

Set ANNEAL_OFFLINE=1 to fail closed when the cache is empty, so you don't waste a retry budget. Pre-populate $ANNEAL_CACHE_DIR (defaults to $XDG_CACHE_HOME/anneal) from a machine that has network. The cache directory layout is stable.

SHA mismatch on a downloaded asset

This is fail-closed by design. The likely cause is a corrupted partial download (network was cut mid-stream). Delete the offending file in $ANNEAL_CACHE_DIR and re-run; anneal does not silently accept mismatched assets.

Training is much slower than the docs imply

First-run compile time is dominated by WGSL pipeline creation and (optionally) BEAM autotuning. Subsequent runs hit the disk cache and start in well under a second. On Linux + wgpu, expect ~2x the latency of an M3 Mac because the Vulkan path is less tuned and the driver layer is thicker.

Linux + wgpu setup notes

You need an up-to-date Mesa or a vendor Vulkan driver. vulkaninfo should list at least one device; anneal doctor will then find it. If vulkaninfo works but anneal reports no adapter, file an issue with the full anneal doctor output.

qaquestions

Frequently asked.

Grouped by topic. Click to expand.

install & environment

What if I have no GPU?
A WebGPU-capable adapter is required by design. anneal does not ship a CPU execution path; the assumption that every program is a WGSL kernel is wired into the scheduler. If you only have CPU, you can still build, but anneal run will fail at device probe.
What if I'm on Windows?
The wgpu-native DirectX 12 path works but is the least exercised of the three platforms. If something breaks, file an issue on the GitHub tracker with anneal doctor output attached.
Why Go, not Python?
Zero-CGO is the deciding constraint: one static binary, no env management, no pip install, no driver-matched wheels. The visualizer ships as the same Go code compiled to WASM, so the live demo runs the real compiler.

nanogpt

Why char-level instead of BPE?
A vocab of about 65 keeps the embedding tiny and removes the tokenizer as a dependency. The point of this demo is forward + backward fusion across the scheduler seam, not language modeling. BPE comes in for GPT-2 because that is the format the released weights expect.
How long does training take?
Roughly a few minutes for 2000 steps on an M3 Max; about 2x that on a typical Linux + wgpu box. Numbers vary with kernel cache warmth.
Can I change the model size?
The flags exposed in v1 are --steps, --batch, and --seed. Architectural knobs (layers, heads, dmodel) live in the example source, not on the command line, because the goal here is a stable reference, not a hyperparam playground.

gpt-2

Why no GPT-2 fine-tuning or training?
Training from scratch needs OpenWebText (about 40 GB) and a multi-day budget on a single device; fine-tuning needs a curated dataset and a stable evaluation. Both are out of scope for v1. Forward inference gives you a side-by-side check against HuggingFace, which is what matters most for the model class: same weights, same outputs.
Where do the GPT-2 weights come from?
From HuggingFace's gpt2 model repository: model.safetensors, vocab.json, merges.txt. Each asset is SHA-pinned in code and atomically downloaded into the cache.
Why 548 MB? Can I get smaller?
That is FP32 weight storage. F16 storage is on the roadmap (the device feature is already gated through anneal doctor); when it lands, the same checkpoint will load at half the size.
Can I run a different model size?
Not in v1. The architecture and weight layout are pinned to GPT-2-small to keep the compiler-correctness check tight against a single reference. Medium / large are deferred.

compiler concepts

What is a UOp?
A UOp is one immutable IR node: an op kind, a dtype, an output shape, and pointers to its source UOps. They live in an arena, they are interned, and they are never mutated. Structural equality is identity equality, so deep comparison disappears from the hot path. See uop/.
What is rangeify?
A scheduling model where every kernel is indexed by its output range, and movement ops (reshape, permute, expand, pad, shrink, flip) are absorbed as index arithmetic rather than data copies. The only thing that materializes a buffer is the scheduler when it decides to. See SPEC.md for the full design.
How is anneal's gather different from PyTorch's?
Forward is the same: an indexed load. The backward is what differs: anneal lowers to deterministic sort + segment-sum into the index axis, so gradients are stable and cheap. tinygrad emulates gather via one-hot × W, which is correct but pays an O(VxN) tensor multiply on the backward; anneal avoids that.