a hands-on course

Train real networks from scratch, then look inside the compiler.

anneal is a machine learning compiler written in Go. This course teaches you to train neural networks end to end on your own GPU, with no Python and no CUDA, and then to open up the compiler and watch how it turns your model into GPU code. You learn the network and the machine underneath it at the same time, because here they are the same artifact.

1. pick a model to build
2. follow the path: install, train, sample
3. look inside the kernels it compiled

what you will be able to do

Train a network end to end on a real GPU from a single Go binary.
Read the WGSL kernels the compiler emitted, with the fusion boundaries marked.
Explain how autodiff and kernel fusion fall out of one immutable graph.
Recognise the modern transformer stack (RMSNorm, RoPE, GQA, SwiGLU) in working code.

New here? The MLP is the fastest first run and the cleanest look at forward and backward fusing into one kernel. Pick a lesson below, or jump straight into setup. Already set up and want the payoff? Go to train it. The model picker at the top follows you down the page, so you can switch models at any time and every step re-adapts.

pick a lesson

Each lesson trains a model and opens up the compiler.

Roughly easiest to hardest. Pick one to set it as your path through the course; the picker at the top switches between them anytime. New lessons land here as they ship.

A 2 → 8 → 1 perceptron learns y = x₁² + x₂² in seconds.

forward + backward fuse into one kernel
the live training TUI

time: seconds · data: none

A small conv net: conv2d → relu → shrink → flatten → linear.

movement ops are free index math
im2col + one matmul

time: seconds · data: none

A char-level transformer trains on Shakespeare and writes some back.

attention, end to end
embedding as a real Gather op

time: minutes · data: ~1 MB

Llama-style decoder

The modern small-LM stack, from scratch, on the same dataset as nanoGPT.

RMSNorm, RoPE, GQA, SwiGLU
tied input/output embeddings

time: minutes · data: ~1 MB

A bidirectional encoder trained by masked language modeling on Shakespeare.

non-causal attention reads left and right
masked-LM loss on hidden tokens

time: minutes · data: ~1 MB

A tiny vision transformer classifies 32×32 RGB images.

patch embedding as reshapes
the encoder stack on images

time: seconds · data: none

Real 3×3 convolutions, residual blocks, and BatchNorm on CIFAR-10.

im2col-as-one-matmul lowering
residual + ReLU epilogue fusion

data: ~170 MB · note: forward + FD-tested

Run and fine-tune the real HuggingFace weights, in pure Go.

load safetensors, tied-embed gradient
AdamW + a JIT-replayed step

data: ~550 MB · time: ~40s/step

A small GPT whose feedforward is a mixture of expert networks with a router.

soft routing over expert FFNs
a load-balance auxiliary loss

time: minutes · data: ~1 MB

A Diffusion Transformer denoises CIFAR-10 with adaLN-zero conditioning.

patchify + adaLN-zero blocks
classifier-free guidance

data: ~160 MB · device: GPU

A one-step generative model on CIFAR-10, on the same DiT backbone.

average velocity, sampled in one step
forward-mode autodiff (a JVP target)

data: ~160 MB · device: GPU

more on the way

More architectures are on the roadmap. New lessons slot in here as they ship.

watch the repo ↗

about this network

What you are about to train.

This is the architecture, in plain terms, before you build it. It tracks the model you picked above: switch the picker and this description re-adapts to whichever network is selected.

MLP

The multilayer perceptron is the foundational feedforward network: layers of fully-connected units, where every input connects to every output, stacked with a nonlinear activation between them. The key is that nonlinearity (a simple function like ReLU, which clips negative values to zero): without it, stacked linear layers would collapse into a single linear map, but with it the network can bend and fold its input to fit almost any continuous function. This is the universal approximation property, and it makes the MLP the "hello world" of deep learning. Picture each layer reshaping the data until the classes pull apart.
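The collapse described above can be checked numerically. The following is a minimal sketch with scalar "layers" and made-up weights (none of these names come from anneal): two stacked linear maps agree everywhere with one precomputed linear map, while the same stack with a ReLU in the middle does not.

```go
package main

import "fmt"

// linear applies y = w*x + b for a scalar input (a 1 -> 1 "layer").
func linear(w, b, x float64) float64 { return w*x + b }

// relu clips negative values to zero.
func relu(x float64) float64 {
	if x < 0 {
		return 0
	}
	return x
}

// stackedLinear composes two linear layers with no activation between them.
func stackedLinear(x float64) float64 {
	return linear(3, -1, linear(2, 1, x))
}

// collapsed is the single linear map the stack above reduces to:
// 3*(2x + 1) - 1 = 6x + 2.
func collapsed(x float64) float64 { return linear(6, 2, x) }

// stackedReLU is the same stack with a ReLU between the two layers.
func stackedReLU(x float64) float64 {
	return linear(3, -1, relu(linear(2, 1, x)))
}

func main() {
	for _, x := range []float64{-2, -1, 0, 1} {
		fmt.Printf("x=%v stacked=%v collapsed=%v with-relu=%v\n",
			x, stackedLinear(x), collapsed(x), stackedReLU(x))
	}
}
```

For negative pre-activations the ReLU version departs from the straight line, which is the bend a purely linear stack can never produce.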

ConvNet

A convolutional network processes grid-shaped data such as images using filters: small kernels of weights that slide across the input, computing a local response at every position. Its key idea is weight sharing: because the same kernel scans the whole image, a feature learned in one spot is detected everywhere, and the layer uses far fewer parameters than a fully-connected one of equal reach. This matches the nature of images, where nearby pixels relate and a pattern can appear at any location, and it is what first made image recognition practical at scale. An edge detector stays useful no matter where the edge falls.
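Weight sharing is easy to see in a few lines. This is an illustrative sketch, not anneal's conv2d: one hypothetical 2 by 2 vertical-edge kernel slides over a 4 by 4 image and responds wherever the edge sits, in every row, using the same four weights.

```go
package main

import "fmt"

// conv2d slides one k x k kernel over a square image (stride 1, no padding)
// and returns the local response at every position.
func conv2d(img, kernel [][]float64) [][]float64 {
	k := len(kernel)
	n := len(img) - k + 1
	out := make([][]float64, n)
	for i := range out {
		out[i] = make([]float64, n)
		for j := range out[i] {
			for di := 0; di < k; di++ {
				for dj := 0; dj < k; dj++ {
					out[i][j] += img[i+di][j+dj] * kernel[di][dj]
				}
			}
		}
	}
	return out
}

// edgeDemo runs a 2x2 vertical-edge kernel over a 4x4 image whose left half
// is dark (0) and whose right half is bright (1).
func edgeDemo() [][]float64 {
	img := [][]float64{
		{0, 0, 1, 1},
		{0, 0, 1, 1},
		{0, 0, 1, 1},
		{0, 0, 1, 1},
	}
	kernel := [][]float64{
		{-1, 1},
		{-1, 1},
	}
	return conv2d(img, kernel)
}

func main() {
	for _, row := range edgeDemo() {
		fmt.Println(row)
	}
	// The kernel has 4 shared weights; a dense layer mapping the same
	// 16 inputs to the same 9 outputs would need 16*9 = 144.
	fmt.Println("conv weights:", 2*2, "dense weights:", 16*9)
}
```

Every output row shows the same response in the middle column, because the same kernel found the same edge at each vertical position.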

nanoGPT

nanoGPT is a small decoder-only GPT, a causal transformer that reads a stream of text and learns to predict the next token (here, the next character) from everything that came before it. The key mechanism is masked self-attention: at each position the model weighs earlier tokens to decide what matters, but a causal mask hides every future token so prediction never peeks ahead. This is the same recipe behind modern chat models, stripped to its teaching essentials by Andrej Karpathy. The intuition is simple: train on enough text and "what comes next?" gradually turns into spelling, grammar, and style, one character at a time.
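The causal mask can be sketched as a softmax that simply never looks at columns to the right of the current row. This is a standalone illustration with an assumed function name, not the attention code anneal ships.

```go
package main

import (
	"fmt"
	"math"
)

// causalSoftmax turns raw attention scores into weights. Row i is the query
// at position i; entries with j > i (future tokens) are excluded from the
// softmax, so they keep a weight of exactly zero.
func causalSoftmax(scores [][]float64) [][]float64 {
	n := len(scores)
	out := make([][]float64, n)
	for i := 0; i < n; i++ {
		out[i] = make([]float64, n)
		maxv := math.Inf(-1)
		for j := 0; j <= i; j++ {
			maxv = math.Max(maxv, scores[i][j])
		}
		sum := 0.0
		for j := 0; j <= i; j++ {
			out[i][j] = math.Exp(scores[i][j] - maxv) // subtract max for stability
			sum += out[i][j]
		}
		for j := 0; j <= i; j++ {
			out[i][j] /= sum
		}
	}
	return out
}

func main() {
	scores := [][]float64{
		{0, 0, 0},
		{0, 0, 0},
		{0, 0, 0},
	}
	for _, row := range causalSoftmax(scores) {
		fmt.Println(row)
	}
}
```

With equal scores, each row spreads its weight evenly over the positions it is allowed to see, and the upper triangle stays at zero.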

Llama-style decoder

The Llama-style decoder is a decoder-only language model that keeps GPT's next-token objective but updates almost every component around it. RMSNorm normalizes activations by their root-mean-square for cheaper, steadier training; rotary position embeddings (RoPE) encode order by rotating the query and key vectors by an angle that grows with position; grouped-query attention lets several attention heads share one set of key/value vectors to save memory; and SwiGLU, a gated feedforward layer, replaces the plain one. Input and output embeddings are tied, sharing a single weight matrix. None of these change what the model does; together they are the 2023-onward refinement that made large decoders cheaper to train and run.
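Two of these components fit in a short sketch. The code below is a simplified illustration (scalar feature pairs, no learned gain, a single assumed frequency of 0.1), not anneal's tensor/nn implementation. It shows RMSNorm scaling a vector to unit root-mean-square, and the property that makes RoPE useful: the query-key dot product depends only on the distance between the two positions.

```go
package main

import (
	"fmt"
	"math"
)

// rmsNorm divides x by its root-mean-square. The learned per-feature gain
// that usually follows is omitted to keep the sketch short.
func rmsNorm(x []float64, eps float64) []float64 {
	sumSq := 0.0
	for _, v := range x {
		sumSq += v * v
	}
	rms := math.Sqrt(sumSq/float64(len(x)) + eps)
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = v / rms
	}
	return out
}

// rope rotates one (x0, x1) feature pair by an angle proportional to the
// token position. theta is the pair's base frequency (assumed value).
func rope(x0, x1 float64, pos int, theta float64) (float64, float64) {
	sin, cos := math.Sincos(float64(pos) * theta)
	return x0*cos - x1*sin, x0*sin + x1*cos
}

// score is the dot product of a rotated query at position m and a rotated
// key at position n, for fixed example vectors.
func score(m, n int) float64 {
	q0, q1 := rope(1.0, 2.0, m, 0.1)
	k0, k1 := rope(0.5, -1.0, n, 0.1)
	return q0*k0 + q1*k1
}

func main() {
	fmt.Println(rmsNorm([]float64{3, 4}, 1e-6))
	// The same offset (m - n = 2) at different absolute positions gives
	// the same attention score, up to floating-point rounding.
	fmt.Println(score(5, 3), score(2, 0))
}
```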

BERT

BERT is an encoder-only transformer, meaning it is built to read and represent text rather than generate it. It trains by masked language modeling: some tokens are randomly hidden, and the model must recover each one using context from both sides at once, left and right. That bidirectional view is the contrast with GPT, which only ever looks leftward. BERT's 2018 release showed that a single model pretrained this way could be fine-tuned to top many language-understanding tasks (classification, question answering, tagging), making transfer learning the default for NLP. The intuition is fill-in-the-blank: to guess a missing word well, you have to understand the whole sentence.

ViT

The Vision Transformer brings the transformer, an architecture built for text, to images. It cuts an image into a grid of fixed-size patches, treats each patch as a token (the way a word is a token in language), embeds it as a vector, and feeds the sequence through a standard transformer encoder. The engine there is self-attention: every patch can look directly at every other patch and weigh how much each one matters, rather than building up context through local steps. Its lesson is that, given enough training data, attention can stand in for convolution in vision. In effect, the model reads a picture as a sentence of patches.
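Cutting an image into patch tokens is pure index arithmetic. The sketch below uses a single-channel 32 by 32 image and an assumed patch size of 4 to keep the numbers small; the function names are illustrative and not taken from anneal.

```go
package main

import "fmt"

// patchify cuts a single-channel H x W image into non-overlapping p x p
// patches (row-major over the patch grid) and flattens each into one token.
func patchify(img [][]float64, p int) [][]float64 {
	h, w := len(img), len(img[0])
	var tokens [][]float64
	for pr := 0; pr < h/p; pr++ {
		for pc := 0; pc < w/p; pc++ {
			tok := make([]float64, 0, p*p)
			for i := 0; i < p; i++ {
				tok = append(tok, img[pr*p+i][pc*p:(pc+1)*p]...)
			}
			tokens = append(tokens, tok)
		}
	}
	return tokens
}

// rampImage builds a size x size image whose pixel value is its flat index,
// which makes it easy to see where each token's values came from.
func rampImage(size int) [][]float64 {
	img := make([][]float64, size)
	for r := range img {
		img[r] = make([]float64, size)
		for c := range img[r] {
			img[r][c] = float64(r*size + c)
		}
	}
	return img
}

func main() {
	tokens := patchify(rampImage(32), 4)
	fmt.Println(len(tokens), "tokens of", len(tokens[0]), "values each")
}
```

A 32 by 32 image with 4 by 4 patches yields an 8 by 8 grid, so 64 tokens; with three colour channels each token would carry 48 values instead of 16.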

ResNet-9

ResNet-9 is a compact, nine-layer convolutional network built from residual blocks. Its defining idea is the residual, or skip, connection: a block adds its own input back to its output, so it learns only the change to apply rather than a full transformation. This matters because the skip gives gradients (the error signals used in training) a direct path backward, which keeps deep networks from stalling. Residual connections (He and colleagues, 2015) made networks of hundreds of layers trainable. This nine-layer recipe, from David Page, is a well-known fast way to train a CIFAR-10 image classifier, and the skip acts as a shortcut the gradient can always take home.
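The gradient argument can be made concrete with a toy model. Assume, purely for illustration, that each block's learned part is a scalar map with a small slope: without a skip, the gradient through the stack is the product of those slopes; with a skip, each factor becomes one plus the slope.

```go
package main

import "fmt"

// gradThrough returns d(output)/d(input) through a stack of scalar blocks.
// Each block's learned part has the given slope. A residual block computes
// x + f(x), so its local derivative is 1 + slope instead of slope.
func gradThrough(depth int, slope float64, residual bool) float64 {
	g := 1.0
	for i := 0; i < depth; i++ {
		if residual {
			g *= 1 + slope
		} else {
			g *= slope
		}
	}
	return g
}

func main() {
	fmt.Println("plain:   ", gradThrough(50, 0.01, false))
	fmt.Println("residual:", gradThrough(50, 0.01, true))
}
```

In this toy setting the plain stack's gradient shrinks to a value far too small to train with, while the residual stack's gradient stays near one. Real blocks are not scalar maps, but the identity path plays the same role.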

GPT-2-small

This network is GPT-2-small, the 124-million-parameter version of OpenAI's 2019 model, but instead of training it from scratch you begin from the released pretrained weights and keep training on new text. That move is called fine-tuning, an instance of transfer learning: a model first learns general structure from a large corpus, then adapts to a smaller, specific task. It matters because pretraining is expensive and data-hungry, while most real problems are neither. The intuition is that the model already knows English grammar and broad facts; fine-tuning only nudges those existing weights toward the vocabulary, tone, and patterns of your dataset.

MoE

A Mixture of Experts (MoE) replaces the single feedforward network inside a transformer block with many parallel "expert" networks plus a small router that decides which experts handle each token. The point is to grow the total parameter count, and so the model's capacity, while keeping the compute spent on any one token roughly fixed. Large models do this with sparse routing: each token is sent to only its top few experts. A teaching version can instead softly weight all experts by the router's scores, which is simpler and still shows the core idea. The intuition is specialization: over training, different experts come to handle different kinds of tokens.
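The soft-routing variant described above reduces to a softmax over router scores followed by a weighted sum of expert outputs. The sketch below uses scalar experts and hypothetical names; it illustrates the idea rather than anneal's MoE layer.

```go
package main

import (
	"fmt"
	"math"
)

// softmax converts router logits into weights that sum to one.
func softmax(logits []float64) []float64 {
	maxv := logits[0]
	for _, v := range logits {
		if v > maxv {
			maxv = v
		}
	}
	out := make([]float64, len(logits))
	sum := 0.0
	for i, v := range logits {
		out[i] = math.Exp(v - maxv)
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

// softMoE runs every expert and blends the outputs by the router weights.
func softMoE(x float64, routerLogits []float64, experts []func(float64) float64) float64 {
	weights := softmax(routerLogits)
	y := 0.0
	for i, expert := range experts {
		y += weights[i] * expert(x)
	}
	return y
}

// demo routes the input 1.0 across two toy experts: one doubles, one negates.
func demo(routerLogits []float64) float64 {
	experts := []func(float64) float64{
		func(x float64) float64 { return 2 * x },
		func(x float64) float64 { return -x },
	}
	return softMoE(1.0, routerLogits, experts)
}

func main() {
	fmt.Println(demo([]float64{0, 0}))    // router undecided: even blend
	fmt.Println(demo([]float64{50, -50})) // router strongly prefers expert 0
}
```

Sparse routing would instead zero out all but the top few weights before the sum, which is what keeps per-token compute fixed in large models.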

DiT

DiT is a diffusion model whose denoiser is a transformer instead of the convolutional U-Net of earlier diffusion models. The noisy image is cut into small patches, each treated as a token (a unit in a sequence) like words in a sentence, and the transformer attends across all patches at once. The noise level and the class being generated enter through adaptive layer normalization (adaLN): that conditioning rescales and shifts the activations inside each block. It matters because diffusion inherits the transformer's clean scaling, where adding parameters and compute reliably improves samples. DiT (Peebles and Xie, 2022) is the backbone behind many recent image and video generators.
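The rescale-and-shift step can be sketched in a few lines. This follows the common convention of multiplying by (1 + scale), which is an assumption about the exact form; the scale and shift are plain scalars here, whereas a real block would produce per-feature vectors from the conditioning embedding.

```go
package main

import (
	"fmt"
	"math"
)

// layerNorm normalizes x to zero mean and unit variance, with no learned
// affine parameters (the conditioning supplies them instead).
func layerNorm(x []float64, eps float64) []float64 {
	mean := 0.0
	for _, v := range x {
		mean += v
	}
	mean /= float64(len(x))
	variance := 0.0
	for _, v := range x {
		variance += (v - mean) * (v - mean)
	}
	variance /= float64(len(x))
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = (v - mean) / math.Sqrt(variance+eps)
	}
	return out
}

// modulate applies the conditioning: y = x*(1 + scale) + shift.
// With scale = 0 and shift = 0 it is the identity.
func modulate(x []float64, scale, shift float64) []float64 {
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = v*(1+scale) + shift
	}
	return out
}

func main() {
	normed := layerNorm([]float64{1, 3}, 0)
	fmt.Println(normed)
	// scale and shift would come from the timestep and class embedding.
	fmt.Println(modulate(normed, 1.0, 0.5))
}
```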

MeanFlow

MeanFlow (Geng and colleagues, 2025) is a generative model built to turn noise into a sample in a single step. Flow-based models learn a velocity field, a rule for how to move a point from noise toward data, which you normally follow through many small updates along a curved path. MeanFlow instead learns the average velocity over the whole trip, so one evaluation gives the straight-shot displacement from noise directly to a sample. The cost is a more involved training objective that ties this average to the instantaneous velocity. It matters because the many network passes needed per sample are the main bottleneck in diffusion-style generation, and MeanFlow cuts them to one.
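In symbols, and as a sketch of the idea rather than a statement of anneal's exact loss: write $z_t$ for the point at time $t$ (with $t = 1$ pure noise and $t = 0$ data, a convention assumed here), $v$ for the instantaneous velocity, and $u$ for the average velocity over the interval $[r, t]$.

```latex
% Average velocity over [r, t]
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Relation between average and instantaneous velocity,
% obtained by differentiating (t - r)\,u with respect to t
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)

% One-step sampling: start from noise z_1 and jump straight to t = 0
z_0 = z_1 - u(z_1, 0, 1)
```

The total derivative in the second line is what calls for a Jacobian-vector product during training, which is why the lesson list pairs MeanFlow with forward-mode autodiff.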

01. setup

What you need: a Go toolchain, a WebGPU adapter, a little disk.

anneal is zero-CGO and ships as a single static Go binary. You need Go 1.26.3 or newer (verify with go version), a WebGPU adapter on your platform (matrix below), and, for the models that train on real data, a network connection on first run so anneal can fetch the corpus or weights into its asset cache. The card below is tailored to the model you picked.

disk / network for MLP None beyond the Go install. The MLP uses a 16-sample synthetic dataset baked into the binary; no downloads, no cache writes.

disk / network for ConvNet None beyond the Go install. The ConvNet uses a synthetic 6 by 6 spatial dataset baked into the binary; no downloads, no cache writes.

disk / network for nanoGPT About 1 MB for the Tiny Shakespeare corpus on first run, cached in $ANNEAL_CACHE_DIR (defaults to $XDG_CACHE_HOME/anneal). Subsequent runs are offline.

disk / network for Llama About 1 MB for the Tiny Shakespeare corpus on first run (the same corpus nanoGPT uses), cached in $ANNEAL_CACHE_DIR (defaults to $XDG_CACHE_HOME/anneal). Subsequent runs are offline.

disk / network for BERT About 1 MB for the Tiny Shakespeare corpus on first run (the same corpus nanoGPT uses), cached in $ANNEAL_CACHE_DIR (defaults to $XDG_CACHE_HOME/anneal). Subsequent runs are offline.

disk / network for ViT None beyond the Go install. The Vision Transformer trains on a synthetic class-conditional 32 by 32 RGB dataset generated in-process; no downloads, no cache writes.

disk / network for ResNet-9 About 170 MB on first run: the official CIFAR-10 binary distribution (cifar-10-binary.tar.gz) is fetched into $ANNEAL_CACHE_DIR, SHA-pinned and atomically written. Subsequent runs are offline; set ANNEAL_OFFLINE=1 to fail closed when the cache is cold. The tarball stays gzipped in cache and is streamed through archive/tar on every load (no extraction step).

disk / network for GPT-2 About 550 MB on first run: model.safetensors (~548 MB), vocab.json (~1 MB), and merges.txt (~445 KB), SHA-pinned and atomically written to $ANNEAL_CACHE_DIR. Subsequent runs are offline; set ANNEAL_OFFLINE=1 to fail closed when the cache is cold.

disk / network for MoE About 1 MB for the Tiny Shakespeare corpus on first run (the same corpus nanoGPT uses), cached in $ANNEAL_CACHE_DIR (defaults to $XDG_CACHE_HOME/anneal). Subsequent runs are offline.

disk / network for DiT About 160 MB on first run: the CIFAR-10 binary distribution is fetched into $ANNEAL_CACHE_DIR, SHA-pinned and atomically written (the same dataset ResNet-9 uses). Subsequent runs are offline; set ANNEAL_OFFLINE=1 to fail closed when the cache is cold.

disk / network for MeanFlow About 160 MB on first run: the CIFAR-10 binary distribution is fetched into $ANNEAL_CACHE_DIR, SHA-pinned and atomically written (the same dataset DiT and ResNet-9 use). Subsequent runs are offline; set ANNEAL_OFFLINE=1 to fail closed when the cache is cold.

primary

macOS (M-series)

Apple Silicon (M1, M2, M3, M4). WebGPU routes through Metal. The most exercised path; fastest first-run.

supported

Linux + wgpu

Vulkan-capable GPU via wgpu-native. Works; less exercised than macOS. Driver setup is on you.

experimental

Windows + wgpu

DirectX 12 path via wgpu-native. Builds; report issues on GitHub if you hit them.

vs PyTorch: No CUDA install, no driver-matched wheels, no virtualenv. The Go toolchain plus your platform's WebGPU adapter is the whole prereq list.

Why WebGPU and not CUDA?

WebGPU is the cross-vendor compute substrate that runs on Metal, Vulkan, and DirectX 12 from a single shader language (WGSL). It lets the entire stack stay zero-CGO and lets the same WGSL the CLI emits also drive the in-browser visualizer.

02. install

One command, one binary.

Install the CLI with go install. The binary lands in $GOPATH/bin (or $GOBIN if you set it); make sure that directory is on your $PATH.

$ go install github.com/georgebuilds/anneal/cmd/anneal@latest

go: downloading github.com/georgebuilds/anneal v0.2.1

$ anneal --version

anneal v0.2.1 (commit e2bfee1, go1.26.3)

vs pip install torch + CUDA setup: anneal is a single static Go binary. No driver-matched wheels, no virtualenv to keep alive, no Python on the runtime side.

Now confirm the binary can list its model registry. You should see the model you picked in the output:

$ anneal train

usage: anneal train <model> [--steps=N] [--lr=F] [--log-every=N] [--batch=N] [--plain]
available models:
  mlp      2 to 8 to 1 multilayer perceptron; trains on y = x₁² + x₂²
  conv     conv2d(1 to 4, 3x3) + relu + shrink + flatten + linear
  dynmlp   2 to 8 to 1 MLP with symbolic batch dim (same task as mlp)
  nanogpt  char-level transformer; trains on tinyshakespeare
  vit      vision transformer (patch embed + 2-block encoder + mean-pool head)

$ anneal train

available models:
  mlp      2 to 8 to 1 multilayer perceptron; trains on y = x₁² + x₂²
  conv     conv2d(1 to 4, 3x3) + relu + shrink + flatten + linear
  nanogpt  char-level transformer; trains on tinyshakespeare
  llama    Llama-style decoder (RMSNorm, GQA + RoPE, SwiGLU); trains on tinyshakespeare
  vit      vision transformer (patch embed + 2-block encoder + mean-pool head)

$ anneal train

available models:
  conv     conv2d(1 to 4, 3x3) + relu + shrink + flatten + linear
  nanogpt  char-level transformer; trains on tinyshakespeare
  llama    Llama-style decoder (RMSNorm, GQA + RoPE, SwiGLU); trains on tinyshakespeare
  bert     BERT encoder (bidirectional attention + masked-LM); trains on tinyshakespeare
  vit      vision transformer (patch embed + 2-block encoder + mean-pool head)

$ anneal train

available models:
  mlp      2 to 8 to 1 multilayer perceptron; trains on y = x₁² + x₂²
  conv     conv2d(1 to 4, 3x3) + relu + shrink + flatten + linear
  nanogpt  char-level transformer; trains on tinyshakespeare
  llama    Llama-style decoder (RMSNorm, GQA + RoPE, SwiGLU); trains on tinyshakespeare
  vit      vision transformer (patch embed + 2-block encoder + mean-pool head)
  resnet9  ResNet-9 on CIFAR-10 (David Page architecture, 6.57M params)

$ anneal gpt2

usage: anneal gpt2 <subcommand> [flags]
subcommands:
  sample <prompt>  sample text from GPT-2-small (HuggingFace weights)
run 'anneal gpt2 sample --help' for sample-specific flags.

$ anneal train

available models:
  nanogpt  char-level transformer; trains on tinyshakespeare
  llama    Llama-style decoder (RMSNorm, GQA + RoPE, SwiGLU); trains on tinyshakespeare
  bert     BERT encoder (bidirectional attention + masked-LM); trains on tinyshakespeare
  moe      Mixture-of-Experts LM (router + expert FFNs, soft routing + load-balance loss); trains on tinyshakespeare
  vit      vision transformer (patch embed + 2-block encoder + mean-pool head)

$ anneal train

available models:
  nanogpt  char-level transformer; trains on tinyshakespeare
  vit      vision transformer (patch embed + 2-block encoder + mean-pool head)
  dit      Diffusion Transformer (adaLN-zero, classifier-free guidance) on CIFAR-10

$ anneal train

available models:
  vit       vision transformer (patch embed + 2-block encoder + mean-pool head)
  dit       Diffusion Transformer (adaLN-zero, classifier-free guidance) on CIFAR-10
  meanflow  MeanFlow one-step generative model (average-velocity, forward-mode JVP) on CIFAR-10

Why go install and not a release binary?

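go install with @latest resolves to the newest tagged release and builds it from source on your machine, and the go command verifies the downloaded module against the Go checksum database. For a reproducible build, pin a specific version instead, for example @v0.2.1. The result is a single statically linked binary; no runtime dependency on the Go toolchain after that.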

03. check gpu

Confirm a WebGPU adapter is reachable.

anneal doctor probes for a WebGPU adapter on your platform and prints the device info it finds. Run this before anything else; it tells you in two seconds whether the rest of the course can succeed.

For MLP, the only line that matters is status: ok. Any backend (Metal, Vulkan, DX12) works; the model fits in any device's buffer limits.

For ConvNet, the only line that matters is status: ok. Any backend works; the network is small enough to fit anywhere.

For nanoGPT, you want status: ok and a reasonable max buffer size (anything in the GB range is fine). The model is small; the wall-time in step 04 is dominated by kernel compile and dispatch, not buffer pressure.

For Llama, you want status: ok and a reasonable max buffer size (anything in the GB range is fine). The model is the same size as nanoGPT; the wall-time in step 04 is dominated by kernel compile and dispatch, not buffer pressure.

For BERT, you want status: ok and a reasonable max buffer size (anything in the GB range is fine). The model is the same size as nanoGPT; the wall-time in step 04 is dominated by kernel compile and dispatch, not buffer pressure.

For ViT, the only line that matters is status: ok. Any backend works; the model fits in any device's buffer limits (about 80K parameters total, dominated by the patch projection and two encoder blocks).

For ResNet-9, check that max buffer size comfortably exceeds the largest activation: at the canonical 64/128/256/512 channels, the layer3 stage produces a [B, 512, 4, 4] tensor (32 KB per sample, 1 MB at B=32). The whole model and its activations fit easily on any discrete GPU; the binding constraint at present is the WGSL codegen on very-deep backward graphs, not buffer pressure.

For GPT-2, check that max buffer size is comfortably above the largest single weight tensor (the FP32 embedding is about 154 MB). On Apple Silicon and most discrete GPUs this is automatic; on low-end integrated GPUs it can be the binding constraint.

For MoE, you want status: ok and a reasonable max buffer size (anything in the GB range is fine). The model is the same size as nanoGPT with several parallel expert FFNs; the wall-time in step 04 is dominated by kernel compile and dispatch, not buffer pressure.

For DiT, you want status: ok on a real adapter: training runs on the GPU only. The diffusion backward pass does not yet realize on the pure-Go CPU interpreter (a known backend limitation), so --device=cpu covers the forward path but not anneal train dit. The binding constraint is the WGSL backward surface, not buffer size.

For MeanFlow, you want status: ok on a real adapter: training runs on the GPU only. It reuses the DiT backbone (whose backward does not yet realize on the pure-Go CPU interpreter), and the continuous-time embedding adds a Sin the CPU path does not realize either, so anneal train meanflow is GPU-only. The binding constraint is the WGSL surface, not buffer size.

$ anneal doctor

webgpu adapter: Apple M3 Max (metal)
backend type: Metal
device features: shader-f16, timestamp-query
max buffer size: 2147483648 (2 GiB)
status: ok
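The buffer-size figures quoted for ResNet-9 and GPT-2 above are plain shape arithmetic, and can be reproduced and compared against the max buffer size that doctor reports. In this sketch the helper name is made up, 4 bytes per element assumes FP32, and the GPT-2-small embedding shape (50257 vocabulary entries by 768 features) is the published model's, not something read from anneal.

```go
package main

import "fmt"

// tensorBytes returns the size in bytes of a dense tensor with the given
// element width and shape.
func tensorBytes(bytesPerElem int64, dims ...int64) int64 {
	n := bytesPerElem
	for _, d := range dims {
		n *= d
	}
	return n
}

func main() {
	const f32 = 4                      // bytes per FP32 element
	const maxBuffer int64 = 2147483648 // from the sample doctor output

	perSample := tensorBytes(f32, 1, 512, 4, 4) // ResNet-9 [B, 512, 4, 4], B=1
	batch32 := tensorBytes(f32, 32, 512, 4, 4)  // same tensor at B=32
	embedding := tensorBytes(f32, 50257, 768)   // GPT-2-small token embedding

	fmt.Println("per sample:", perSample, "bytes")
	fmt.Println("batch 32:  ", batch32, "bytes")
	fmt.Println("embedding: ", embedding, "bytes")
	fmt.Println("embedding fits:", embedding < maxBuffer)
}
```

The same helper works for any other shape you want to sanity-check against your own adapter's limit.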

no GPU? A WebGPU adapter is required for the GPU path. For most models you can still follow along on the pure-Go CPU interpreter: add --device=cpu to the anneal train command. It is slower, but it runs anywhere and is the value oracle the GPU path is checked against. The exceptions are DiT and MeanFlow, whose training does not yet run on the CPU interpreter.

What does doctor actually check?

It opens a real WebGPU instance, requests an adapter and device, and reports back the adapter name, backend type (Metal, Vulkan, or DX12), device features (notably shader-f16, which gates the f16 path), and the platform's max buffer size. No kernel runs; this is a connectivity probe.

04. train it

Train your model, live in your terminal.

This is the payoff: a real network, training on your GPU, from one Go binary. The command and the numbers below are tailored to the model you picked in the catalog. A live terminal UI shows the loss curve and the real compiler stats (UOp counts, kernel counts, fused regions) as it runs.

the idea

In most frameworks, backpropagation is a separate machine: a tape that records the forward pass and replays it backward at runtime. In anneal the gradient is just more graph. Autodiff is a compiler pass that walks the forward UOp graph and emits backward UOps into the same immutable graph. Because both passes live in one graph, the scheduler can see the seam between them and fuse forward and backward work into the same GPU kernel. That is the single idea you will watch happen in the next step.

go deeper: SPEC.md on the rangeify scheduler and the gradient pass, and tensor/ for the autodiff source.
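To make "the gradient is just more graph" concrete, here is a toy sketch of the idea. It is not anneal's UOp graph or its API: every type and function name below is invented for illustration. The graph is an append-only list of nodes; the gradient pass walks the forward nodes in reverse and appends backward nodes to the same list, so one evaluator runs both.

```go
package main

import "fmt"

type opKind int

const (
	opInput opKind = iota
	opConst
	opAdd
	opMul
)

// node refers to its operands by index; operands always have lower indices.
type node struct {
	kind opKind
	a, b int
	val  float64 // used by opInput and opConst
}

// graph is append-only: existing nodes are never mutated.
type graph struct{ nodes []node }

func (g *graph) add(n node) int {
	g.nodes = append(g.nodes, n)
	return len(g.nodes) - 1
}

func (g *graph) eval(i int) float64 {
	n := g.nodes[i]
	switch n.kind {
	case opAdd:
		return g.eval(n.a) + g.eval(n.b)
	case opMul:
		return g.eval(n.a) * g.eval(n.b)
	default:
		return n.val
	}
}

// grad appends backward nodes for d(out)/d(node) into the same graph and
// returns, for each forward node reached, the index of its gradient node.
func (g *graph) grad(out int) map[int]int {
	grads := map[int]int{out: g.add(node{kind: opConst, val: 1})}
	accumulate := func(target, contrib int) {
		if prev, ok := grads[target]; ok {
			grads[target] = g.add(node{kind: opAdd, a: prev, b: contrib})
		} else {
			grads[target] = contrib
		}
	}
	for i := out; i >= 0; i-- { // reverse topological order
		gi, ok := grads[i]
		if !ok {
			continue
		}
		n := g.nodes[i]
		switch n.kind {
		case opAdd:
			accumulate(n.a, gi)
			accumulate(n.b, gi)
		case opMul:
			accumulate(n.a, g.add(node{kind: opMul, a: gi, b: n.b}))
			accumulate(n.b, g.add(node{kind: opMul, a: gi, b: n.a}))
		}
	}
	return grads
}

// demo builds y = x1*x1 + x2*x2 at (3, 2) and differentiates it.
func demo() (y, dx1, dx2 float64) {
	g := &graph{}
	x1 := g.add(node{kind: opInput, val: 3})
	x2 := g.add(node{kind: opInput, val: 2})
	out := g.add(node{kind: opAdd,
		a: g.add(node{kind: opMul, a: x1, b: x1}),
		b: g.add(node{kind: opMul, a: x2, b: x2})})
	grads := g.grad(out)
	return g.eval(out), g.eval(grads[x1]), g.eval(grads[x2])
}

func main() {
	fmt.Println(demo())
}
```

Because forward and backward nodes sit in one structure, a later pass (a scheduler, in anneal's case) can inspect both at once; a runtime tape would only expose them as two separate replays.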

MLP: a 2 to 8 to 1 net on a synthetic task.

anneal train mlp opens a live TUI dashboard and trains a 2 to 8 to 1 MLP on a fixed 16-sample dataset for y = x₁² + x₂². The loop runs through the standard tensor.Backward() plus tensor.Realize() path, so what you see is the rangeify scheduler fusing the forward and backward passes into the same WGSL kernels. Pass --plain to disable the TUI and dump loss lines to stdout (useful in CI / pipes).

$ anneal train mlp --steps=200

training mlp: 2 to 8 to 1 multilayer perceptron; trains on y = x₁² + x₂²
device: metal (apple m3 max)
steps: 200 · lr: 0.050 · batch: 16
step 000 loss 0.4218
step 050 loss 0.0613
step 100 loss 0.0148
step 200 loss 0.0031
forward ~18 uops backward ~15 uops → fused 3 kernels
done, 200 steps

Variant: anneal train dynmlp --batch=64 trains the same task with a symbolic batch dim. The compile happens once; every Realize binds a different concrete batch size at run time. It is the smallest demo of anneal's symbolic-shapes path. See the examples/dynmlp.go source.

General API: for shapes with a symbolic axis in any position, or with more than one symbolic dim, use tensor.NewVariable(arena, "name", min, max) per axis and feed the .Sint() values into tensor.NewSymbolicShape(arena, []shape.Sint{...}, dtype, device). Realize with tensor.RealizeWithBinding(v.Bind(value), out) or, for multiple variables, tensor.MergeBindings(B.Bind(32), T.Bind(128)).

ConvNet: conv2d, relu, shrink, flatten, linear.

anneal train conv opens a live TUI dashboard and trains a small conv net on a synthetic spatial task: conv2d (1 to 4 channels, 3 by 3 kernel), ReLU, a 2 by 2 shrink, flatten, and a final linear head. Image size is 6 by 6, batch is 8. The shrink and the flatten are pure movement ops (range arithmetic, not data copies), so the whole pipeline lowers into a small number of fused kernels.

$ anneal train conv --steps=200

training conv: conv2d(1 to 4, 3x3) + relu + shrink + flatten + linear
device: metal (apple m3 max)
steps: 200 · lr: 0.050 · batch: 8
step 000 loss 1.2847
step 050 loss 0.3216
step 100 loss 0.0912
step 200 loss 0.0244
forward ~24 uops backward ~21 uops → fused 4 kernels
done, 200 steps

where the rangeify model bites: The shrink and flatten in this model would be copies in a tape-based autograd. In anneal they are index math on the output range, so they disappear at the kernel boundary and the scheduler sees one continuous program from conv to loss.
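"Index math, not data copies" can be shown with a small view type. The sketch assumes shrink means cropping to a sub-range (as in tinygrad-style movement ops), which the text does not spell out, and the type and method names are hypothetical. Both shrink and flatten only change how an index is computed; the underlying buffer is read in place.

```go
package main

import "fmt"

// view describes a 2-D window onto a flat row-major buffer. Movement ops
// return a new view; the buffer itself is never copied.
type view struct {
	offset, rows, cols, rowStride int
}

// shrink keeps rows [r0, r1) and columns [c0, c1) by adjusting the offset
// and the extents. No element moves.
func (v view) shrink(r0, r1, c0, c1 int) view {
	return view{
		offset:    v.offset + r0*v.rowStride + c0,
		rows:      r1 - r0,
		cols:      c1 - c0,
		rowStride: v.rowStride,
	}
}

// flatIndex maps position i of the flattened view to a buffer index, which
// is all a "flatten" needs to be.
func (v view) flatIndex(i int) int {
	return v.offset + (i/v.cols)*v.rowStride + i%v.cols
}

// demo crops the centre 2x2 of a 4x4 feature map holding 0..15 and reads it
// out in flattened order.
func demo() []float64 {
	buf := make([]float64, 16)
	for i := range buf {
		buf[i] = float64(i)
	}
	v := view{rows: 4, cols: 4, rowStride: 4}.shrink(1, 3, 1, 3)
	out := make([]float64, v.rows*v.cols)
	for i := range out {
		out[i] = buf[v.flatIndex(i)]
	}
	return out
}

func main() {
	fmt.Println(demo())
}
```

A compiler that folds flatIndex into the consuming kernel's addressing never has to materialize the cropped or flattened tensor at all.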

nanoGPT: a char-level transformer on Shakespeare.

anneal train nanogpt downloads the Tiny Shakespeare corpus (about 1 MB, cached on first run), constructs a tiny but transformer-shaped model (2 layers, 2 heads, embedding 64, block size 32, vocab about 65 from the corpus), trains it for the requested number of steps with Adam at lr 3e-4, then prints a Shakespeare-flavored sample seeded with "ROMEO:". The loop runs through the standard tensor.Backward() plus tensor.Realize() path, so the rangeify scheduler fuses the forward and backward passes into the same WGSL kernels.

$ anneal train nanogpt --steps=2000

fetch shakespeare.txt (~1 MiB) cached
model nanogpt-char vocab=65 layers=2 heads=2 dmodel=64 block=32
device webgpu · apple m3 max
step 0001 loss 4.17
step 0500 loss 2.38
step 2000 loss 1.52
forward 62 uops backward 49 uops → fused 7 kernels
sample:
ROMEO: but soft, what light through yonder window breaks? it is the east, and Juliet is the sun.

Teal forward, ember backward, gold marks the fused region. The scheduler collapses both chains into one WGSL kernel where the indexing model agrees.

vs PyTorch nanoGPT: About 150 lines vs about 300, because the optimizer and autograd are graph passes, not a separate framework around the model.

vs tinygrad: nn.Embedding goes through one-hot × W in tinygrad; anneal uses a real Gather op, which keeps the backward cheaper.
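The two embedding strategies compute the same rows; they differ in how much work it takes. The sketch below is a generic illustration with made-up function names, not code from either project.

```go
package main

import "fmt"

// gather looks up one row of the embedding table directly.
func gather(w [][]float64, id int) []float64 {
	return w[id]
}

// oneHotMatmul computes the same row as onehot(id) x W, touching every
// entry of the table to do it.
func oneHotMatmul(w [][]float64, id int) []float64 {
	out := make([]float64, len(w[0]))
	for row := range w {
		oneHot := 0.0
		if row == id {
			oneHot = 1.0
		}
		for col := range out {
			out[col] += oneHot * w[row][col]
		}
	}
	return out
}

func main() {
	// A toy table: 3 tokens, 2 features each.
	w := [][]float64{{1, 2}, {3, 4}, {5, 6}}
	fmt.Println(gather(w, 2), oneHotMatmul(w, 2))
}
```

The backward pass mirrors this: the gradient of a gather only adds into the rows that were looked up, while the matmul form yields a dense vocab-by-features gradient that is mostly zeros.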

Why does forward + backward fuse into the same kernel?

In anneal, gradients are graph-rewrite output, not closures over Python objects. The backward UOps live in the same DAG as the forward UOps. The rangeify scheduler indexes both passes by output range, and when the forward producer and the backward consumer share that range, the scheduler collapses them into one WGSL kernel: no intermediate buffer materializes, and the seam between the two passes vanishes.

A tape-based autograd cannot do this: it sees forward and backward as separate programs by construction. Because anneal's gradients live in the same graph, the boundary between the two passes is something the scheduler can see and decide to collapse.

Curious how it compiles? Open the visualizer to step through the same kind of forward / backward / fused-kernel pipeline.

Llama-style decoder: the modern stack, from scratch.

anneal train llama downloads the Tiny Shakespeare corpus (about 1 MB, cached on first run, the same corpus nanoGPT uses), constructs a tiny but Llama-shaped model (2 pre-RMSNorm blocks, embedding 64, block size 32, vocab about 65 from the corpus), trains it with Adam at lr 3e-4, then prints a Shakespeare-flavored sample seeded with "ROMEO:". What is architecturally new versus nanoGPT: RMSNorm replaces LayerNorm; attention is grouped-query (4 query heads sharing 2 KV heads) with RoPE rotary positions instead of vanilla multi-head attention plus a learned position embedding; the feed-forward is SwiGLU instead of a GELU MLP; and the LM head ties its weight to the token embedding. Position is carried entirely by RoPE inside attention, so there is no learned position table. Forward and backward fuse into the same WGSL kernels through the rangeify scheduler exactly as they do for nanoGPT.

$ anneal train llama --steps=2000

fetch shakespeare.txt (~1 MiB) cached
model llama-char vocab=65 layers=2 heads=4 kvheads=2 dmodel=64 block=32
device webgpu · apple m3 max
step 0001 loss 4.20
step 0500 loss 2.34
step 2000 loss 1.49
forward rmsnorm + gqa(rope) + swiglu backward → fused WGSL kernels
sample:
ROMEO: and the world that we have done the streets, and the heart of the heavens of the sea.
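With 4 query heads and 2 KV heads, grouped-query attention needs a rule for which key/value head each query head reads. A minimal sketch, assuming the common convention that query heads are grouped contiguously (whether anneal lays its heads out this way is not stated here):

```go
package main

import "fmt"

// kvHeadFor maps a query head to the key/value head it shares, assuming
// query heads are grouped contiguously and numQueryHeads is a multiple of
// numKVHeads.
func kvHeadFor(queryHead, numQueryHeads, numKVHeads int) int {
	groupSize := numQueryHeads / numKVHeads
	return queryHead / groupSize
}

func main() {
	const numQueryHeads, numKVHeads = 4, 2 // heads=4 kvheads=2 in the log
	for q := 0; q < numQueryHeads; q++ {
		fmt.Printf("query head %d -> kv head %d\n", q, kvHeadFor(q, numQueryHeads, numKVHeads))
	}
}
```

Halving the number of KV heads halves the key/value tensors that have to be stored per token, which is the memory saving the technique exists for.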

the research stack

RMSNorm, RoPE, grouped-query attention, and SwiGLU are the primitives that became universal across small open models from 2024 onward (Llama, Qwen, Gemma). This lesson is the cleanest way to see them as working, gradient-checked code rather than equations: every primitive is finite-difference-checked against its analytic gradient on both the CPU interpreter and the GPU.

papers: RoPE (Su et al., 2021), GQA (Ainslie et al., 2023), SwiGLU (Shazeer, 2020), RMSNorm (Zhang & Sennrich, 2019). source: tensor/nn.
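A finite-difference check compares a hand-derived gradient against a numerical slope. The sketch below applies it to SiLU, the activation inside SwiGLU; it illustrates the technique in isolation and is not anneal's test harness. The step size and the sample points are arbitrary choices.

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// silu is x * sigmoid(x), the gate activation used by SwiGLU.
func silu(x float64) float64 { return x * sigmoid(x) }

// siluGrad is the analytic derivative: s + x*s*(1-s), with s = sigmoid(x).
func siluGrad(x float64) float64 {
	s := sigmoid(x)
	return s * (1 + x*(1-s))
}

// finiteDiff estimates f'(x) with a central difference of half-width h.
func finiteDiff(f func(float64) float64, x, h float64) float64 {
	return (f(x+h) - f(x-h)) / (2 * h)
}

// maxGradError returns the largest gap between the analytic and numerical
// gradients over the given points.
func maxGradError(xs []float64, h float64) float64 {
	worst := 0.0
	for _, x := range xs {
		worst = math.Max(worst, math.Abs(siluGrad(x)-finiteDiff(silu, x, h)))
	}
	return worst
}

func main() {
	xs := []float64{-2, -0.5, 0, 1, 3}
	fmt.Println("max |analytic - numeric| =", maxGradError(xs, 1e-5))
}
```

If the analytic formula had a sign error or a missing term, the gap would jump by many orders of magnitude, which is what makes this check a cheap safety net for hand-written backward rules.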

Curious how it compiles? Open the visualizer to step through the same forward / backward / fused-kernel pipeline on the modern-decoder primitive stack.

BERT: a bidirectional encoder trained by masked language modeling.

anneal train bert downloads the Tiny Shakespeare corpus (about 1 MB, cached on first run, the same corpus nanoGPT uses), constructs a tiny but BERT-shaped encoder (2 pre-LN encoder blocks, 2 heads, embedding 64, block size 32, vocab about 65 from the corpus plus one [MASK] row), trains it with Adam, then recovers the hidden tokens of a masked sample. What is architecturally new versus nanoGPT: attention is non-causal, so every token reads context from both sides at once, left and right, instead of only the tokens before it; and the objective is masked language modeling, where a fraction of the input tokens (15% here) are replaced with a [MASK] symbol and the model must restore the originals, so the loss is scored only on the hidden positions. Forward and backward fuse into the same WGSL kernels through the rangeify scheduler exactly as they do for nanoGPT.

$ anneal train bert --steps=2000

fetch shakespeare.txt (~1 MiB) cached
model bert-char vocab=65 layers=2 heads=2 dmodel=64 block=32 mask=15%
device webgpu · apple m3 max
step 0001 loss 4.19
step 0500 loss 2.41
step 2000 loss 1.66
forward bidirectional attention + masked-LM head
backward → fused WGSL kernels
masked reconstruction:
in: ROMEO: but [MASK], what light through yonder [MASK] breaks
pred: ROMEO: but soft, what light through yonder window breaks

the idea

The single change that separates BERT from nanoGPT is the attention mask. nanoGPT applies a causal mask, so position i can attend only to positions up to i; BERT uses an all-ones mask, so every position attends to the whole sequence at once. That bidirectional view is why BERT trains by masked language modeling rather than next-token prediction: if the model could already see the token to its right, predicting it would be trivial, so instead you hide a fraction of the tokens and score only the recovered ones. The intuition is fill-in-the-blank: to guess a missing word well, you have to read the whole sentence.

paper: Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2018). source: tensor/nn/bert.go, examples/bert.go.
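The two ideas above, the attention mask and the masked-token objective, fit in a few lines of plain Go. This is a sketch under stated assumptions rather than the code in tensor/nn/bert.go: the function names are made up, the mask id 65 is a hypothetical "one past the character vocab" value, and -1 is used as an "ignore this position" marker for the loss.

```go
package main

import (
	"fmt"
	"math/rand"
)

// attentionMask returns an n-by-n mask; mask[i][j] reports whether
// position i may attend to position j.
func attentionMask(n int, causal bool) [][]bool {
	mask := make([][]bool, n)
	for i := range mask {
		mask[i] = make([]bool, n)
		for j := range mask[i] {
			mask[i][j] = !causal || j <= i
		}
	}
	return mask
}

// maskTokens hides a token behind maskID whenever draw() < rate. targets
// holds the original token at hidden positions and -1 elsewhere, so a loss
// can skip every position that was left visible.
func maskTokens(tokens []int, maskID int, rate float64, draw func() float64) (input, targets []int) {
	input = make([]int, len(tokens))
	targets = make([]int, len(tokens))
	for i, tok := range tokens {
		if draw() < rate {
			input[i], targets[i] = maskID, tok
		} else {
			input[i], targets[i] = tok, -1
		}
	}
	return input, targets
}

func main() {
	fmt.Println("causal:       ", attentionMask(3, true))
	fmt.Println("bidirectional:", attentionMask(3, false))

	rng := rand.New(rand.NewSource(1)) // fixed seed, for a repeatable demo
	tokens := []int{10, 11, 12, 13, 14, 15, 16, 17}
	input, targets := maskTokens(tokens, 65, 0.15, rng.Float64)
	fmt.Println("input:  ", input)
	fmt.Println("targets:", targets)
}
```

The causal mask is lower-triangular and the bidirectional one is all true. Everything else about the encoder block can stay the same.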

Curious how it compiles? Open the visualizer to step through the same forward / backward / fused-kernel pipeline on a bidirectional encoder.

ViT: a vision transformer on RGB patches.

anneal train vit opens a live TUI dashboard and trains a tiny ViT on a synthetic 32 by 32 RGB classification task: patch embedding (patch size 4, so 64 patch tokens per image), a learnable positional embedding, two pre-LN encoder blocks (non-causal attention, 4 heads, embedding 64, tanh-GELU MLP), a final LayerNorm, mean-pooling over the patch tokens, and a linear head over 10 classes. The rangeify scheduler fuses forward and backward into the same WGSL kernels exactly as it does for nanoGPT. No downloads; the dataset is generated in-process.

$ anneal train vit --steps=200

training vit: vision transformer (patch embed + 2-block encoder + mean-pool head)
device: metal (apple m3)
steps: 200 · lr: 3e-04 · batch: 8
step 000 loss 2.2363
step 050 loss 0.9874
step 100 loss 0.5612
step 200 loss 0.2104
done, 200 steps

where the rangeify model bites (again)

The patch embedding looks like a Conv2d in the ViT paper. In anneal it lowers as a reshape, a permute, a second reshape, and one Linear: every step except the Linear is index arithmetic on the output range, so no patch-unfold buffer ever materializes. The same indexing model that makes shrink and flatten free for the conv net makes patch tokenization free for the transformer.
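To see why patch tokenization needs no buffer, here is a small Go sketch of the index arithmetic for the 32 by 32 RGB, patch-size-4 case. It is an illustration, not anneal's scheduler: the function name, the channel-major image layout, and the (channel, row, column) ordering of features inside a token are all assumptions.

```go
package main

import "fmt"

const (
	channels = 3
	size     = 32 // image height and width
	patch    = 4
	grid     = size / patch             // 8 patches per side
	tokens   = grid * grid              // 64 patch tokens
	features = channels * patch * patch // 48 values per token
)

// patchIndex maps (token, feature) in the [64, 48] patch matrix to a flat
// offset in the channel-major [3, 32, 32] image, using index arithmetic only.
func patchIndex(token, feature int) int {
	py, px := token/grid, token%grid
	c := feature / (patch * patch)
	dy := (feature % (patch * patch)) / patch
	dx := feature % patch
	return c*size*size + (py*patch+dy)*size + (px*patch + dx)
}

func main() {
	image := make([]float32, channels*size*size)
	for i := range image {
		image[i] = float32(i)
	}
	// Read one token's features straight from the image; no unfold buffer.
	row := make([]float32, features)
	for f := range row {
		row[f] = image[patchIndex(9, f)]
	}
	fmt.Println(tokens, features, row[:4])
}
```

A Linear that reads its input through patchIndex sees the [64, 48] patch matrix without anyone having written it to memory.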

Curious how it compiles? Open the visualizer to step through a forward / backward / fused-kernel pipeline like this one.

ResNet-9: real convolutions, residuals, BatchNorm.

anneal run resnet9 builds the forward graph: prep Conv2d(3 to 64) + BN + ReLU, then layer1 (Conv, BN, ReLU, MaxPool) with a residual block on top, then layer2, then layer3 again with a residual block, a final 4x4 MaxPool, and a Linear(512 to 10) head. The forward path realizes cleanly on any WebGPU device. anneal train resnet9 wires Adam, fresh-arena-per-step training, BatchNorm PostStep, and the standard Backward / Realize loop; the first run downloads the official CIFAR-10 binary distribution into the asset cache.

$ anneal run resnet9

device: metal (apple m3)
network: David Page ResNet-9 (64/128/256/512), 6.57M params
input shape: [2, 3, 32, 32] → output [2, 10]
forward realized; logits finite

conv compiler surface

ResNet-9 is the level-up from the transformer demos: GPT-2's Conv1D is a transposed Linear, not a real conv. ResNet-9 exercises actual 3x3 convolutions (im2col-as-single-matmul lowering, materialized as one buffer to stay within the 8-buffer cap), residual additions, and BatchNorm's stateful EMA running mean / running variance. Every sub-module is FD-checked against analytic gradients in the tensor/nn tests (Conv2d, BatchNorm2d, MaxPool2D, Linear).
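The im2col-as-single-matmul lowering can be sketched on a single-channel image in plain Go. This is a toy under stated assumptions, not anneal's Conv2d: no padding, stride 1, one input and one output channel, hypothetical function names, and the usual deep-learning convention where "convolution" means cross-correlation.

```go
package main

import "fmt"

// im2col unfolds a single-channel h-by-w image into rows of k*k values, one
// row per output position of an unpadded, stride-1 convolution.
func im2col(img []float64, h, w, k int) (cols []float64, rows int) {
	oh, ow := h-k+1, w-k+1
	cols = make([]float64, 0, oh*ow*k*k)
	for y := 0; y < oh; y++ {
		for x := 0; x < ow; x++ {
			for dy := 0; dy < k; dy++ {
				for dx := 0; dx < k; dx++ {
					cols = append(cols, img[(y+dy)*w+(x+dx)])
				}
			}
		}
	}
	return cols, oh * ow
}

// convViaMatmul computes the convolution as one matrix-vector product
// between the unfolded image and the flattened kernel.
func convViaMatmul(img []float64, h, w int, kernel []float64, k int) []float64 {
	cols, rows := im2col(img, h, w, k)
	out := make([]float64, rows)
	for r := 0; r < rows; r++ {
		for i, kv := range kernel {
			out[r] += cols[r*k*k+i] * kv
		}
	}
	return out
}

// convDirect is the plain sliding-window reference.
func convDirect(img []float64, h, w int, kernel []float64, k int) []float64 {
	oh, ow := h-k+1, w-k+1
	out := make([]float64, oh*ow)
	for y := 0; y < oh; y++ {
		for x := 0; x < ow; x++ {
			for dy := 0; dy < k; dy++ {
				for dx := 0; dx < k; dx++ {
					out[y*ow+x] += img[(y+dy)*w+(x+dx)] * kernel[dy*k+dx]
				}
			}
		}
	}
	return out
}

func main() {
	img := make([]float64, 16) // a 4x4 image holding 0..15
	for i := range img {
		img[i] = float64(i)
	}
	kernel := []float64{1, 0, -1, 2, 0, -2, 1, 0, -1}
	fmt.Println(convViaMatmul(img, 4, 4, kernel, 3))
	fmt.Println(convDirect(img, 4, 4, kernel, 3))
}
```

Both functions produce the same four outputs. With more output channels the flattened kernel becomes a matrix and the product becomes a full matmul.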

training status

Full-network training is currently gated on a WGSL codegen scaling issue: under residual blocks plus BatchNorm plus eight conv stages, the scheduler fuses the backward pass into one enormous kernel that the WGSL renderer does not yet handle. Forward works end to end, and the per-sub-module FD-tested backward works too. The architecture and CIFAR-10 host pipeline are in place; once the codegen issue is resolved, anneal train resnet9 is expected to train to roughly 90% test accuracy with the same fresh-arena-per-step recipe the other demos use.

Curious how the conv lowers? Open the visualizer to inspect how im2col, the matmul, and the residual Add fuse together.

GPT-2-small: run, then fine-tune the real weights.

anneal loads the HuggingFace GPT-2-small checkpoint and does two things with it. First, forward inference: the next-token logits agree with the reference implementation to f32 noise (the command below downloads the weights on first run, decodes BPE in pure Go, and samples greedily from "Hello, world"). Second, fine-tuning: anneal train gpt2 takes those same weights and trains them end to end on tinyshakespeare. The rangeify scheduler fuses the forward and backward passes, the tied embedding/LM-head share one gradient, the cross-entropy is numerically stable, and the optimizer is AdamW with LR warmup. (Pretraining from scratch needs OpenWebText, about 40 GB, and a multi-day budget, which remains a deliberate non-goal; fine-tuning a pretrained checkpoint is the supported training path.)

$ anneal gpt2 sample "Hello, world" --max-tokens=10 --greedy

fetch model.safetensors (548 MiB) sha ok
fetch vocab.json (1 MiB) sha ok
fetch merges.txt (445 KiB) sha ok
prompt: Hello, world
mode: greedy · max-tokens: 10 · temperature: 1.00 · top-k: 40
---
Hello, world. I'm sorry, but I'm

vs transformers.AutoModel.from_pretrained('gpt2')

Same model class, no Python, no CGO, no shim. The weights load through the pure-Go safetensors reader; the BPE tokenizer is pure Go too.

Now fine-tune those same weights. anneal train gpt2 trains GPT-2-small end to end on tinyshakespeare: tied embedding/LM-head gradient, numerically stable cross-entropy, global-norm clip, and AdamW with LR warmup. On an M3 the schedule is captured once and JIT-replayed (about 40s/step), and the held-out eval loss falls steadily.

$ anneal train gpt2 --steps=100 --lr=2e-5 --plain

training gpt2: fine-tune GPT-2-small (HuggingFace weights) on tinyshakespeare
device: Metal (Apple M3)
step 1: loss=4.937841
step 24: loss=4.563632
step 48: loss=4.294182
step 64: loss=4.220015
step 96: loss=4.163856
(perplexity ~140 → ~64)
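Two pieces of that run are easy to reproduce by hand: what "numerically stable cross-entropy" means, and how the loss column turns into the perplexity figures. The sketch below is plain Go, not anneal's loss kernel; the function names are illustrative and the max-shift formulation is the standard log-sum-exp trick, assumed here rather than read from the source.

```go
package main

import (
	"fmt"
	"math"
)

// crossEntropy returns -log softmax(logits)[target], computed with the
// max-shift (log-sum-exp) trick so large logits cannot overflow exp.
func crossEntropy(logits []float64, target int) float64 {
	maxLogit := math.Inf(-1)
	for _, v := range logits {
		maxLogit = math.Max(maxLogit, v)
	}
	var sum float64
	for _, v := range logits {
		sum += math.Exp(v - maxLogit)
	}
	return maxLogit + math.Log(sum) - logits[target]
}

// perplexity converts a mean cross-entropy (in nats) to perplexity.
func perplexity(loss float64) float64 { return math.Exp(loss) }

func main() {
	// Logits this large would overflow a naive exp(logit) in float64.
	fmt.Printf("loss = %.4f\n", crossEntropy([]float64{1000, 999, 998}, 0))
	// The first and last losses from the run.
	fmt.Printf("perplexity: %.1f -> %.1f\n", perplexity(4.937841), perplexity(4.163856))
}
```

Perplexity is just exp(loss), which is why a drop of under one nat in the loss roughly halves it.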

Fine-tune vs from-scratch, and where this fits

Fine-tuning starts from the pretrained checkpoint and adapts it; pretraining from scratch needs OpenWebText (about 40 GB), a multi-day budget, and a tuned schedule, which is out of scope. Fine-tuning is the supported training path and exercises the whole compiler under load: the autodiff pass injects gradients into the same graph as the forward, the scheduler fuses across the forward/backward boundary, and the tied embedding/LM-head accumulate one shared gradient.

The BPE encoder is pure Go (no tiktoken dependency). The safetensors loader is pure Go (no Python shim). The embedding lookup at the front of the model is the same Gather op that drives nanoGPT. Nothing changes at the compiler level between forward and fine-tune: it is the same pipeline with bigger weights and a backward pass.

Want to see the same compiler train a model from scratch? Switch the picker to nanoGPT for the training counterpart, or MLP for the fastest end-to-end demo.

MoE: a small GPT with a mixture-of-experts feedforward.

anneal train moe downloads the Tiny Shakespeare corpus (about 1 MB, cached on first run, the same corpus nanoGPT uses), constructs a small char-level GPT whose feedforward block is replaced by a mixture of experts (2 layers, 2 heads, embedding 64, block size 32, vocab about 65, four parallel expert MLPs and a learned router), trains it with Adam, then prints a Shakespeare-flavored sample. What is architecturally new versus nanoGPT: each block's single MLP becomes a set of parallel expert MLPs plus a small router that scores the experts per token; this teaching version softly weights all experts by the router scores (every expert runs on every token) rather than the hard top-k routing production models use. A load-balance auxiliary loss is added to the language-modeling loss so the router spreads tokens across experts instead of collapsing onto one. Forward and backward fuse into the same WGSL kernels through the rangeify scheduler exactly as they do for nanoGPT.

$ anneal train moe --steps=2000

fetch shakespeare.txt (~1 MiB) cached
model moe-char vocab=65 layers=2 heads=2 experts=4 dmodel=64 block=32
device webgpu · apple m3 max
step 0001 loss 4.21 (lm 4.18 + bal 0.03)
step 0500 loss 2.36
step 2000 loss 1.51
forward router + 4 expert ffns + gated combine
backward → fused WGSL kernels
sample: ROMEO: and the world that we have seen the night, and the heart of the heavens of the king.

the idea

A mixture of experts grows a model's parameter count without growing the compute spent on any one token. In a transformer block the single feedforward network becomes many parallel expert networks plus a small router that scores them; production models route each token to only its top few experts (sparse routing), so most experts stay idle per token and the cost barely moves while capacity climbs. This teaching version instead softly weights every expert by the router's scores, which is simpler to differentiate and still shows the core idea. A load-balance auxiliary loss nudges the router to use all experts, since otherwise it tends to collapse onto a favourite. The intuition is specialization: over training, different experts come to handle different kinds of tokens.

papers: Shazeer et al., "Outrageously Large Neural Networks" (2017); Switch Transformer (Fedus et al., 2021). source: examples/moe.go.
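The soft routing described above can be sketched in plain Go: a softmax over router scores, a weighted sum of every expert's output, and a balance term. The names are illustrative, and the balance loss shown is one simple choice assumed for this sketch (it is not taken from examples/moe.go, and it differs from the Switch Transformer formulation, which also uses hard token counts).

```go
package main

import (
	"fmt"
	"math"
)

// softmax turns router scores into expert weights that sum to 1.
func softmax(scores []float64) []float64 {
	maxScore := math.Inf(-1)
	for _, s := range scores {
		maxScore = math.Max(maxScore, s)
	}
	out := make([]float64, len(scores))
	var sum float64
	for i, s := range scores {
		out[i] = math.Exp(s - maxScore)
		sum += out[i]
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}

// softMix blends every expert's output by its router weight (dense routing).
func softMix(weights []float64, expertOut [][]float64) []float64 {
	mixed := make([]float64, len(expertOut[0]))
	for e, w := range weights {
		for d, v := range expertOut[e] {
			mixed[d] += w * v
		}
	}
	return mixed
}

// balanceLoss is an assumed auxiliary loss: N * sum_e mean_e^2, where mean_e
// is expert e's average router weight over the batch. It equals 1 for a
// uniform router and N for a router collapsed onto one expert.
func balanceLoss(batchWeights [][]float64) float64 {
	n := len(batchWeights[0])
	var loss float64
	for e := 0; e < n; e++ {
		var mean float64
		for _, w := range batchWeights {
			mean += w[e]
		}
		mean /= float64(len(batchWeights))
		loss += mean * mean
	}
	return float64(n) * loss
}

func main() {
	w := softmax([]float64{2.0, 0.5, 0.1, -1.0})
	experts := [][]float64{{1, 0}, {0, 1}, {1, 1}, {-1, 2}}
	fmt.Println("weights:", w)
	fmt.Println("mixed:  ", softMix(w, experts))
	fmt.Println("balance:", balanceLoss([][]float64{w}))
}
```

Because every weight is a smooth function of the router scores, gradients reach all experts. Hard top-k routing would zero most of them out.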

Curious how it compiles? Open the visualizer to step through the router, the parallel expert FFNs, and the gated-combine forward / backward / fused-kernel pipeline.

DiT: denoise CIFAR-10 with a Diffusion Transformer.

anneal train dit trains a class-conditional Diffusion Transformer on CIFAR-10 with epsilon-prediction. Each image is split into patches; every transformer block is modulated by adaLN-zero conditioning assembled from the timestep and the class label; the model predicts the noise that was added. Classifier-free-guidance dropout during training lets sampling trade fidelity against diversity. Training runs on the GPU (the diffusion backward pass does not yet realize on the pure-Go CPU interpreter). The eval loss is a noisy per-step probe over fresh random timesteps; it falls from about 1.0 toward roughly 0.2. This is a small architecture demo of the diffusion-transformer stack, not a converged image generator (the model is compute-heavy, around 12s/step at this size on an M3).

$ anneal train dit --steps=500 --batch=8 --plain

training dit: Diffusion Transformer (adaLN-zero, classifier-free guidance) on CIFAR-10
device: Metal (Apple M3)
fetch cifar10-binary (162 MiB) sha ok
step 000 loss 0.9906
step 100 loss 0.1851
step 300 loss 0.3266
step 500 loss 0.2087
dit: CFG sample mean=+0.01 var=0.25 (guidance=2.0)

the idea

adaLN-zero is what lets a deep DiT train stably: each block's conditioning projection is zero-initialised, so at step 0 every block is the exact identity and the network starts as a no-op predicting zero noise, then learns how much each block should deviate. The unit tests prove this property directly: a freshly built block returns its input unchanged.

paper: Peebles & Xie, "Scalable Diffusion Models with Transformers" (2022). source: tensor/nn/dit.go, examples/dit.go.
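The identity-at-initialisation property is easy to check on a toy block. The sketch below is plain Go and deliberately tiny: a scalar conditioning value, scalar projection weights, and a stand-in inner function instead of attention or an MLP. All names are hypothetical and nothing here is taken from tensor/nn/dit.go.

```go
package main

import (
	"fmt"
	"math"
)

// adaLNBlock is a toy adaLN-zero residual block. The three weights form the
// conditioning projection; the zero value of the struct is the
// zero-initialised block.
type adaLNBlock struct {
	wShift, wScale, wGate float64
}

// inner stands in for the attention or MLP branch; any function works here.
func inner(h float64) float64 { return 2*h + 1 }

// forward computes x + gate * inner(norm(x)*(1+scale) + shift), where shift,
// scale, and gate are projected from the conditioning value c.
func (b adaLNBlock) forward(x []float64, c float64) []float64 {
	shift, scale, gate := b.wShift*c, b.wScale*c, b.wGate*c
	var mean, variance float64
	for _, v := range x {
		mean += v
	}
	mean /= float64(len(x))
	for _, v := range x {
		variance += (v - mean) * (v - mean)
	}
	variance /= float64(len(x))
	out := make([]float64, len(x))
	for i, v := range x {
		h := (v-mean)/math.Sqrt(variance+1e-6)*(1+scale) + shift
		out[i] = v + gate*inner(h) // gate == 0 leaves v untouched
	}
	return out
}

func main() {
	x := []float64{0.3, -1.5, 2.2}
	fmt.Println("fresh block:  ", adaLNBlock{}.forward(x, 0.8))
	fmt.Println("trained gate: ", adaLNBlock{wGate: 1}.forward(x, 0.8))
}
```

With a zero gate the residual branch contributes nothing, so the fresh block returns its input exactly, whatever the inner branch computes.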

MeanFlow: a one-step generative model, trained with forward-mode autodiff.

anneal train meanflow trains a one-step generative model on CIFAR-10. Instead of many denoising steps, the network learns the average velocity u(z, r, t) over a flow-matching interval, so a sample is drawn in a single step: x0 = z1 - u(z1, 0, 1). The training target comes from the MeanFlow identity u = v - (t - r)·du/dt, whose total time-derivative du/dt anneal computes as one forward-mode JVP (tensor.JVP) of the DiT backbone: this is the first time forward-mode autodiff drives a real training objective in the compiler. The reverse-mode pass still trains the weights; the JVP only builds the (stop-grad) target. The backbone is the same adaLN-zero DiT from the DiT lesson, plus an in-graph continuous-time embedding so a tangent can flow through t. Training runs on the GPU only. This is a short architecture demo of the MeanFlow objective and the JVP plumbing, not a converged image generator: because the target depends on the model's own du/dt, naive training blows up (a known MeanFlow bootstrap effect). An adaptive-L2 loss weighting plus LR warmup keep it bounded and stable; the raw metric stays noisy and full convergence is a longer run.

$ anneal train meanflow --steps=40 --plain

training meanflow: MeanFlow one-step generative model (average-velocity, forward-mode JVP) on CIFAR-10
device: Metal (Apple M3)
fetch cifar10-binary (162 MiB) sha ok
step 000 loss 1.7971
step 010 loss 1.8728
step 020 loss 2.2100
step 030 loss 1.8616
step 040 loss 2.9199
meanflow: one-step sample mean=+0.1471 var=+1.3248 (guidance=2.0)
// real 40-step demo. adaptive-L2 weighting + LR warmup keep the loss bounded
// (naive training blows up to ~20); the bootstrap metric stays noisy, full convergence needs a longer run.

the idea

A flow-matching model learns an instantaneous velocity v and integrates it over many steps. MeanFlow learns the average velocity over an interval directly, so one step suffices. The catch is the target: the identity u = v - (t - r)·du/dt needs du/dt, the total time-derivative of the network itself. Reverse-mode autodiff gives gradients of a scalar loss; this needs a directional derivative of the network's output, which is exactly what a forward-mode JVP computes. anneal seeds tangents (v, 0, 1) on (z, r, t) and reads du/dt out of one tensor.JVP pass, then trains the weights with the usual reverse-mode pass against that stop-grad target.

paper: Geng et al., "Mean Flows for One-step Generative Modeling" (2025). source: examples/meanflow.go, tensor/nn/dit.go.
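A forward-mode JVP is small enough to write by hand with dual numbers. The sketch below is plain Go with a toy scalar function standing in for the network; it does not use tensor.JVP, and the type and function names are assumptions made for the illustration. It seeds the tangents (v, 0, 1) on (z, r, t) exactly as described above and builds the target from the identity u = v - (t - r)·du/dt.

```go
package main

import (
	"fmt"
	"math"
)

// dual carries a value and its tangent (directional derivative) together.
type dual struct{ val, tan float64 }

func add(a, b dual) dual { return dual{a.val + b.val, a.tan + b.tan} }
func mul(a, b dual) dual { return dual{a.val * b.val, a.tan*b.val + a.val*b.tan} }
func sin(a dual) dual    { return dual{math.Sin(a.val), math.Cos(a.val) * a.tan} }

// u is a toy stand-in for the network: u(z, r, t) = sin(z*t) + r*t.
func u(z, r, t dual) dual { return add(sin(mul(z, t)), mul(r, t)) }

// meanFlowTarget seeds tangents (v, 0, 1) on (z, r, t), reads du/dt from one
// forward-mode pass, and returns the target v - (t-r)*du/dt alongside du/dt.
func meanFlowTarget(z, r, t, v float64) (target, dudt float64) {
	out := u(dual{z, v}, dual{r, 0}, dual{t, 1})
	return v - (t-r)*out.tan, out.tan
}

func main() {
	z, r, t, v := 0.7, 0.2, 0.9, -0.5
	target, dudt := meanFlowTarget(z, r, t, v)
	// Closed form for this toy u: cos(z*t)*(v*t + z) + r.
	want := math.Cos(z*t)*(v*t+z) + r
	fmt.Printf("du/dt (jvp) = %.6f, closed form = %.6f, target = %.6f\n", dudt, want, target)
}
```

One pass through u yields both the value and the total derivative. In training the target would then be treated as a constant (stop-grad) for the reverse-mode pass.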

checkpoint

You just trained a neural network end to end on your own GPU, from one Go binary, and watched the compiler report how many forward and backward UOps it fused into how many kernels. Next: open up those kernels and see what the compiler actually wrote.

05. look inside

Look inside the compiler that made the kernels.

This is the part most ML stacks hide. Now that your model has run, look at how anneal built it. anneal explain shows the rewrite rules that fire for a single op; anneal kernels shows the final WGSL with fusion boundaries annotated; and the visualizer steps through every stage of the pipeline that produced them. The exact ops worth inspecting depend on which model you ran, so the commands below are tailored to your pick.

$ anneal explain matmul

forward: dot-product reduction along the K axis.
backward: two transposed matmuls (one per input).
rules fired: matmul.lift-permute, matmul.fold-zero

$ anneal kernels mlp

kernel 0/3 fwd.linear+relu // fused: matmul + add + relu
kernel 1/3 bwd.linear // fused: dY @ W^T + grad seed
kernel 2/3 sgd.step // fused: p -= lr * g (one per param)

$ anneal explain conv2d

forward: im2col + matmul, fused under one output range.
backward: two transposed matmuls (input grad, weight grad).
rules fired: conv2d.im2col, shrink.merge, permute.fold

$ anneal kernels conv

kernel 0/4 fwd.conv+relu+shrink // fused: shrink absorbed as index math
kernel 1/4 fwd.flatten+linear // fused: reshape is index arithmetic
kernel 2/4 bwd.linear+conv // fused across seam
kernel 3/4 sgd.step // fused: p -= lr * g per param

$ anneal explain gather

forward: indexed load over the embedding table.
backward: deterministic sort + segment-sum into the index axis.
rules fired: gather.fold-identity, gather.lift-permute

$ anneal kernels nanogpt

kernel 0/7 attn.qkv // fused: matmul + add + reshape
kernel 3/7 mlp.gelu // fused: matmul + gelu + dropout
kernel 6/7 loss.ce // fused: log_softmax + nll + grad seed

$ anneal explain gather

forward: indexed load over the tied embedding table.
backward: deterministic sort + segment-sum into the index axis.
rules fired: gather.fold-identity, gather.lift-permute

$ anneal kernels llama

kernel rmsnorm // fused: square + mean + rsqrt + scale
kernel attn.gqa // fused: qkv proj + rope rotate + grouped attention
kernel mlp.swiglu // fused: gate proj + silu + up proj + mul

$ anneal explain softmax

forward: max-shift + exp + reduce-sum + divide; one fused reduce, non-causal.
backward: softmax times (grad minus weighted-sum), folded into the same reduce.
rules fired: softmax.fuse-max-shift, exp.lift-mul

$ anneal kernels bert

kernel 0/N embed.lookup // gather over wte + positional add
kernel K/N attn.qkv+score // fused: qkv proj + scaled dot + softmax (no mask)
kernel N-1 mlm.head+loss // fused: layernorm + lm-head matmul + masked cross-entropy

$ anneal explain reshape

forward: range rewrite; no data movement, no kernel.
backward: the inverse range rewrite, also free.
rules fired: reshape.fold-noop, reshape.lift-permute

$ anneal kernels vit

kernel 0/N patch.proj // fused: reshape+permute+reshape absorbed into linear
kernel 3/N attn.qkv+score // fused: qkv proj + scaled dot + softmax
kernel N-1 head.logits // fused: layernorm + mean-pool + final matmul

$ anneal explain pad

forward: range-arithmetic shift; only the data plane gets read, surrounding zeros emit from the index predicate.
backward: pad.adjoint folds to a shrink, the col2im dual of im2col emerges automatically.
rules fired: pad.fold-noop, conv.im2col.coalesce

$ anneal kernels resnet9

kernel 0/N prep.conv // fused: im2col + matmul + BN affine + ReLU
kernel K/N res1.add // fused: skip-add + ReLU into a single epilogue
kernel N-1 head.linear // fused: maxpool 4x4 + flatten + final matmul

$ anneal explain softmax

forward: max-shift + exp + reduce-sum + divide; one fused reduce.
forward-only: no backward path used by gpt2 sample (inference).
rules fired: softmax.fuse-max-shift, exp.lift-mul

$ anneal kernels gpt2

kernel 0/N embed.lookup // gather over wte + positional add
kernel 7/N attn.qkv+score // fused: qkv proj + scaled dot + mask
kernel N-1 lm_head.logits // fused: layernorm + final matmul

$ anneal explain softmax

forward: the router is a small linear, then a softmax over the experts.
backward: gradient flows into every expert, each weighted by its router score.
rules fired: softmax.fuse-max-shift, matmul.lift-permute

$ anneal kernels moe

kernel 0/N router.softmax // fused: linear + softmax over experts
kernel K/N experts.ffn // fused: parallel expert matmuls + gelu
kernel N-1 combine+balance // fused: gate-weighted sum + load-balance aux loss

$ anneal explain reshape

forward: patchify is reshape + permute + reshape; pure index math, no copy.
backward: the inverse movement, also free; unpatchify is its dual.

$ anneal kernels dit

kernel 0/N patch.embed // fused: reshape+permute absorbed into linear
kernel K/N block.adaln+attn // fused: norm + scale/shift + qkv + scaled dot
kernel N-1 head+unpatchify // fused: final modulate + linear, then index gather

$ anneal explain sin

forward: the time embed is in-graph (Mul, Sin, Pad, Add), so a tangent flows through t.
jvp: du/dt is one forward-mode pass; d(sin x) = cos(x)·dx rides the same DAG.

$ anneal kernels meanflow

kernel 0/N time.embed+cond // fused: sin/cos embed + t/r/class projections
kernel J/N jvp.du_dt // forward-mode tangents through the DiT backbone
kernel N-1 mse(u, target) // fused: u minus stop-grad (v - (t-r)·du/dt), squared

vs PyTorch

There is no FX graph capture, no torch.compile backend layer, no extra trip through Triton. The CLI is already inside the compiler.

UOps, rangeify, and the .upat DSL in one sentence each

UOps: every operation, forward and backward, is a single immutable, interned, arena-allocated IR node (uop/).

Rangeify: movement ops (reshape, permute, expand, pad, shrink, flip) become index arithmetic, never copies; the scheduler indexes every kernel by its output range (schedule/).

.upat DSL: per-op pattern files compiled at build time into match functions, so the rewrite hot path is reflection-free and inspectable (rewrite/).
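The claim that movement ops are index arithmetic, never copies, can be illustrated with the classic strided-view model. This Go sketch is not anneal's scheduler, which reasons about output ranges; the view type and its methods are hypothetical and only show that permute, shrink, and flip change how a buffer is indexed, not the buffer itself.

```go
package main

import "fmt"

// view describes how to read a flat buffer: the element at index (i, j)
// lives at offset + i*strides[0] + j*strides[1]. Movement ops edit the view.
type view struct {
	shape, strides [2]int
	offset         int
}

func (v view) at(i, j int) int { return v.offset + i*v.strides[0] + j*v.strides[1] }

// permute swaps the two axes by swapping shape and stride entries.
func (v view) permute() view {
	return view{[2]int{v.shape[1], v.shape[0]}, [2]int{v.strides[1], v.strides[0]}, v.offset}
}

// shrink keeps rows [r0, r1) and columns [c0, c1) by moving the offset.
func (v view) shrink(r0, r1, c0, c1 int) view {
	return view{[2]int{r1 - r0, c1 - c0}, v.strides, v.at(r0, c0)}
}

// flipRows reverses axis 0 with a negative stride.
func (v view) flipRows() view {
	return view{v.shape, [2]int{-v.strides[0], v.strides[1]}, v.at(v.shape[0]-1, 0)}
}

func main() {
	buf := []int{0, 1, 2, 3, 4, 5} // a 2x3 row-major matrix
	base := view{shape: [2]int{2, 3}, strides: [2]int{3, 1}}
	t := base.permute()          // 3x2 transpose
	s := base.shrink(0, 2, 1, 3) // drop the first column
	f := base.flipRows()         // rows in reverse order
	fmt.Println(buf[t.at(2, 0)], buf[s.at(1, 0)], buf[f.at(0, 0)])
}
```

Every op returns a new view over the same buf. A kernel that reads through at() gets the moved data without any intermediate copy.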

checkpoint

You read the WGSL the compiler wrote for your model, with the fusion boundaries marked, and traced the rewrite rules behind a single op. That is the whole loop: write a model, train it, and inspect the exact code it became. Everything below deepens one part of it.

06. also try

The best next thing to run.

Every model in this course rides one pipeline. You have run one path through it; here is the most useful follow-up given what you just did, and a one-line command to try it.

Sample from GPT-2-small. The same compiler that just trained your MLP also drives forward inference on the HuggingFace GPT-2-small checkpoint, with next-token logits that agree with the reference implementation to f32 noise. About 550 MB of one-time downloads.

$ anneal gpt2 sample "Hello, world" --max-tokens=10 --greedy

fetch model.safetensors (548 MiB) sha ok
Hello, world. I'm sorry, but I'm

Train the char-level transformer. Switch the picker to nanoGPT for a tiny but transformer-shaped model that learns to produce Shakespeare-ish text in a few minutes.

Sample from GPT-2-small. The conv net you just trained shares its scheduler, its rangeify indexing model, and its WGSL codegen with the GPT-2 inference path. About 550 MB of one-time downloads to see it on a real production checkpoint.

$ anneal gpt2 sample "Hello, world" --max-tokens=10 --greedy

fetch model.safetensors (548 MiB) sha ok
Hello, world. I'm sorry, but I'm

Train a transformer. Switch the picker to nanoGPT for the smallest model in the project that still exercises attention and the autoregressive loop.

Try the modern decoder. Switch the picker to Llama to run the same char-level task on the 2024-era stack (RMSNorm, RoPE, grouped-query attention, SwiGLU, tied weights). Same compiler, same dataset; only the primitives change, which makes it the cleanest way to see what the modern swaps actually do.

Sample from GPT-2-small. The compiler you just used to train nanoGPT also runs the real GPT-2-small checkpoint forward. Same scheduler, same WGSL codegen, bigger weights, and logits that agree with HuggingFace to f32 noise.

$ anneal gpt2 sample "Hello, world" --max-tokens=10 --greedy

fetch model.safetensors (548 MiB) sha ok
Hello, world. I'm sorry, but I'm

Compare against nanoGPT. Switch the picker to nanoGPT to see the same char-level task on the GPT-2-style stack (LayerNorm, vanilla multi-head attention with a learned position embedding, a GELU MLP). Same compiler, same dataset; the only difference is the primitives, which makes it the cleanest way to see what the modern-decoder swaps actually change.

Sample from GPT-2-small. The compiler you just used to train Llama also runs the real GPT-2-small checkpoint forward. Same scheduler, same WGSL codegen, bigger weights, and logits that agree with HuggingFace to f32 noise.

$ anneal gpt2 sample "Hello, world" --max-tokens=10 --greedy

fetch model.safetensors (548 MiB) sha ok
Hello, world. I'm sorry, but I'm

Compare against nanoGPT. Switch the picker to nanoGPT to see the causal counterpart: the same char-level corpus and the same encoder blocks, but with a causal mask and a next-token objective instead of bidirectional attention and masked language modeling. Side by side, the mask is the whole difference.

Sample from GPT-2-small. The encoder blocks you just trained share their attention, MLP, and LayerNorm modules with the real GPT-2-small checkpoint. Same compiler, same WGSL codegen, bigger weights, and logits that agree with HuggingFace to f32 noise.

$ anneal gpt2 sample "Hello, world" --max-tokens=10 --greedy

fetch model.safetensors (548 MiB) sha ok
Hello, world. I'm sorry, but I'm

Sample from GPT-2-small. The transformer stack you just trained on images shares its attention, MLP, and LayerNorm modules with the real GPT-2-small checkpoint. Same compiler, same WGSL codegen, bigger weights, and logits that agree with HuggingFace to f32 noise.

$ anneal gpt2 sample "Hello, world" --max-tokens=10 --greedy

fetch model.safetensors (548 MiB) sha ok
Hello, world. I'm sorry, but I'm

Train a language transformer. Switch the picker to nanoGPT to see the same encoder stack (minus the patch unfold; plus a causal mask) generate Shakespeare-ish text end to end.

Sample from GPT-2-small. The transformer the compiler can build end to end shares its forward / backward fusion machinery with the conv stack you just looked at. Same compiler, same scheduler, same renderer.

$ anneal gpt2 sample "Hello, world" --max-tokens=10 --greedy

fetch model.safetensors (548 MiB) sha ok
Hello, world. I'm sorry, but I'm

Train a language transformer. Switch the picker to nanoGPT to see the same compiler train a small encoder stack on Shakespeare. Forward and backward fuse into the same WGSL kernels exactly as the per-submodule FD checks here confirm for Conv2d + BN.

Train your own tiny transformer. Switch the picker to nanoGPT to see the same compiler train a char-level model on Shakespeare end to end. Adam optimizer, embedding gather, causal attention, the whole stack. A few minutes for 2000 steps on Apple Silicon.

$ anneal train nanogpt --steps=2000

step 2000 loss 1.52
ROMEO: but soft, what light through yonder window

Or the modern decoder. Switch the picker to Llama to train the 2024-era primitive stack (RMSNorm, RoPE, GQA, SwiGLU, tied weights) from scratch on the same dataset.

Compare against nanoGPT. Switch the picker to nanoGPT to see the dense counterpart: the same char-level corpus and the same blocks, but with a single feedforward network in place of the router and parallel experts. Side by side, the mixture-of-experts swap is the whole difference.

Sample from GPT-2-small. The GPT backbone you just trained shares its attention, MLP, and LayerNorm modules with the real GPT-2-small checkpoint. Same compiler, same WGSL codegen, bigger weights, and logits that agree with HuggingFace to f32 noise.

$ anneal gpt2 sample "Hello, world" --max-tokens=10 --greedy

fetch model.safetensors (548 MiB) sha ok
Hello, world. I'm sorry, but I'm

Compare the transformer. Switch the picker to ViT: DiT reuses the same patch embedding and non-causal encoder blocks, conditioned on the diffusion timestep and class instead of producing a classification.

Compare the diffusion. Switch to diffusion, the tiny DDPM denoiser, to see the same linear-beta schedule and epsilon-prediction objective on a conv backbone instead of a transformer. Run anneal train diffusion.

Compare the backbone. Switch the picker to DiT: MeanFlow trains the same adaLN-zero Diffusion Transformer, but learns an average velocity sampled in one step instead of an epsilon prediction integrated over many denoising steps.

See the many-step version. Switch to diffusion, the tiny DDPM denoiser, to watch the iterative sampling MeanFlow collapses into a single step. Run anneal train diffusion.

07. go further

Explore the compiler interactively.

From here the most useful surfaces are the visualizer, the graph dump, and the source.

$ anneal viz # live visualizer in your browser (WASM)

$ anneal web # the local studio: 8 deep-linkable views, no telemetry

$ anneal graph mlp # dump the UOp DAG for a model

$ anneal explain add # trace rewrite rules for one op

For a static walkthrough of an example pipeline without spinning up anneal viz, see the visualizer demo. To contribute, read CONTRIBUTING.md and the architecture spec at SPEC.md. The source lives at github.com/georgebuilds/anneal.

for AI assistants

A machine-readable summary lives at llms.txt, and the full how-to-train guide for ingestion is at llms-full.txt.

08. fix-it

Common failure modes, and what to do about them.

No WebGPU adapter detected

On macOS Apple Silicon the WebGPU adapter is built in; if anneal doctor reports no adapter, your Go build likely didn't link the Metal path. Reinstall with go install on the same machine you intend to run on.

On Linux, install a Vulkan-capable driver for your GPU and re-run anneal doctor. On Windows, ensure DX12 is available. To make progress with no GPU at all, add --device=cpu to run on the pure-Go interpreter.

Asset downloads keep retrying / network is sketchy

Set ANNEAL_OFFLINE=1 to fail closed when the cache is empty, so you don't waste a retry budget. Pre-populate $ANNEAL_CACHE_DIR (defaults to $XDG_CACHE_HOME/anneal) from a machine that has network. The cache directory layout is stable.

SHA mismatch on a downloaded asset

This is fail-closed by design. The likely cause is a corrupted partial download (network was cut mid-stream). Delete the offending file in $ANNEAL_CACHE_DIR and re-run; anneal does not silently accept mismatched assets.
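For intuition, a fail-closed check of this kind looks roughly like the Go sketch below. It assumes the pinned digest is SHA-256 in hex, which the text above does not state, and the function name is hypothetical; it is not anneal's downloader.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verify returns an error unless data hashes to the pinned SHA-256 digest.
func verify(data []byte, pinnedHex string) error {
	sum := sha256.Sum256(data)
	if got := hex.EncodeToString(sum[:]); got != pinnedHex {
		return fmt.Errorf("sha mismatch: got %s, want %s", got, pinnedHex)
	}
	return nil
}

func main() {
	// SHA-256 of the three bytes "abc", used here as a stand-in pinned digest.
	const pinned = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(verify([]byte("abc"), pinned)) // intact download
	fmt.Println(verify([]byte("abd"), pinned)) // one corrupted byte fails closed
}
```

A truncated or corrupted file changes the digest, so the only safe recovery is the one described above: delete the file and fetch it again.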

Training is much slower than the docs imply

First-run compile time is dominated by WGSL pipeline creation and (optionally) BEAM autotuning. Subsequent runs hit the disk cache and start in well under a second. On Linux + wgpu, expect about 2x the latency of an M3 Mac because the Vulkan path is less tuned and the driver layer is thicker.

Linux + wgpu setup notes

You need an up-to-date Mesa or a vendor Vulkan driver. vulkaninfo should list at least one device; anneal doctor will then find it. If vulkaninfo works but anneal reports no adapter, file an issue with the full anneal doctor output.

qa questions

Frequently asked.

Grouped by topic.

install & environment

Do I need a GPU?

The GPU path needs a WebGPU adapter (Metal, Vulkan, or DX12). If you don't have one, anneal also ships a pure-Go CPU interpreter: add --device=cpu to any train or run command. It is slower, but it runs anywhere with no GPU at all, and it doubles as the value oracle the GPU path is checked against.

What if I'm on Windows?

The wgpu-native DirectX 12 path works but is the least exercised of the three platforms. If something breaks, file an issue on the GitHub tracker with anneal doctor output attached.

Why Go, not Python?

Zero-CGO is the deciding constraint: one static binary, no env management, no pip install, no driver-matched wheels. The visualizer ships as the same Go code compiled to WASM, so the live demo runs the real compiler.

nanogpt

Why char-level instead of BPE?

A vocab of about 65 keeps the embedding tiny and removes the tokenizer as a dependency. The point of this lesson is forward + backward fusion across the scheduler seam, not language modeling. BPE comes in for GPT-2 because that is the format the released weights expect.

How long does training take?

Roughly a few minutes for 2000 steps on an M3 Max; about 2x that on a typical Linux + wgpu box. Numbers vary with kernel cache warmth.

Can I change the model size?

The flags exposed today are --steps, --batch, and --seed. Architectural knobs (layers, heads, dmodel) live in the example source, not on the command line, because the goal here is a stable reference, not a hyperparam playground.

llama

How is this different from nanoGPT?

Same char-level task and dataset, different primitives. Llama uses RMSNorm instead of LayerNorm, grouped-query attention (4 query heads sharing 2 KV heads) with RoPE rotary positions instead of vanilla multi-head attention plus a learned position embedding, a SwiGLU feed-forward instead of a GELU MLP, and ties the LM-head weight to the token embedding. It is the small-LM primitive stack shared by Llama, Qwen, and Gemma, dropped onto the same compiler.

What is grouped-query attention?

Each query head no longer gets its own key/value projection. Instead query heads are split into groups, and every head in a group shares one key/value head. Here 4 query heads share 2 KV heads (group size 2), which shrinks the K/V projections while keeping full query resolution. The example uses it to exercise the broadcast-over-groups indexing in the attention kernel, not for a memory win at this size.
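The head-sharing rule reduces to one integer division. The Go sketch below assumes consecutive query heads form a group, which is the common convention but is not confirmed for this example, and both function names are illustrative.

```go
package main

import "fmt"

// kvHeadFor returns the key/value head that a query head reads from when
// queryHeads query heads are split evenly across kvHeads key/value heads.
func kvHeadFor(queryHead, queryHeads, kvHeads int) int {
	groupSize := queryHeads / kvHeads
	return queryHead / groupSize
}

// kvProjectionWidth is the output width of the K (or V) projection:
// one head-sized slice per key/value head.
func kvProjectionWidth(dModel, queryHeads, kvHeads int) int {
	headDim := dModel / queryHeads
	return headDim * kvHeads
}

func main() {
	for q := 0; q < 4; q++ {
		fmt.Printf("query head %d -> kv head %d\n", q, kvHeadFor(q, 4, 2))
	}
	fmt.Println("K/V width:", kvProjectionWidth(64, 4, 2), "of dmodel 64")
}
```

With 4 query heads and 2 KV heads at dmodel 64, each K and V projection is half as wide as it would be under vanilla multi-head attention.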

Where does position come from without a position embedding?

RoPE: the query and key vectors are rotated by a position-dependent angle inside each attention block, so relative position falls out of the dot product. There is no learned absolute position table, which is the modern default and what lets the same weights generalize across context lengths.
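
To make "relative position falls out of the dot product" concrete, here is a small sketch of the rotation. It assumes the interleaved-pair convention, where elements (2i, 2i+1) are rotated together, and a frequency base of 10000; implementations differ on both, and this is not taken from anneal's source.

```go
package main

import (
	"fmt"
	"math"
)

// rope rotates each consecutive pair (x[2i], x[2i+1]) by the angle
// pos * base^(-2i/d), where d = len(x). Assumes d is even.
func rope(x []float64, pos int, base float64) []float64 {
	d := len(x)
	out := make([]float64, d)
	for i := 0; i < d/2; i++ {
		theta := float64(pos) * math.Pow(base, -2*float64(i)/float64(d))
		sin, cos := math.Sincos(theta)
		out[2*i] = x[2*i]*cos - x[2*i+1]*sin
		out[2*i+1] = x[2*i]*sin + x[2*i+1]*cos
	}
	return out
}

// dot is the plain inner product used for the attention score.
func dot(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	q := []float64{1, 2, 3, 4}
	k := []float64{0.5, -1, 2, 0.25}
	// Both pairs of positions are 3 apart, so the scores agree up to
	// floating-point rounding even though the absolute positions differ.
	near := dot(rope(q, 5, 10000), rope(k, 2, 10000))
	far := dot(rope(q, 103, 10000), rope(k, 100, 10000))
	fmt.Printf("%.6f %.6f\n", near, far)
}
```

The score depends only on the offset between the query and key positions because rotating q by angle a and k by angle b leaves their inner product a function of a - b.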

Is this a real Llama model?

No. It is a tiny char-level demo of the Llama-style architecture trained from scratch on Tiny Shakespeare, not a pretrained or chat-tuned checkpoint. The point is to exercise the modern-decoder primitives (RMSNorm, GQA, RoPE, SwiGLU, tied embeddings) end to end through the compiler, not to produce a usable language model.

gpt-2

Can I train or fine-tune GPT-2?

Fine-tuning, yes. anneal train gpt2 fine-tunes GPT-2-small end to end on Tiny Shakespeare, starting from the real HuggingFace weights. The run covers the tied embedding/LM-head gradient, numerically stable cross-entropy, global-norm clipping, AdamW with LR warmup, and a JIT-replayed step (about 40s on an M3); the held-out eval loss falls steadily. Pretraining from scratch is a deliberate non-goal: it needs OpenWebText (about 40 GB) and a multi-day budget. Forward inference also gives you a byte-level check against HuggingFace.
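
Two of the ingredients named above are easy to show in isolation. The sketch below uses the standard log-sum-exp form of cross-entropy and the usual global-norm clipping rule on plain slices; it illustrates the techniques, not anneal's implementation, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// crossEntropy returns -log(softmax(logits)[target]). Subtracting the largest
// logit before exponentiating keeps math.Exp from overflowing.
func crossEntropy(logits []float64, target int) float64 {
	maxLogit := math.Inf(-1)
	for _, v := range logits {
		maxLogit = math.Max(maxLogit, v)
	}
	sum := 0.0
	for _, v := range logits {
		sum += math.Exp(v - maxLogit)
	}
	return maxLogit + math.Log(sum) - logits[target]
}

// clipGlobalNorm rescales all gradient tensors in place so that their joint
// L2 norm is at most maxNorm. It returns the norm measured before clipping.
func clipGlobalNorm(grads [][]float64, maxNorm float64) float64 {
	sq := 0.0
	for _, g := range grads {
		for _, v := range g {
			sq += v * v
		}
	}
	norm := math.Sqrt(sq)
	if norm > maxNorm {
		scale := maxNorm / norm
		for _, g := range grads {
			for i := range g {
				g[i] *= scale
			}
		}
	}
	return norm
}

func main() {
	// A naive exp(1000) would overflow to +Inf; the stable form does not.
	fmt.Println(crossEntropy([]float64{1000, 1000}, 0))

	grads := [][]float64{{3}, {4}}
	fmt.Println(clipGlobalNorm(grads, 1), grads)
}
```

Clipping by the global norm keeps the direction of the full gradient intact, which per-tensor clipping would not.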

Where do the GPT-2 weights come from?

From HuggingFace's gpt2 model repository: model.safetensors, vocab.json, merges.txt. Each asset is SHA-pinned in code and atomically downloaded into the cache.
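
One common way to get a pinned, atomic download is to stream into a temporary file next to the destination, hash while writing, and rename only after the digest matches. The sketch below shows that pattern with the Go standard library; it is an assumption about the general technique, not anneal's downloader, and installPinned, digestHex, and demo are hypothetical names. An in-memory reader stands in for the HTTP response body.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// installPinned streams src into a temp file in dst's directory, checks the
// SHA-256 against wantHex, and only then renames the file into place. The
// temp file lives beside dst so the rename stays on one filesystem.
func installPinned(src io.Reader, dst, wantHex string) error {
	tmp, err := os.CreateTemp(filepath.Dir(dst), ".partial-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless once the rename has succeeded

	h := sha256.New()
	if _, err := io.Copy(io.MultiWriter(tmp, h), src); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantHex {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantHex)
	}
	return os.Rename(tmp.Name(), dst)
}

// digestHex returns the hex SHA-256 of s.
func digestHex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// demo installs payload into a fresh temp directory pinned to wantHex and
// reports whether the destination file exists afterwards.
func demo(payload, wantHex string) (bool, error) {
	dir, err := os.MkdirTemp("", "cache-sketch")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	dst := filepath.Join(dir, "asset.bin")
	err = installPinned(strings.NewReader(payload), dst, wantHex)
	_, statErr := os.Stat(dst)
	return statErr == nil, err
}

func main() {
	pin := digestHex("hello")
	fmt.Println(demo("hello", pin))    // digest matches: file is installed
	fmt.Println(demo("tampered", pin)) // digest differs: nothing is installed
}
```

With this shape, a reader of the cache never sees a partial or unverified file: the destination path either does not exist or holds bytes that matched the pin.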

Why 548 MB? Can I get smaller?

That is FP32 weight storage, at 4 bytes per value. anneal supports f16 and bf16 storage with f32 accumulation; loading the checkpoint at half precision stores 2 bytes per value and roughly halves it. The architecture and weight layout are pinned to GPT-2-small to keep the compiler-correctness check tight against a single reference.

compiler concepts

What is a UOp?

A UOp is one immutable IR node: an op kind, a dtype, an output shape, and pointers to its source UOps. They live in an arena, they are interned, and they are never mutated. Structural equality is identity equality, so deep comparison disappears from the hot path. See uop/.
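
Interning is what makes "structural equality is identity equality" true. The sketch below is a toy hash-consing arena under stated assumptions, not the code in uop/: the Arena type, the Intern signature, and the Arg field (used here only to tell leaf nodes apart) are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// UOp is a toy IR node: op kind, dtype, output shape, and source pointers.
// Arg is an extra field assumed for this sketch so leaves can differ.
type UOp struct {
	id    int
	Op    string
	DType string
	Shape []int
	Arg   string
	Srcs  []*UOp
}

// Arena owns every node and an index from structural key to node.
type Arena struct {
	nodes []*UOp
	index map[string]*UOp
}

func NewArena() *Arena { return &Arena{index: map[string]*UOp{}} }

// Intern returns the unique node with this structure, creating it on first
// use. Sources are already interned, so their ids identify them fully.
func (a *Arena) Intern(op, dtype string, shape []int, arg string, srcs ...*UOp) *UOp {
	var key strings.Builder
	fmt.Fprintf(&key, "%s|%s|%v|%q", op, dtype, shape, arg)
	for _, s := range srcs {
		fmt.Fprintf(&key, "|#%d", s.id)
	}
	if n, ok := a.index[key.String()]; ok {
		return n
	}
	n := &UOp{
		id: len(a.nodes), Op: op, DType: dtype, Arg: arg,
		Shape: append([]int(nil), shape...), // copy so callers cannot mutate it
		Srcs:  append([]*UOp(nil), srcs...),
	}
	a.nodes = append(a.nodes, n)
	a.index[key.String()] = n
	return n
}

// Len reports how many distinct nodes the arena holds.
func (a *Arena) Len() int { return len(a.nodes) }

func main() {
	a := NewArena()
	x := a.Intern("PARAM", "f32", []int{4}, "x")
	y := a.Intern("PARAM", "f32", []int{4}, "y")
	s1 := a.Intern("ADD", "f32", []int{4}, "", x, y)
	s2 := a.Intern("ADD", "f32", []int{4}, "", x, y)
	fmt.Println(s1 == s2, a.Len()) // true 3
}
```

Because building the same expression twice yields the same pointer, comparing two subgraphs is a single pointer comparison, however deep they are.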

What is rangeify?

A scheduling model where every kernel is indexed by its output range, and movement ops (reshape, permute, expand, pad, shrink, flip) are absorbed as index arithmetic rather than data copies. The only thing that materializes a buffer is the scheduler when it decides to. See SPEC.md for the full design.
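
The claim that movement ops are index arithmetic can be seen with a plain strided view. This sketch is a deliberate simplification: it uses shape, strides, and an offset rather than the range-based model SPEC.md describes, and the View type and its methods are hypothetical. It only shows that permute, flip, and shrink can change how an index is computed without touching the buffer.

```go
package main

import "fmt"

// View describes how a logical index maps into a flat buffer.
type View struct {
	Shape   []int
	Strides []int
	Offset  int
}

// Contiguous builds a row-major view over a fresh shape.
func Contiguous(shape ...int) View {
	strides := make([]int, len(shape))
	step := 1
	for i := len(shape) - 1; i >= 0; i-- {
		strides[i] = step
		step *= shape[i]
	}
	return View{Shape: append([]int(nil), shape...), Strides: strides}
}

func (v View) clone() View {
	return View{
		Shape:   append([]int(nil), v.Shape...),
		Strides: append([]int(nil), v.Strides...),
		Offset:  v.Offset,
	}
}

// Permute reorders axes by reordering shape and strides.
func (v View) Permute(axes ...int) View {
	out := v.clone()
	for i, ax := range axes {
		out.Shape[i], out.Strides[i] = v.Shape[ax], v.Strides[ax]
	}
	return out
}

// Flip reverses one axis: start at its last element and step backwards.
func (v View) Flip(axis int) View {
	out := v.clone()
	out.Offset += (v.Shape[axis] - 1) * v.Strides[axis]
	out.Strides[axis] = -v.Strides[axis]
	return out
}

// Shrink keeps the half-open range [lo, hi) along one axis.
func (v View) Shrink(axis, lo, hi int) View {
	out := v.clone()
	out.Offset += lo * v.Strides[axis]
	out.Shape[axis] = hi - lo
	return out
}

// Index converts a logical index into a position in the flat buffer.
func (v View) Index(idx ...int) int {
	pos := v.Offset
	for i, x := range idx {
		pos += x * v.Strides[i]
	}
	return pos
}

func main() {
	buf := []float64{0, 1, 2, 3, 4, 5} // a 2x3 matrix, never copied below
	base := Contiguous(2, 3)
	fmt.Println(buf[base.Permute(1, 0).Index(2, 1)]) // transposed [2][1] = 5
	fmt.Println(buf[base.Flip(1).Index(0, 0)])       // row 0 reversed, first = 2
	fmt.Println(buf[base.Shrink(1, 1, 3).Index(1, 0)]) // columns 1..2, [1][0] = 4
}
```

Every method returns a new View and leaves buf alone, which is the property that lets a scheduler fold these ops into the consuming kernel's load expression.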

How is anneal's gather different from PyTorch's?

Forward is the same: an indexed load. The backward is what differs: anneal lowers to deterministic sort + segment-sum into the index axis, so gradients are stable and cheap. tinygrad emulates gather via one-hot × W, which is correct but pays an O(V×N) tensor multiply on the backward; anneal avoids that.
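
The two backward strategies can be compared on the CPU. The sketch below is an illustration of the idea, not anneal's kernels: gatherBackwardSorted orders the gathered positions by index with a stable sort and sums each run of equal indices, while gatherBackwardOneHot does the dense one-hot product for reference. Function names and the row-major [][]float64 layout are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

func zeros(rows, cols int) [][]float64 {
	out := make([][]float64, rows)
	for i := range out {
		out[i] = make([]float64, cols)
	}
	return out
}

// gatherBackwardSorted scatters grad rows back to the table rows they were
// gathered from. A stable sort groups equal indices into segments, so each
// output row is summed in one fixed order.
func gatherBackwardSorted(indices []int, grad [][]float64, vocab, dim int) [][]float64 {
	order := make([]int, len(indices))
	for i := range order {
		order[i] = i
	}
	sort.SliceStable(order, func(a, b int) bool {
		return indices[order[a]] < indices[order[b]]
	})
	out := zeros(vocab, dim)
	for start := 0; start < len(order); {
		row := indices[order[start]]
		end := start
		for end < len(order) && indices[order[end]] == row {
			for j := 0; j < dim; j++ {
				out[row][j] += grad[order[end]][j]
			}
			end++
		}
		start = end // jump to the next segment
	}
	return out
}

// gatherBackwardOneHot is the dense reference: it visits all V*N
// (row, position) pairs even though most one-hot entries are zero.
func gatherBackwardOneHot(indices []int, grad [][]float64, vocab, dim int) [][]float64 {
	out := zeros(vocab, dim)
	for v := 0; v < vocab; v++ {
		for n, idx := range indices {
			oneHot := 0.0
			if idx == v {
				oneHot = 1.0
			}
			for j := 0; j < dim; j++ {
				out[v][j] += oneHot * grad[n][j]
			}
		}
	}
	return out
}

func main() {
	indices := []int{2, 0, 2}
	grad := [][]float64{{1, 1}, {2, 2}, {3, 3}}
	fmt.Println(gatherBackwardSorted(indices, grad, 3, 2))
	fmt.Println(gatherBackwardOneHot(indices, grad, 3, 2))
}
```

Both functions produce the same gradient; the sorted version touches each gathered row once, while the dense version's cost grows with the vocabulary size as well.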