anneal

Name: anneal
Author: georgebuilds

Anneal is a machine learning compiler written in Go.

It compiles neural networks into GPU code, with reverse- and forward-mode autodiff built in.

Autodiff is a compiler pass over the same graph as the forward computation, so the scheduler can fuse forward and backward work into the same GPU kernels.

go get github.com/georgebuilds/anneal

Start the course, or see what you'll build.

learn

Learn how ML actually compiles, by running one.

anneal doubles as a hands-on course. Pick a model, train it end to end on your own GPU with no Python and no CUDA, then open up the compiler and read the GPU kernels it wrote. You learn the network and the machine underneath it at the same time, because here they are the same thing.


how it works

One graph, fully fused.

Autodiff and optimization are first-class compiler passes over one immutable UOp IR, not runtime hooks bolted onto a separate tape. Per-op gradient rules are dispatched through a drift-checked ruleset, producing a uniform, fusible backward graph. The scheduler sees the whole computation, forward and backward together, and fuses across that boundary as a natural consequence.

Diagram legend: teal is the forward pass (solid), ember is the backward pass (dashed), gold is a fused kernel.

forward pass

An immutable UOp DAG

Every operation is a node in a shared arena, interned, never mutated. Rewrites produce new nodes. Structural equality is identity equality; no deep comparisons in hot paths.
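As a rough illustration of interning, here is a minimal hash-consing sketch. It is not anneal's arena: the `Node`, `Arena`, and `Intern` names are hypothetical, and a real IR would carry dtypes, shapes, and variadic sources. The point it shows is that once children are interned, a map keyed on (op, value, child pointers) makes structural equality collapse to pointer equality.

```go
package main

import "fmt"

// Node is an immutable IR node: fields are set once in Intern and never changed.
// (Hypothetical type for illustration only.)
type Node struct {
	Op   string
	Val  int64
	L, R *Node
}

// key identifies a node by its op, payload, and the identity of its children.
type key struct {
	op   string
	val  int64
	l, r *Node
}

// Arena owns every node and guarantees one node per distinct structure.
type Arena struct {
	table map[key]*Node
}

func NewArena() *Arena { return &Arena{table: make(map[key]*Node)} }

// Intern returns the existing node for this structure, or creates it.
func (a *Arena) Intern(op string, val int64, l, r *Node) *Node {
	k := key{op, val, l, r}
	if n, ok := a.table[k]; ok {
		return n
	}
	n := &Node{Op: op, Val: val, L: l, R: r}
	a.table[k] = n
	return n
}

func (a *Arena) Len() int { return len(a.table) }

// build constructs x*2 + 1 from scratch each time it is called.
func build(a *Arena) *Node {
	x := a.Intern("var", 0, nil, nil)
	two := a.Intern("const", 2, nil, nil)
	one := a.Intern("const", 1, nil, nil)
	return a.Intern("add", 0, a.Intern("mul", 0, x, two), one)
}

func main() {
	a := NewArena()
	e1, e2 := build(a), build(a)
	// Same structure built twice yields the same pointer, and no new nodes.
	fmt.Println(e1 == e2, a.Len()) // prints: true 5
}
```

Because the key holds child pointers rather than child contents, the lookup cost does not grow with the depth of the expression.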

backward pass

Autodiff as a compiler pass

Gradient computation is a typed compiler pass that walks the forward graph and emits backward UOps via a ruleset drift-checked against curated documentation. The gradient graph lives alongside the forward graph, with per-node rule attribution visible in the visualizer, until the scheduler decides what to fuse.
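To make "autodiff as a pass" concrete, the sketch below walks a tiny expression graph in reverse topological order and emits new graph nodes for the gradients, rather than recording a runtime tape. Everything here is an assumption for illustration: the `Node` type, the two hard-coded rules for add and mul, and the recursive helpers are hypothetical and far simpler than a typed ruleset over UOps.

```go
package main

import "fmt"

type Op int

const (
	Const Op = iota
	Var
	Add
	Mul
)

// Node is a hypothetical forward-graph node.
type Node struct {
	Op   Op
	Val  float64
	Name string
	Args []*Node
}

func konst(v float64) *Node    { return &Node{Op: Const, Val: v} }
func variable(n string) *Node  { return &Node{Op: Var, Name: n} }
func add(a, b *Node) *Node     { return &Node{Op: Add, Args: []*Node{a, b}} }
func mul(a, b *Node) *Node     { return &Node{Op: Mul, Args: []*Node{a, b}} }

// topo lists nodes reachable from root with inputs before their consumers.
func topo(root *Node) []*Node {
	seen := map[*Node]bool{}
	var order []*Node
	var visit func(*Node)
	visit = func(n *Node) {
		if seen[n] {
			return
		}
		seen[n] = true
		for _, a := range n.Args {
			visit(a)
		}
		order = append(order, n)
	}
	visit(root)
	return order
}

// Grad returns, for each node n, a graph node computing d(root)/d(n).
func Grad(root *Node) map[*Node]*Node {
	grads := map[*Node]*Node{root: konst(1)}
	acc := func(n, g *Node) {
		if old, ok := grads[n]; ok {
			grads[n] = add(old, g) // sum contributions from multiple consumers
		} else {
			grads[n] = g
		}
	}
	order := topo(root)
	for i := len(order) - 1; i >= 0; i-- { // consumers before inputs
		n := order[i]
		g, ok := grads[n]
		if !ok {
			continue
		}
		switch n.Op {
		case Add: // d(a+b) = da + db
			acc(n.Args[0], g)
			acc(n.Args[1], g)
		case Mul: // d(a*b) = b*da + a*db
			acc(n.Args[0], mul(g, n.Args[1]))
			acc(n.Args[1], mul(g, n.Args[0]))
		}
	}
	return grads
}

// Eval interprets a graph; a compiler would lower it to kernels instead.
func Eval(n *Node, env map[string]float64) float64 {
	switch n.Op {
	case Const:
		return n.Val
	case Var:
		return env[n.Name]
	case Add:
		return Eval(n.Args[0], env) + Eval(n.Args[1], env)
	default: // Mul
		return Eval(n.Args[0], env) * Eval(n.Args[1], env)
	}
}

func main() {
	x, y := variable("x"), variable("y")
	f := add(mul(x, y), x) // f = x*y + x
	g := Grad(f)
	env := map[string]float64{"x": 3, "y": 4}
	fmt.Println(Eval(g[x], env), Eval(g[y], env)) // prints: 5 3
}
```

The gradients come back as ordinary nodes that reference forward nodes directly, which is what lets a later scheduling stage treat both halves as one graph.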

fused kernel

Fusion across the boundary

The rangeify scheduler sees forward and backward together. It can fuse across that boundary, collapsing both into a single WGSL kernel that a scheduler working on only one side of the boundary would never produce.
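A CPU analogy may help show what fusing across the boundary buys. The sketch below is not generated code and does not model WGSL or the rangeify scheduler; the function names and the toy loss (sum of squares of x*w) are assumptions. The unfused version materializes the forward result and re-reads it in a separate backward loop, while the fused version computes loss and gradient in one pass.

```go
package main

import "fmt"

// unfused runs forward and backward as separate loops, with an
// intermediate buffer y written by one and read by the other.
func unfused(x, w []float32) (loss float32, dw []float32) {
	y := make([]float32, len(x))
	for i := range x { // forward "kernel": y = x * w
		y[i] = x[i] * w[i]
	}
	for _, v := range y { // reduction "kernel": loss = sum(y^2)
		loss += v * v
	}
	dw = make([]float32, len(x))
	for i := range x { // backward "kernel": dloss/dw = 2 * y * x
		dw[i] = 2 * y[i] * x[i]
	}
	return loss, dw
}

// fused computes the same values in a single loop with no
// intermediate buffer: y lives only in a local variable.
func fused(x, w []float32) (loss float32, dw []float32) {
	dw = make([]float32, len(x))
	for i := range x {
		y := x[i] * w[i]
		loss += y * y
		dw[i] = 2 * y * x[i]
	}
	return loss, dw
}

func main() {
	x := []float32{1, 2, 3}
	w := []float32{0.5, 1, 2}
	l1, g1 := unfused(x, w)
	l2, g2 := fused(x, w)
	fmt.Println(l1, g1) // prints: 40.25 [1 8 36]
	fmt.Println(l2, g2) // prints: 40.25 [1 8 36]
}
```

On a GPU the saving is the avoided round trip of the intermediate through memory and the avoided extra kernel launches.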

runtime

Symbolic shapes, JIT, f16/bf16/fp8 support

With symbolic shapes, a kernel compiles once and runs at any batch size. The same mechanism covers splitting and merging a symbolic axis, symbolic pad and shrink amounts, and multi-dimensional symbolic dispatch. f16 is supported, with narrowing done by IEEE 754 round-to-nearest-even (RTNE), and bf16/fp8 storage enables low-precision workflows. tensor.JIT captures the execution plan, and the scheduler is memoized on a structural key. Kernel autotuning via BEAM search finds the best optimization sequence per kernel and caches results to disk: set ANNEAL_BEAM=1 to search; the default is a zero-overhead cache lookup.
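For readers unfamiliar with RTNE, here is a generic sketch of narrowing a float32 to IEEE 754 binary16 bits with round-to-nearest, ties-to-even. It is a textbook-style implementation written for this page, not anneal's conversion code, and `f32ToF16` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// f32ToF16 narrows f to IEEE 754 binary16 bits using round-to-nearest-even.
func f32ToF16(f float32) uint16 {
	b := math.Float32bits(f)
	sign := uint16(b>>16) & 0x8000
	exp := int32(b>>23) & 0xff
	man := b & 0x7fffff

	if exp == 0xff { // Inf or NaN
		if man != 0 {
			return sign | 0x7e00 // quiet NaN
		}
		return sign | 0x7c00
	}
	e := exp - 127 + 15 // rebias from float32 to float16
	if e >= 0x1f {
		return sign | 0x7c00 // too large: overflow to infinity
	}
	if e <= 0 { // result is subnormal or zero in float16
		if e < -10 {
			return sign // below half the smallest subnormal: rounds to zero
		}
		man |= 0x800000 // restore the implicit leading 1
		shift := uint32(14 - e)
		half := man >> shift
		rem := man & ((uint32(1) << shift) - 1)
		tie := uint32(1) << (shift - 1)
		if rem > tie || (rem == tie && half&1 == 1) {
			half++ // a carry here correctly yields the smallest normal
		}
		return sign | uint16(half)
	}
	half := uint32(e)<<10 | man>>13
	rem := man & 0x1fff // the 13 bits being dropped
	if rem > 0x1000 || (rem == 0x1000 && half&1 == 1) {
		half++ // a carry into the exponent is correct, including to Inf
	}
	return sign | uint16(half)
}

func main() {
	// 1 + 1/2048 lies exactly between two f16 values; the tie goes to the even one.
	fmt.Printf("%#04x\n", f32ToF16(1.0))          // prints: 0x3c00
	fmt.Printf("%#04x\n", f32ToF16(1.0+1.0/2048)) // prints: 0x3c00
	fmt.Printf("%#04x\n", f32ToF16(1.0+3.0/2048)) // prints: 0x3c02
	fmt.Printf("%#04x\n", f32ToF16(65504))        // prints: 0x7bff
}
```

Ties-to-even avoids the systematic upward bias that round-half-up would introduce when many values are narrowed.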

rewrite engine

An iterative fixpoint driver

The rewrite engine runs to a fixpoint iteratively, not recursively, so deep graphs that exhaust the stack on the upstream recursive driver compile cleanly here. Rules are .upat patterns compiled to match functions at build time. No reflection on the hot path.
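The idea of an iterative driver can be sketched with an explicit work stack in place of the call stack. This is a toy, not the anneal engine: the `Node` type, the two hand-written rules (x + 0 → x and x * 1 → x), and the `rewrite` function are hypothetical, and real rules would be generated from patterns rather than written inline.

```go
package main

import "fmt"

// Node is a hypothetical expression node; binary ops keep the constant on the right.
type Node struct {
	Op   string // "var", "const", "add", "mul"
	Val  int
	Args []*Node
}

// rule returns a replacement for n, or nil if no rule fires.
func rule(n *Node) *Node {
	if n.Op == "add" && n.Args[1].Op == "const" && n.Args[1].Val == 0 {
		return n.Args[0] // x + 0 -> x
	}
	if n.Op == "mul" && n.Args[1].Op == "const" && n.Args[1].Val == 1 {
		return n.Args[0] // x * 1 -> x
	}
	return nil
}

// rewrite rewrites bottom-up using an explicit stack, so graph depth
// is bounded by heap memory rather than by call-stack depth.
func rewrite(root *Node) *Node {
	done := map[*Node]*Node{} // original node -> fully rewritten node
	stack := []*Node{root}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		if _, ok := done[n]; ok {
			stack = stack[:len(stack)-1]
			continue
		}
		pending := false
		for _, a := range n.Args {
			if _, ok := done[a]; !ok {
				stack = append(stack, a) // children first
				pending = true
			}
		}
		if pending {
			continue // revisit n once its children are done
		}
		stack = stack[:len(stack)-1]

		// Rebuild n only if a child changed; nodes are never mutated.
		cur, changed := n, false
		args := make([]*Node, len(n.Args))
		for i, a := range n.Args {
			args[i] = done[a]
			changed = changed || args[i] != a
		}
		if changed {
			cur = &Node{Op: n.Op, Val: n.Val, Args: args}
		}
		// Apply rules until none fires: the local fixpoint.
		for r := rule(cur); r != nil; r = rule(cur) {
			cur = r
		}
		done[n] = cur
	}
	return done[root]
}

func main() {
	x := &Node{Op: "var"}
	n := x
	for i := 0; i < 200000; i++ { // a very deep chain: ((x*1)+0)*1 ...
		op, v := "mul", 1
		if i%2 == 1 {
			op, v = "add", 0
		}
		n = &Node{Op: op, Args: []*Node{n, &Node{Op: "const", Val: v}}}
	}
	fmt.Println(rewrite(n) == x) // prints: true
}
```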

interop

Bring your own weights

Read and write safetensors and .npy/.npz bidirectionally, in pure Go. Load a checkpoint, train or run it, export it back. The whole stack is zero-CGO: no C toolchain in the build, and it runs on Metal today through WebGPU.
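To give a feel for why pure-Go checkpoint I/O is tractable, here is a minimal sketch of parsing a safetensors header with only the standard library. The file layout used (an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape, and data_offsets, then raw bytes) follows the published safetensors format. The `TensorInfo`, `ParseHeader`, and `buildFile` names are hypothetical and are not the API of anneal's tensor/safetensors package.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
	"io"
)

// TensorInfo mirrors one entry of the safetensors JSON header.
type TensorInfo struct {
	DType       string    `json:"dtype"`
	Shape       []int64   `json:"shape"`
	DataOffsets [2]uint64 `json:"data_offsets"` // [begin, end) within the data section
}

// ParseHeader reads the length-prefixed JSON header of a safetensors file.
func ParseHeader(file []byte) (map[string]TensorInfo, error) {
	r := bytes.NewReader(file)
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return nil, fmt.Errorf("read header length: %w", err)
	}
	if n > uint64(r.Len()) {
		return nil, fmt.Errorf("header length %d exceeds file size", n)
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(r, buf); err != nil {
		return nil, fmt.Errorf("read header: %w", err)
	}
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(buf, &raw); err != nil {
		return nil, fmt.Errorf("decode header: %w", err)
	}
	out := make(map[string]TensorInfo, len(raw))
	for name, msg := range raw {
		if name == "__metadata__" {
			continue // free-form string map, not a tensor
		}
		var ti TensorInfo
		if err := json.Unmarshal(msg, &ti); err != nil {
			return nil, fmt.Errorf("tensor %q: %w", name, err)
		}
		out[name] = ti
	}
	return out, nil
}

// buildFile assembles an in-memory file for the demo.
func buildFile(header string, data []byte) []byte {
	var b bytes.Buffer
	binary.Write(&b, binary.LittleEndian, uint64(len(header)))
	b.WriteString(header)
	b.Write(data)
	return b.Bytes()
}

func main() {
	hdr := `{"w":{"dtype":"F32","shape":[2,2],"data_offsets":[0,16]},"__metadata__":{"format":"pt"}}`
	infos, err := ParseHeader(buildFile(hdr, make([]byte, 16)))
	if err != nil {
		panic(err)
	}
	fmt.Println(infos["w"].DType, infos["w"].Shape, infos["w"].DataOffsets)
}
```

A full reader would also validate that each tensor's byte range matches its dtype and shape before slicing into the data section.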

onnx import

ONNX in, UOps out

onnx.Import(bytes, arena, device) parses ONNX 1.17 models via pure-Go protobuf bindings and lowers them, through about 100 op handlers, onto the same UOp arena as the rest of the compiler. Symbolic dim_param axes ride through as anneal Variables. The bit-exact gate builds each model twice on the same arena and asserts byte-equal float32 outputs; an onnxruntime ResNet-9 cross-check lands at 8.2e-08. Phase 4 conformance: 174 of 234 ONNX 1.17 node tests pass, 0 fail, and 60 are documented skips.
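The difference between a bit-exact gate and a tolerance-based cross-check can be sketched in a few lines. The helpers below are hypothetical and do not reproduce anneal's test harness. In particular, the text does not say which metric the 8.2e-08 figure uses, so `maxAbsDiff` is only one plausible choice.

```go
package main

import (
	"fmt"
	"math"
)

// bitEqual reports whether two float32 slices are identical bit for bit.
// Unlike ==, it distinguishes +0 from -0 and treats equal NaN payloads as equal.
func bitEqual(a, b []float32) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if math.Float32bits(a[i]) != math.Float32bits(b[i]) {
			return false
		}
	}
	return true
}

// maxAbsDiff returns the largest absolute elementwise difference.
// It assumes len(a) == len(b).
func maxAbsDiff(a, b []float32) float64 {
	worst := 0.0
	for i := range a {
		d := math.Abs(float64(a[i]) - float64(b[i]))
		if d > worst {
			worst = d
		}
	}
	return worst
}

func main() {
	negZero := float32(math.Copysign(0, -1))
	nan := float32(math.NaN())

	fmt.Println(bitEqual([]float32{1, 2}, []float32{1, 2})) // prints: true
	fmt.Println(bitEqual([]float32{0}, []float32{negZero})) // prints: false
	fmt.Println(bitEqual([]float32{nan}, []float32{nan}))   // prints: true
	fmt.Println(maxAbsDiff([]float32{1, 2}, []float32{1, 2.5})) // prints: 0.5
}
```

Bit equality suits a determinism check of one compiler against itself; a numeric tolerance suits comparison against a different runtime, where operation order may legitimately differ.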

anneal web

A local browser studio

anneal web serves a single-binary studio with eight deep-linkable views: visualize, kernels, explain, train, generate, history, doctor, plus the home pane. Every view that compiles runs as WASM in a Web Worker; every view that executes streams over SSE from a native handler. Drop a .onnx file on the home page to inspect topology without ever sending bytes to the server. Zero telemetry, zero accounts. WCAG 2.x AA is a binding gate.
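For context on the streaming half of that split, this is roughly what a server-sent events handler looks like with Go's standard net/http. It is a generic sketch, not anneal's handler: the `trainEvents` name, the "step" event type, and the JSON payload are all made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// sseFrame formats one server-sent event with a single-line data field.
func sseFrame(event, data string) string {
	return "event: " + event + "\ndata: " + data + "\n\n"
}

// trainEvents streams three hypothetical training-step events.
func trainEvents(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	for step := 1; step <= 3; step++ {
		select {
		case <-r.Context().Done(): // client went away
			return
		default:
		}
		fmt.Fprint(w, sseFrame("step", fmt.Sprintf(`{"step":%d}`, step)))
		flusher.Flush() // push the event to the client immediately
	}
}

// demo runs the handler against an in-memory recorder and returns the body.
func demo() string {
	rec := httptest.NewRecorder()
	trainEvents(rec, httptest.NewRequest("GET", "/events", nil))
	return rec.Body.String()
}

func main() {
	fmt.Print(demo())
}
```

In a browser, an EventSource pointed at such an endpoint receives each frame as it is flushed, which fits a long-running training loop better than polling.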

Want to walk through this in practice? Install anneal and train nanoGPT on Shakespeare.

the cli

Verbs that mirror the pipeline.

A single static binary, verb-first. Each command exposes a layer of the compiler so the CLI doubles as a teaching surface.

command	description
`anneal run <model>`	realize and execute a graph
`anneal train mlp`	training loop with the live TUI (mlp, conv, nanogpt, llama, vit, gpt2, dit, bert, moe, and more)
`anneal viz`	launch the graph visualizer in-browser (WASM)
`anneal web`	serve the local studio: eight deep-linkable views, no telemetry
`anneal graph <model>`	dump and inspect the UOp DAG
`anneal kernels <model>`	show generated WGSL with fusion boundaries annotated
`anneal explain <op>`	trace the rewrite rules that fire for one op
`anneal doctor`	WebGPU / backend environment check

The tensor/npy and tensor/safetensors packages load .npy/.npz arrays and read/write .safetensors checkpoints in pure Go. No Python dependency at runtime.

$ anneal train mlp

device webgpu · apple m3 max

step 001 loss 2.3194 acc 0.09

step 100 loss 0.8271 acc 0.74

forward 14 uops backward 11 uops → fused 3 kernels

$ anneal explain mul

symbolic x * 1 → x multiplicative identity

symbolic x * y → y * x canonicalization

design

No shortcuts.

The graph-rewrite approach to ML compilation is proven. Anneal's contribution is rigor, ergonomics, and a visualizer that shows you what the compiler is actually doing: the real compiler, compiled to WASM and running in your browser, not a mock.

One immutable IR

UOps are interned, arena-allocated, and never mutated. Every rewrite produces a new node. Structural equality is identity equality. No deep comparisons in hot paths, no GC pressure from temporary objects.

No reflection in the rewrite path

A hard invariant. Pattern matching against the IR uses typed accessors, not runtime introspection. The hot path stays predictable, verifiable, and fast.

The visualizer runs the real compiler

anneal viz compiles the frontend and rewrite engine to WASM and renders the actual UOp graph in-browser. JSON output carries per-node rule attribution. The marketing artifact and the integration test for the rewrite path are the same artifact.

Written in Go

Strict typing. gofmt clean. The WASM build of the compiler frontend is built from the same code as the CLI, so the visualizer and the trainer are always in sync, by construction.