anneal

Anneal is a machine learning compiler written in Go.

It compiles neural networks into GPU code, with reverse-mode autodiff built in.

Autodiff is a compiler pass over the same graph as the forward computation, so the scheduler can fuse forward and backward work into the same GPU kernels.

go get github.com/georgebuilds/anneal

how it works

One graph, fully fused.

Autodiff and optimization are first-class compiler passes over one immutable UOp IR, not runtime hooks bolted onto a separate tape. Per-op gradient rules are dispatched through a drift-checked ruleset, producing a uniform, fusible backward graph. The scheduler sees the whole computation, forward and backward together, and fuses across that boundary as a natural consequence.

[Diagram: forward pass (teal, solid), backward pass (ember, dashed), fused kernel (gold)]

forward pass
An immutable UOp DAG

Every operation is a node in a shared arena, interned, never mutated. Rewrites produce new nodes. Structural equality is identity equality; no deep comparisons in hot paths.
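The interning idea can be illustrated with a small hash-consing arena. This is a minimal sketch, not anneal's actual IR: the names Op, UOp, Arena, and Intern and the two-operand layout are assumptions made for the example.

```go
package main

import "fmt"

// Op is a hypothetical opcode set; it does not mirror anneal's real ops.
type Op uint8

const (
	OpConst Op = iota
	OpAdd
	OpMul
)

// UOp is an immutable node: its fields are set once inside Intern.
type UOp struct {
	Op   Op
	Arg  int64   // payload, used by OpConst
	Srcs [2]*UOp // operands; nil when unused
}

// key is comparable, so it can index a map. Operands are already interned,
// so comparing their pointers compares whole subtrees.
type key struct {
	op   Op
	arg  int64
	srcs [2]*UOp
}

// Arena owns every node and guarantees one node per distinct structure.
type Arena struct {
	nodes map[key]*UOp
}

func NewArena() *Arena { return &Arena{nodes: make(map[key]*UOp)} }

// Intern returns the existing node for this structure, or creates it.
func (a *Arena) Intern(op Op, arg int64, srcs ...*UOp) *UOp {
	var s [2]*UOp
	copy(s[:], srcs)
	k := key{op, arg, s}
	if n, ok := a.nodes[k]; ok {
		return n
	}
	n := &UOp{Op: op, Arg: arg, Srcs: s}
	a.nodes[k] = n
	return n
}

func main() {
	a := NewArena()
	x := a.Intern(OpConst, 2)
	y := a.Intern(OpConst, 3)
	e1 := a.Intern(OpMul, 0, a.Intern(OpAdd, 0, x, y), x)
	// Built independently, from a fresh request for the constant 2.
	e2 := a.Intern(OpMul, 0, a.Intern(OpAdd, 0, a.Intern(OpConst, 2), y), x)
	fmt.Println(e1 == e2, len(a.nodes)) // true 4
}
```

Because equal structures always resolve to the same pointer, an equality check is a single pointer comparison regardless of subtree size.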

backward pass
Autodiff as a compiler pass

Gradient computation is a typed compiler pass that walks the forward graph and emits backward UOps via a ruleset drift-checked against curated documentation. The gradient graph lives alongside the forward graph, with per-node rule attribution visible in the visualizer, until the scheduler decides what to fuse.
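To make "autodiff as a pass that emits new graph nodes" concrete, here is a toy reverse-mode pass over a scalar expression DAG. It is a sketch under stated assumptions: Node, Grad, and the two gradient rules (add, mul) are invented for illustration and are not anneal's ruleset or UOp types.

```go
package main

import "fmt"

type Kind int

const (
	Const Kind = iota
	Var
	Add
	Mul
)

// Node is a hypothetical expression node. Each variable is assumed to be
// one shared node, so its gradient accumulates in one place.
type Node struct {
	Kind Kind
	Val  float64
	Name string
	A, B *Node
}

func C(v float64) *Node     { return &Node{Kind: Const, Val: v} }
func V(name string) *Node   { return &Node{Kind: Var, Name: name} }
func AddN(a, b *Node) *Node { return &Node{Kind: Add, A: a, B: b} }
func MulN(a, b *Node) *Node { return &Node{Kind: Mul, A: a, B: b} }

// topo lists nodes with operands before their users (recursive for brevity).
func topo(root *Node) []*Node {
	var order []*Node
	seen := map[*Node]bool{}
	var visit func(n *Node)
	visit = func(n *Node) {
		if n == nil || seen[n] {
			return
		}
		seen[n] = true
		visit(n.A)
		visit(n.B)
		order = append(order, n)
	}
	visit(root)
	return order
}

// Grad walks the forward graph from outputs to inputs and builds, for each
// variable, a new graph that computes d(root)/d(variable).
func Grad(root *Node) map[string]*Node {
	adj := map[*Node]*Node{root: C(1)} // adjoint graph per forward node
	acc := func(n, g *Node) {
		if prev, ok := adj[n]; ok {
			g = AddN(prev, g)
		}
		adj[n] = g
	}
	grads := map[string]*Node{}
	order := topo(root)
	for i := len(order) - 1; i >= 0; i-- {
		n := order[i]
		g, ok := adj[n]
		if !ok {
			continue
		}
		switch n.Kind {
		case Add: // d(a+b) = da + db
			acc(n.A, g)
			acc(n.B, g)
		case Mul: // d(a*b) = b*da + a*db
			acc(n.A, MulN(g, n.B))
			acc(n.B, MulN(g, n.A))
		case Var:
			grads[n.Name] = g
		}
	}
	return grads
}

// Eval interprets a graph; forward and gradient graphs share the same node type.
func Eval(n *Node, env map[string]float64) float64 {
	switch n.Kind {
	case Const:
		return n.Val
	case Var:
		return env[n.Name]
	case Add:
		return Eval(n.A, env) + Eval(n.B, env)
	default:
		return Eval(n.A, env) * Eval(n.B, env)
	}
}

func main() {
	x, y := V("x"), V("y")
	f := AddN(MulN(x, y), x) // f = x*y + x
	g := Grad(f)
	env := map[string]float64{"x": 3, "y": 4}
	fmt.Println(Eval(g["x"], env), Eval(g["y"], env)) // 5 3
}
```

The gradient is ordinary graph data built from the same node type as the forward expression, which is what lets a later pass treat both uniformly.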

fused kernel
Fusion across the boundary

The rangeify scheduler sees forward and backward together. It can fuse across that boundary, collapsing both into a single WGSL kernel that a backend-aware scheduler would never produce.
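As a CPU-side analogy for what fusing across the forward/backward boundary buys, the sketch below computes loss = sum(relu(x*w)) and its gradient with respect to w twice: once as three separate loops with an intermediate buffer, once as a single loop. It is an illustration only, assuming a hand-picked elementwise example; it is not generated by anneal and says nothing about the WGSL the scheduler actually emits.

```go
package main

import "fmt"

// unfused mirrors running forward and backward as separate kernels. Each loop
// stands in for one kernel, and h is an intermediate buffer that the forward
// pass writes and the backward pass reads back.
func unfused(x, w []float32) (float32, []float32) {
	h := make([]float32, len(x))
	for i := range x { // forward kernel 1: h = x * w
		h[i] = x[i] * w[i]
	}
	var loss float32
	for i := range h { // forward kernel 2: loss = sum(relu(h))
		if h[i] > 0 {
			loss += h[i]
		}
	}
	gradW := make([]float32, len(x))
	for i := range h { // backward kernel: dloss/dw = (h > 0) * x
		if h[i] > 0 {
			gradW[i] = x[i]
		}
	}
	return loss, gradW
}

// fused does the same work in one pass; h lives in a local variable and is
// never materialized as a buffer.
func fused(x, w []float32) (float32, []float32) {
	var loss float32
	gradW := make([]float32, len(x))
	for i := range x {
		h := x[i] * w[i]
		if h > 0 {
			loss += h
			gradW[i] = x[i]
		}
	}
	return loss, gradW
}

func main() {
	x := []float32{1, -2, 3, 4}
	w := []float32{0.5, 1, -1, 2}
	l1, g1 := unfused(x, w)
	l2, g2 := fused(x, w)
	fmt.Println(l1, g1) // 8.5 [1 0 0 4]
	fmt.Println(l2, g2) // 8.5 [1 0 0 4]
}
```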

runtime
Symbolic shapes, JIT, f16/bf16 support

Symbolic shapes: a kernel compiles once and runs at any batch size, and the seam extends to splitting and merging a symbolic axis, symbolic pad/shrink amounts, and multi-dim symbolic dispatch. f16 support (narrowing uses IEEE 754 round-to-nearest-even, RTNE) and storage-only bf16 enable low-precision workflows. tensor.JIT captures the execution plan; the scheduler is memoized on a structural key. Kernel autotuning via BEAM search finds the best opt sequence per kernel and caches results to disk (set ANNEAL_BEAM=1 to search; the default is a zero-overhead cache lookup).
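For readers unfamiliar with RTNE narrowing, the following sketch converts a float32 to IEEE 754 binary16 bits with round-to-nearest, ties-to-even. It is a standalone illustration of the rounding rule; the function name F32ToF16 is an assumption and this is not anneal's conversion code.

```go
package main

import (
	"fmt"
	"math"
)

// F32ToF16 narrows a float32 to binary16 bits using round-to-nearest-even.
func F32ToF16(f float32) uint16 {
	b := math.Float32bits(f)
	sign := uint16(b>>16) & 0x8000
	exp := int32(b>>23) & 0xff
	man := b & 0x7fffff

	if exp == 0xff { // Inf or NaN
		if man != 0 {
			return sign | 0x7e00 // quiet NaN
		}
		return sign | 0x7c00
	}
	e := exp - 127 + 15 // rebias the exponent for binary16
	if e >= 0x1f {
		return sign | 0x7c00 // too large: overflow to infinity
	}
	if e <= 0 { // result is subnormal or zero
		if e < -10 {
			return sign // below half the smallest subnormal: rounds to zero
		}
		man |= 0x800000 // restore the implicit leading 1
		shift := uint32(14 - e)
		half := man >> shift
		rem := man & (1<<shift - 1)
		tie := uint32(1) << (shift - 1)
		if rem > tie || (rem == tie && half&1 == 1) {
			half++ // a carry into the smallest normal is the correct result
		}
		return sign | uint16(half)
	}
	half := uint32(e)<<10 | man>>13
	rem := man & 0x1fff // the 13 bits being dropped
	if rem > 0x1000 || (rem == 0x1000 && half&1 == 1) {
		half++ // a carry into the exponent (even up to infinity) is correct
	}
	return sign | uint16(half)
}

func main() {
	for _, v := range []float32{1, 0.1, 65504, 65520, 1 + 1.0/2048, 1 + 3.0/2048} {
		fmt.Printf("%v -> 0x%04x\n", v, F32ToF16(v))
	}
}
```

The last two inputs are exact ties between neighboring binary16 values: 1 + 1/2048 rounds down to 0x3c00 and 1 + 3/2048 rounds up to 0x3c02, because in each case the result with an even low bit wins.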

rewrite engine
An iterative fixpoint driver

The rewrite engine runs to a fixpoint iteratively, not recursively, so deep graphs that exhaust the stack on the upstream recursive driver compile cleanly here. Rules are .upat patterns compiled to match functions at build time. No reflection on the hot path.
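A small sketch of the iterative idea: a bottom-up rewrite that keeps its own worklist on the heap instead of recursing, repeated until the root stops changing. Everything here (Node, Rule, pass, Fixpoint, and the two algebraic rules written as plain Go functions) is hypothetical and stands in for the real engine and its compiled .upat patterns.

```go
package main

import "fmt"

type Node struct {
	Op   string // "var", "const", "add", "mul"
	Val  int
	Srcs []*Node
}

// Rule returns a replacement and true when it matches.
type Rule func(n *Node) (*Node, bool)

func isConst(n *Node, v int) bool { return n.Op == "const" && n.Val == v }

func mulOne(n *Node) (*Node, bool) { // x * 1 -> x
	if n.Op == "mul" && isConst(n.Srcs[1], 1) {
		return n.Srcs[0], true
	}
	return nil, false
}

func addZero(n *Node) (*Node, bool) { // x + 0 -> x
	if n.Op == "add" && isConst(n.Srcs[1], 0) {
		return n.Srcs[0], true
	}
	return nil, false
}

// pass rewrites the DAG bottom-up with an explicit stack, so depth is
// limited by heap memory, not by the call stack.
func pass(root *Node, rules []Rule) *Node {
	done := map[*Node]*Node{} // original node -> rewritten node
	stack := []*Node{root}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		if _, ok := done[n]; ok {
			stack = stack[:len(stack)-1]
			continue
		}
		pending := false
		for _, s := range n.Srcs {
			if _, ok := done[s]; !ok {
				stack = append(stack, s) // operands first; n is revisited later
				pending = true
			}
		}
		if pending {
			continue
		}
		stack = stack[:len(stack)-1]
		cur, srcs, changed := n, make([]*Node, len(n.Srcs)), false
		for i, s := range n.Srcs {
			srcs[i] = done[s]
			changed = changed || srcs[i] != s
		}
		if changed { // never mutate: build a new node
			cur = &Node{Op: n.Op, Val: n.Val, Srcs: srcs}
		}
		for fired := true; fired; {
			fired = false
			for _, r := range rules {
				if next, ok := r(cur); ok {
					cur, fired = next, true
					break
				}
			}
		}
		done[n] = cur
	}
	return done[root]
}

// Fixpoint repeats pass until the root pointer is stable.
func Fixpoint(root *Node, rules []Rule) *Node {
	for {
		next := pass(root, rules)
		if next == root {
			return root
		}
		root = next
	}
}

func main() {
	x := &Node{Op: "var"}
	one := &Node{Op: "const", Val: 1}
	e := x
	for i := 0; i < 200000; i++ { // a very deep chain of (... * 1)
		e = &Node{Op: "mul", Srcs: []*Node{e, one}}
	}
	fmt.Println(Fixpoint(e, []Rule{mulOne, addZero}) == x) // true
}
```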

interop
Bring your own weights

Read and write safetensors and .npy/.npz bidirectionally, in pure Go. Load a checkpoint, train or run it, export it back. The whole stack is zero-CGO: no C toolchain in the build, and it runs on Metal today through WebGPU.
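To show why a pure-Go reader is practical, here is a sketch that parses a safetensors header using only the standard library. It relies on the published file layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes); the names TensorInfo, ParseHeader, and Encode are assumptions for this example and are not the API of anneal's tensor/safetensors package.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
)

// TensorInfo is one header entry. DataOffsets are byte offsets [begin, end)
// relative to the start of the data section.
type TensorInfo struct {
	DType       string   `json:"dtype"`
	Shape       []int64  `json:"shape"`
	DataOffsets [2]int64 `json:"data_offsets"`
}

// ParseHeader splits a safetensors file into tensor metadata and the raw
// data section.
func ParseHeader(file []byte) (map[string]TensorInfo, []byte, error) {
	if len(file) < 8 {
		return nil, nil, errors.New("safetensors: missing 8-byte length prefix")
	}
	n := binary.LittleEndian.Uint64(file[:8])
	if n > uint64(len(file)-8) {
		return nil, nil, errors.New("safetensors: header length exceeds file size")
	}
	header, data := file[8:8+n], file[8+n:]

	raw := map[string]json.RawMessage{}
	if err := json.Unmarshal(header, &raw); err != nil {
		return nil, nil, err
	}
	out := make(map[string]TensorInfo, len(raw))
	for name, msg := range raw {
		if name == "__metadata__" { // optional free-form metadata, not a tensor
			continue
		}
		var ti TensorInfo
		if err := json.Unmarshal(msg, &ti); err != nil {
			return nil, nil, fmt.Errorf("safetensors: tensor %q: %w", name, err)
		}
		b, e := ti.DataOffsets[0], ti.DataOffsets[1]
		if b < 0 || e < b || e > int64(len(data)) {
			return nil, nil, fmt.Errorf("safetensors: tensor %q: bad offsets", name)
		}
		out[name] = ti
	}
	return out, data, nil
}

// Encode builds a file image from a JSON header and a data section. It
// exists only so the example can run without a file on disk.
func Encode(headerJSON string, data []byte) []byte {
	out := make([]byte, 8, 8+len(headerJSON)+len(data))
	binary.LittleEndian.PutUint64(out, uint64(len(headerJSON)))
	out = append(out, headerJSON...)
	return append(out, data...)
}

func main() {
	file := Encode(`{"w":{"dtype":"F32","shape":[2,2],"data_offsets":[0,16]}}`, make([]byte, 16))
	info, data, err := ParseHeader(file)
	if err != nil {
		panic(err)
	}
	w := info["w"]
	fmt.Println(w.DType, w.Shape, len(data[w.DataOffsets[0]:w.DataOffsets[1]]))
}
```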

Want to walk through this in practice? Install anneal and train nanoGPT on Shakespeare.

the cli

Verbs that mirror the pipeline.

A single static binary, verb-first. Each command exposes a layer of the compiler so the CLI doubles as a teaching surface.

anneal run <model>      realize and execute a graph
anneal train mlp        training loop with the live TUI (mlp / conv / dynmlp)
anneal viz              launch the graph visualizer in-browser (WASM)
anneal graph <model>    dump and inspect the UOp DAG
anneal kernels <model>  show generated WGSL with fusion boundaries annotated
anneal explain <op>     trace the rewrite rules that fire for one op
anneal doctor           WebGPU / backend environment check

The tensor/npy and tensor/safetensors packages load .npy/.npz arrays and read/write .safetensors checkpoints in pure Go. No Python dependency at runtime.

$ anneal train mlp
device webgpu · apple m3 max
step 001 loss 2.3194 acc 0.09
step 100 loss 0.8271 acc 0.74
forward 14 uops backward 11 uops fused 3 kernels
$ anneal explain mul
symbolic x * 1 → x multiplicative identity
symbolic x * y → y * x canonicalization

design

No shortcuts.

The graph-rewrite approach to ML compilation is proven. Anneal's contribution is rigor, ergonomics, and a visualizer that shows you what the compiler is actually doing: the real compiler, compiled to WASM and running in your browser, not a mock.

One immutable IR

UOps are interned, arena-allocated, and never mutated. Every rewrite produces a new node. Structural equality is identity equality. No deep comparisons in hot paths, no GC pressure from temporary objects.

No reflection in the rewrite path

A hard invariant. Pattern matching against the IR uses typed accessors, not runtime introspection. The hot path stays predictable, verifiable, and fast.

The visualizer runs the real compiler

anneal viz compiles the frontend and rewrite engine to WASM and renders the actual UOp graph in-browser. JSON output carries per-node rule attribution. The marketing artifact and the integration test for the rewrite path are the same artifact.

Written in Go

Strict typing. gofmt clean. The WASM build of the compiler frontend is built from the same code as the CLI, so the visualizer and the trainer are always in sync, by construction.