anneal
Anneal is a machine learning compiler written in Go.
It compiles neural networks into GPU code, with reverse-mode autodiff built in.
Autodiff is a compiler pass over the same graph as the forward computation, so the scheduler can fuse forward and backward work into the same GPU kernels.
go get github.com/georgebuilds/anneal
how it works
One graph, fully fused.
Autodiff and optimization are first-class compiler passes over one immutable UOp IR, not runtime hooks bolted onto a separate tape. Per-op gradient rules are dispatched through a drift-checked ruleset, producing a uniform, fusible backward graph. The scheduler sees the whole computation, forward and backward together, and fuses across that boundary as a natural consequence.
Every operation is a node in a shared arena, interned, never mutated. Rewrites produce new nodes. Structural equality is identity equality; no deep comparisons in hot paths.
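The interning idea can be illustrated with a small hash-consing sketch. This is not Anneal's code: `Arena`, `Node`, `Op`, and the three opcodes are hypothetical names chosen for the example. It only shows why structural equality collapses to pointer equality once every node is built through a single interning constructor.

```go
package main

import "fmt"

// Op is a hypothetical opcode enum, not Anneal's real one.
type Op uint8

const (
	OpConst Op = iota
	OpAdd
	OpMul
)

// Node is never mutated after it has been interned.
type Node struct {
	Op   Op
	Arg  int64 // payload, used by OpConst only
	L, R *Node
}

// key is the structural identity of a node. Children are already
// interned, so comparing child pointers compares whole subtrees.
type key struct {
	op   Op
	arg  int64
	l, r *Node
}

// Arena owns every node and guarantees one node per distinct structure.
type Arena struct {
	nodes map[key]*Node
}

func NewArena() *Arena { return &Arena{nodes: make(map[key]*Node)} }

// intern returns the existing node for this structure or creates it.
func (a *Arena) intern(op Op, arg int64, l, r *Node) *Node {
	k := key{op, arg, l, r}
	if n, ok := a.nodes[k]; ok {
		return n
	}
	n := &Node{Op: op, Arg: arg, L: l, R: r}
	a.nodes[k] = n
	return n
}

func (a *Arena) Const(v int64) *Node  { return a.intern(OpConst, v, nil, nil) }
func (a *Arena) Add(l, r *Node) *Node { return a.intern(OpAdd, 0, l, r) }
func (a *Arena) Mul(l, r *Node) *Node { return a.intern(OpMul, 0, l, r) }
func (a *Arena) Len() int             { return len(a.nodes) }

func main() {
	a := NewArena()
	x := a.Add(a.Const(2), a.Mul(a.Const(3), a.Const(4)))
	y := a.Add(a.Const(2), a.Mul(a.Const(3), a.Const(4)))
	// Building the same expression twice yields the same pointer,
	// and the arena holds only the five distinct nodes.
	fmt.Println(x == y, a.Len()) // true 5
}
```

Because equality is a pointer comparison, a rewrite pass can use nodes directly as map keys and never needs to walk two subtrees to compare them.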
Gradient computation is a typed compiler pass that walks the forward graph and emits backward UOps via a ruleset drift-checked against curated documentation. The gradient graph lives alongside the forward graph, with per-node rule attribution visible in the visualizer, until the scheduler decides what to fuse.
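As a rough illustration of autodiff as a graph-to-graph pass, the sketch below walks a tiny forward graph in reverse topological order and emits new graph nodes for the gradients. Everything in it is an assumption made for the example: the `Node` type, the four opcodes, and the `rules` table are hypothetical and far simpler than a real UOp IR, and the traversal is recursive only to keep the sketch short.

```go
package main

import "fmt"

type Op int

const (
	OpVar Op = iota
	OpConst
	OpAdd
	OpMul
)

// Node is a hypothetical graph node; Src holds its inputs.
type Node struct {
	Op   Op
	Name string  // OpVar only
	Val  float64 // OpConst only
	Src  []*Node
}

func Var(name string) *Node { return &Node{Op: OpVar, Name: name} }
func Const(v float64) *Node { return &Node{Op: OpConst, Val: v} }
func Add(a, b *Node) *Node  { return &Node{Op: OpAdd, Src: []*Node{a, b}} }
func Mul(a, b *Node) *Node  { return &Node{Op: OpMul, Src: []*Node{a, b}} }

// gradRule turns the upstream gradient g of node n into one gradient
// node per source of n. The results are graph nodes, not numbers.
type gradRule func(n, g *Node) []*Node

var rules = map[Op]gradRule{
	OpAdd: func(n, g *Node) []*Node { return []*Node{g, g} },
	OpMul: func(n, g *Node) []*Node {
		return []*Node{Mul(g, n.Src[1]), Mul(g, n.Src[0])}
	},
}

// topo returns the nodes reachable from root, sources before users.
func topo(root *Node) []*Node {
	var order []*Node
	seen := map[*Node]bool{}
	var visit func(*Node)
	visit = func(n *Node) {
		if seen[n] {
			return
		}
		seen[n] = true
		for _, s := range n.Src {
			visit(s)
		}
		order = append(order, n)
	}
	visit(root)
	return order
}

// Grad maps each node to a graph node computing d(root)/d(node).
func Grad(root *Node) map[*Node]*Node {
	grads := map[*Node]*Node{root: Const(1)}
	order := topo(root)
	for i := len(order) - 1; i >= 0; i-- { // users before sources
		n := order[i]
		g, ok := grads[n]
		rule, hasRule := rules[n.Op]
		if !ok || !hasRule {
			continue
		}
		for j, sg := range rule(n, g) {
			s := n.Src[j]
			if prev, seen := grads[s]; seen {
				grads[s] = Add(prev, sg) // accumulate over multiple uses
			} else {
				grads[s] = sg
			}
		}
	}
	return grads
}

// Eval interprets a graph; it exists only to check the result.
func Eval(n *Node, env map[string]float64) float64 {
	switch n.Op {
	case OpVar:
		return env[n.Name]
	case OpConst:
		return n.Val
	case OpAdd:
		return Eval(n.Src[0], env) + Eval(n.Src[1], env)
	case OpMul:
		return Eval(n.Src[0], env) * Eval(n.Src[1], env)
	}
	panic("unknown op")
}

func main() {
	x, y := Var("x"), Var("y")
	f := Add(Mul(x, y), Mul(x, x)) // f = x*y + x*x
	g := Grad(f)
	env := map[string]float64{"x": 3, "y": 4}
	// df/dx = y + 2x = 10, df/dy = x = 3
	fmt.Println(Eval(g[x], env), Eval(g[y], env)) // 10 3
}
```

The gradient nodes reference forward nodes directly (for example `Mul(g, n.Src[1])`), which is what lets a later scheduling pass treat forward and backward work as one graph.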
The rangeify scheduler sees forward and backward together. It can fuse across that boundary, collapsing both into a single WGSL kernel that a scheduler without a view of the backward graph would never produce.
Symbolic shapes: a kernel compiles once and runs at any batch size. The same mechanism covers splitting and merging a symbolic axis, symbolic pad/shrink amounts, and multi-dimensional symbolic dispatch. f16 support (narrowing uses IEEE 754 round-to-nearest-even, RTNE) and storage-only bf16 enable low-precision workflows. tensor.JIT captures the execution plan, and the scheduler is memoized on a structural key. Kernel autotuning uses beam search to find the best optimization sequence per kernel and caches the results to disk (set ANNEAL_BEAM=1 to search; the default is a zero-overhead cache lookup).
The rewrite engine runs to a fixpoint iteratively, not recursively, so deep graphs that exhaust the stack on the upstream recursive driver compile cleanly here. Rules are .upat patterns compiled to match functions at build time. No reflection on the hot path.
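A minimal sketch of an iterative fixpoint driver follows, assuming a toy IR. The `Node`, `Rule`, `pass`, and `Rewrite` names and the three algebraic rules are hypothetical; real rules would be generated from .upat patterns, which this example does not attempt to model. The point is only that an explicit worklist replaces call-stack recursion, so graph depth is bounded by heap rather than stack.

```go
package main

import "fmt"

type Op int

const (
	OpVar Op = iota
	OpConst
	OpAdd
	OpMul
)

// Node is a hypothetical immutable IR node.
type Node struct {
	Op  Op
	Val int
	Src []*Node
}

// Rule is a plain match function: it returns a replacement and true
// when it fires. It reads typed fields only, no reflection.
type Rule func(n *Node) (*Node, bool)

func isConst(n *Node, v int) bool { return n.Op == OpConst && n.Val == v }

var rules = []Rule{
	func(n *Node) (*Node, bool) { // x + 0 -> x
		if n.Op == OpAdd && isConst(n.Src[1], 0) {
			return n.Src[0], true
		}
		return nil, false
	},
	func(n *Node) (*Node, bool) { // x * 1 -> x
		if n.Op == OpMul && isConst(n.Src[1], 1) {
			return n.Src[0], true
		}
		return nil, false
	},
	func(n *Node) (*Node, bool) { // const + const -> const
		if n.Op == OpAdd && n.Src[0].Op == OpConst && n.Src[1].Op == OpConst {
			return &Node{Op: OpConst, Val: n.Src[0].Val + n.Src[1].Val}, true
		}
		return nil, false
	},
}

// pass does one bottom-up sweep using an explicit stack and reports
// whether any rule fired. Original nodes are never mutated.
func pass(root *Node, rules []Rule) (*Node, bool) {
	done := map[*Node]*Node{} // original node -> rewritten node
	fired := false
	stack := []*Node{root}
	for len(stack) > 0 {
		n := stack[len(stack)-1] // peek
		if _, ok := done[n]; ok {
			stack = stack[:len(stack)-1]
			continue
		}
		ready := true
		for _, s := range n.Src {
			if _, ok := done[s]; !ok {
				stack = append(stack, s)
				ready = false
			}
		}
		if !ready {
			continue // revisit n after its sources are done
		}
		stack = stack[:len(stack)-1]
		cur := n
		for i, s := range n.Src {
			if done[s] != s { // a source changed: build a new node
				src := make([]*Node, len(n.Src))
				for j, t := range n.Src {
					src[j] = done[t]
				}
				cur = &Node{Op: n.Op, Val: n.Val, Src: src}
				_ = i
				break
			}
		}
		for _, r := range rules {
			if out, ok := r(cur); ok {
				cur, fired = out, true
				break
			}
		}
		done[n] = cur
	}
	return done[root], fired
}

// Rewrite repeats sweeps until one completes with no rule firing.
func Rewrite(root *Node, rules []Rule) *Node {
	for {
		next, fired := pass(root, rules)
		root = next
		if !fired {
			return root
		}
	}
}

func main() {
	x := &Node{Op: OpVar}
	zero := &Node{Op: OpConst, Val: 0}
	n := x
	for i := 0; i < 200000; i++ { // a very deep chain of x + 0
		n = &Node{Op: OpAdd, Src: []*Node{n, zero}}
	}
	fmt.Println(Rewrite(n, rules) == x) // true
}
```

The driver terminates because the last sweep fires no rule anywhere, which is exactly the fixpoint condition.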
Read and write safetensors and .npy/.npz in pure Go. Load a checkpoint, train or run it, and export it back. The whole stack is zero-CGO: there is no C toolchain in the build, and it runs on Metal today through WebGPU.
the cli
Verbs that mirror the pipeline.
A single static binary, verb-first. Each command exposes a layer of the compiler so the CLI doubles as a teaching surface.
anneal run <model> | realize and execute a graph |
anneal train mlp | training loop with the live TUI (mlp / conv / dynmlp) |
anneal viz | launch the graph visualizer in-browser (WASM) |
anneal graph <model> | dump and inspect the UOp DAG |
anneal kernels <model> | show generated WGSL with fusion boundaries annotated |
anneal explain <op> | trace the rewrite rules that fire for one op |
anneal doctor | WebGPU / backend environment check |
The tensor/npy and tensor/safetensors packages load .npy/.npz arrays and read/write .safetensors checkpoints in pure Go. No Python dependency at runtime.
design
No shortcuts.
The graph-rewrite approach to ML compilation is proven. Anneal's contribution is rigor, ergonomics, and a visualizer that shows you what the compiler is actually doing: the real compiler, compiled to WASM and running in your browser, not a mock.
UOps are interned, arena-allocated, and never mutated. Every rewrite produces a new node. Structural equality is identity equality. No deep comparisons in hot paths, no GC pressure from temporary objects.
No reflection on the hot path is a hard invariant. Pattern matching against the IR uses typed accessors, not runtime introspection. The hot path stays predictable, verifiable, and fast.
anneal viz compiles the frontend and rewrite engine to WASM and renders the actual UOp graph in-browser. JSON output carries per-node rule attribution. The marketing artifact and the integration test for the rewrite path are the same artifact.
Strict typing. gofmt clean. The WASM build of the compiler frontend is built from the same code as the CLI, so the visualizer and the trainer are always in sync, by construction.