Why Compile-Time Architecture Matters for Inference

Most inference engines today share a dirty secret: they make their most important decisions at runtime, under time pressure, with incomplete information. Which kernel should be launched for this matmul? Which quantization scheme is accurate enough for this layer? Should these operations be fused or left separate? The answers are deferred to the critical path because the engine doesn't know what hardware it will run on until it runs.

Tribunus Compute inverts this. Every such decision is frozen at compile time, before a single token is generated. The key insight is that, for a given model and deployment target, inference is a deterministic state machine: the shapes are known, the memory budget is fixed, and the target backend is identified. There is no reason to guess.

The 6-Check Admission Pipeline

When a compute image is compiled, each candidate kernel and configuration passes through six gates:

Type check — Does the kernel support the requested dtype and memory layout?
Shape check — Do the static shapes fit within the kernel's contract?
Mutation check — Are in-place or aliased writes valid for this backend lane?
Numerical check — Is the candidate within tolerance of the golden reference?
Performance check — Does it meet the latency and throughput floor?
Cache check — Is the autotune cache key consistent with the target deployment?

Only candidates that clear all six checks are admitted into the compute image. Every admission is recorded (the kernel, its parameters, and the numerical oracle verdict) alongside the golden FP32 reference from Apple Silicon. The result is a frozen evidence graph showing that every admitted computation met its numerical tolerance and its performance floor before it reached production.
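
The six gates above can be sketched as an ordered list of predicates. This is a minimal illustration, not the engine's actual implementation: the Candidate and Target fields, the CHECKS table, and the admit function are all hypothetical names chosen for the example, and each real check is likely far richer than a single comparison.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Candidate:
    # Hypothetical schema: the article does not define one.
    name: str
    dtype: str
    shape: Tuple[int, ...]
    writes_in_place: bool
    max_abs_error: float  # measured against the golden reference
    latency_ms: float
    cache_key: str


@dataclass(frozen=True)
class Target:
    # Hypothetical description of the deployment the image is compiled for.
    supported_dtypes: FrozenSet[str]
    max_elements: int
    allows_in_place: bool
    tolerance: float
    latency_budget_ms: float
    cache_key: str


Check = Tuple[str, Callable[[Candidate, Target], bool]]

# One predicate per gate, in the order the article lists them.
CHECKS: List[Check] = [
    ("type", lambda c, t: c.dtype in t.supported_dtypes),
    ("shape", lambda c, t: math.prod(c.shape) <= t.max_elements),
    ("mutation", lambda c, t: t.allows_in_place or not c.writes_in_place),
    ("numerical", lambda c, t: c.max_abs_error <= t.tolerance),
    ("performance", lambda c, t: c.latency_ms <= t.latency_budget_ms),
    ("cache", lambda c, t: c.cache_key == t.cache_key),
]


def admit(candidate: Candidate, target: Target) -> Tuple[bool, Dict[str, bool]]:
    """Run every check and return (admitted, per-check record)."""
    record = {name: check(candidate, target) for name, check in CHECKS}
    return all(record.values()), record


target = Target(frozenset({"fp16", "fp32"}), 1 << 20, False, 1e-3, 2.0, "target-a/v1")
good = Candidate("matmul_tiled", "fp16", (512, 512), False, 4e-4, 1.2, "target-a/v1")
bad = Candidate("matmul_inplace", "fp16", (512, 512), True, 4e-4, 0.9, "target-a/v1")

print(admit(good, target))
print(admit(bad, target))
```

Running every check instead of stopping at the first failure is a deliberate choice in this sketch: the full per-check record is what would be stored as evidence, so a rejected candidate carries a complete explanation of why it was rejected.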

The Numerical Oracle

At the heart of the admission pipeline is the numerical oracle: Apple Silicon FP32 arithmetic used as the golden reference. Every candidate kernel on every backend — CUDA, Metal, Vulkan, ROCm, oneDNN, Level Zero, TT-NN — is validated against this single source of truth. If a kernel on a Tenstorrent device produces output that diverges from the oracle beyond a configurable tolerance, it is rejected at compile time. No silent accuracy regressions, no "well, it's close enough" at runtime.
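
A tolerance comparison of this kind can be sketched in a few lines. The function name within_tolerance and the rtol and atol defaults are assumptions made for illustration; the article does not say how the tolerance is defined. The acceptance rule used here, |got - want| <= atol + rtol * |want|, is the same form NumPy uses in numpy.isclose, with the reference value on the right-hand side.

```python
import math
from typing import Sequence


def within_tolerance(
    candidate: Sequence[float],
    reference: Sequence[float],
    rtol: float = 1e-3,
    atol: float = 1e-5,
) -> bool:
    """Return True if every candidate value is close to the reference value."""
    if len(candidate) != len(reference):
        return False
    for got, want in zip(candidate, reference):
        if got == want:
            continue  # exact match, including matching infinities
        if not (math.isfinite(got) and math.isfinite(want)):
            return False  # NaN or mismatched infinity is never "close"
        if abs(got - want) > atol + rtol * abs(want):
            return False
    return True


reference = [1.0, 2.0, 3.0]            # stand-in for golden FP32 output
print(within_tolerance([1.0005, 1.999, 3.001], reference))  # accepted
print(within_tolerance([1.0, 2.0, 3.1], reference))         # rejected
```

The explicit non-finite check matters: a comparison against NaN is always false, so without it a kernel that produced NaN would slip through the `>` test and be accepted.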

Multi-Backend from Day One

Tribunus Compute was designed as a multi-backend engine from the first commit. The Backend Realization Contract (ADR 0037) defines a formal interface that every hardware backend must satisfy: kernel registration, memory layout conventions, stream synchronization, and the admission pipeline contract. Today the engine supports seven backends:

Metal — Apple Silicon (M-series, ANE)
CUDA — NVIDIA GPUs
Vulkan — cross-platform GPU compute
ROCm — AMD GPUs
oneDNN — Intel CPUs and GPUs
Level Zero — Intel discrete GPUs
TT-NN — Tenstorrent accelerators

Each backend implements the same contract, and a new backend is added by implementing it as well. The admission pipeline, the oracle validation, and the compile-time planner are all backend-agnostic.
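
One way to picture a shared backend contract is as an abstract base class that refuses to instantiate incomplete implementations. The sketch below is hypothetical throughout: the method names, the REGISTRY dictionary, and the register helper are invented for illustration and are not taken from ADR 0037. It covers kernel registration, memory layout, and stream synchronization, and leaves out the admission pipeline part of the contract.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class Backend(ABC):
    """Hypothetical contract; names are illustrative, not from ADR 0037."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def registered_kernels(self) -> List[str]: ...

    @abstractmethod
    def memory_layout(self) -> str: ...

    @abstractmethod
    def synchronize(self) -> None: ...


class ReferenceCpuBackend(Backend):
    """A toy backend that satisfies every part of the sketch contract."""

    @property
    def name(self) -> str:
        return "reference-cpu"

    def registered_kernels(self) -> List[str]:
        return ["matmul_naive"]

    def memory_layout(self) -> str:
        return "row-major"

    def synchronize(self) -> None:
        pass  # nothing asynchronous to wait for in this toy backend


class IncompleteBackend(Backend):
    """Implements only the name, so it cannot be instantiated."""

    @property
    def name(self) -> str:
        return "incomplete"


REGISTRY: Dict[str, Backend] = {}


def register(backend_cls: Type[Backend]) -> Backend:
    """Instantiate and register a backend; raises TypeError if incomplete."""
    backend = backend_cls()
    if backend.name in REGISTRY:
        raise ValueError(f"backend already registered: {backend.name}")
    REGISTRY[backend.name] = backend
    return backend


register(ReferenceCpuBackend)
print(sorted(REGISTRY))
```

Calling register(IncompleteBackend) raises TypeError because Python will not instantiate a class that still has abstract methods, which is the property that lets backend-agnostic code rely on every registered backend offering the full interface.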

The Compute Image

The output of compilation is a compute image: a self-contained artifact that bundles frozen evidence (the oracle-signed kernel graph) with the compiled code. A compute image is deployable, reproducible, and auditable. No runtime compilation, no JIT surprises, no "it works on my machine." The same image that passes the 6-check pipeline on the developer workstation runs identically in production.
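
The reproducible and auditable properties can be illustrated with content hashing. This sketch is an assumption, not the actual image format: build_image, verify_image, and the manifest keys are hypothetical, and plain SHA-256 digests stand in for whatever signing scheme the oracle really uses.

```python
import hashlib
import json
from typing import Any, Dict


def build_image(code: bytes, evidence: Dict[str, Any]) -> Dict[str, Any]:
    """Bundle compiled code with its evidence under one content-derived ID."""
    code_sha = hashlib.sha256(code).hexdigest()
    # Canonical JSON (sorted keys, fixed separators) so equal evidence
    # always hashes to the same value regardless of key order.
    evidence_json = json.dumps(evidence, sort_keys=True, separators=(",", ":"))
    evidence_sha = hashlib.sha256(evidence_json.encode("utf-8")).hexdigest()
    image_id = hashlib.sha256((code_sha + evidence_sha).encode("ascii")).hexdigest()
    return {
        "code_sha256": code_sha,
        "evidence": evidence,
        "evidence_sha256": evidence_sha,
        "image_id": image_id,
    }


def verify_image(image: Dict[str, Any], code: bytes) -> bool:
    """Recompute the ID from the code and evidence and compare it."""
    return build_image(code, image["evidence"])["image_id"] == image["image_id"]


image = build_image(b"\x00compiled-kernel", {"kernel": "matmul_tiled", "verdict": "pass"})
print(verify_image(image, b"\x00compiled-kernel"))           # unchanged code
print(verify_image(image, b"\x00compiled-kernel-tampered"))  # altered code
```

Because the ID is derived only from the code bytes and the canonicalized evidence, two builds with identical inputs produce the same ID, and any change to either part is detectable by recomputing it.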
