Why Transformers Fit STARKs

Part 1

Transformers fit STARKs because decode already looks like a trace: one stateful step after another.

A note on the title before we start: this is not a claim that transformers do not fit SNARKs.

They do. Modern SNARK-style zkML systems can prove serious transformer workloads.

The narrower claim here is about natural shape: transformer decode is trace-like, and STARKs are trace-first.

The bridge

At first, transformers and STARKs look like they belong to different worlds.

A transformer is usually introduced as a model that predicts the next token.

You give it some text. It reads that text. It produces a probability distribution over what should come next. Then, once one token is chosen, the process repeats.

Transformer view This is the familiar product picture repeated over time: current text goes in, the same transformer runs, and one more token comes out.

A STARK is usually introduced as a proof system for computation.

You describe a computation as a sequence of states. You define a small set of rules for how one state is allowed to turn into the next. Then you prove that the whole sequence obeyed those rules, from the beginning of the computation to the end.

In STARK language, the written version of that run is called a trace.

The name sounds technical, but the idea is plain. Instead of only looking at the final answer, you leave a trail of how the computation got there.

At one moment, the computation is in some state. A small rule is applied. Then the computation is in the next state.

Write that down once and you have one row.

Write it down again and again, and you have a trace.

The important part is the handoff. The state produced by one row has to be the state used by the next row. When those handoffs line up, the rows are not just separate snapshots. They describe one continuous run.

A concrete example is Cairo.

Cairo lets you write normal-looking programs, but it was designed so those programs can be turned into STARK-provable execution. A tiny Cairo function might look like this:

fn add_one(x: u32) -> u32 {
    x + 1
}

As a programmer, you read that as one simple function.

If x is 4, the function should return 5.

But the proof system does not prove that sentence directly.

First, the Cairo compiler turns the program into lower-level Cairo VM instructions. Then the Cairo VM runs those instructions. During that run, the VM keeps a small machine state, including registers such as the program counter and frame/allocation pointers. The execution trace is the row-by-row record of those VM states changing as the program runs.

So a simplified trace story for add_one(4) is not:

“the function returned 5, trust me.”

It is closer to:

“the Cairo VM started in this state, followed the valid next instruction, moved to the next state, followed the next valid instruction, and eventually ended with the output corresponding to 5.”

That is the important shift.

The function is what the developer writes.

The trace is what the prover uses to show the computation really followed the machine rules.

Trace view This is the proof-system picture: the run is many rows, and each row checks that one state is allowed to become the next.

Most people first meet these ideas in different contexts, with different language, for different purposes.

So the bridge between them can sound strange at first.

But the bridge is simpler than it seems.

The core idea of this essay is this:

The short version

Transformer decode already has the kind of shape that a trace-based proof system likes: repeated local work, carried context, and clear boundaries between one step and the next.

Here is the bridge in the smallest possible form.

On the left, we have the way people usually talk about transformer generation.

The model has some text so far. It predicts one more token. Then the new token is appended to the text, and the process repeats.

So the left side says:

step 0 extends The into The cat,
step 1 extends The cat into The cat sat,
step 2 extends The cat sat into The cat sat on,
and so on.

That is the decode story.

On the right, the same ordered process is written in proof-system language.

Instead of saying “the text got one token longer,” we say “the computation moved from one state to the next.”

So the right side says:

row 0 checks the move from state 0 to state 1,
row 1 checks the move from state 1 to state 2,
row 2 checks the move from state 2 to state 3,
and so on.

That is the trace story.

The key point is not that the words are identical.

The key point is that the shape is the same: one ordered step after another, with each step beginning where the previous step ended.

Bridge view The bridge is row-by-row: each decode step can be recorded as a checkable trace row.

This is why the diagram says “same order, different notation.”

It may help to zoom into just one line.

Take step 2. In the decode view, the text before the step is The cat sat. The transformer runs once and produces the next visible token, on. After the step, the visible text has become The cat sat on.

The trace view records the same event with different names. The computation begins from state 2, applies the decode rule, and ends at state 3.

The token on is what the reader sees. The state transition is what the proof system checks.

Bridge close-upOne line of the bridge, zoomed in: the decode step is the human view, and the trace row is the prover view of the same transition.

The decode view talks about tokens because that is what the user sees.

The trace view talks about states because that is what the proof system needs to check.

But both views are describing the same kind of object: a chain of small transitions.

Each decode step has a before and an after.

Each trace row has a before and an after.

If we can choose the state carefully, then the token-by-token process can be written as a row-by-row execution record.

That sentence is easy to say and easy to misunderstand.

So we will take it slowly.

To see why it might be true, we first need to answer some basic questions clearly:

What does “decode” actually mean?
What is a “state” in this setting?
What is being repeated from one token to the next?
What is a trace?
Why would anyone want to write that process down as a trace?
Why does this point toward STARKs?

Once those questions are clear, the connection stops sounding exotic.

It starts sounding natural.

Start with the wrong picture

The wrong picture is this:

A transformer is one giant black box.

A proof system is some wrapper around it.

The only interesting question is whether the wrapper is powerful enough to handle the black box.

That picture is too blunt.

It hides the exact thing that matters most: the internal shape of the computation.

This is why I like Percepta’s “Can LLMs Be Computers?” as a starting point.

It shifts the reader away from “language model as mysterious oracle” and toward “language model as a process that can execute structured work.”

That is the right first move.

But for this essay, the question is narrower:

if a transformer is a process, what proof surface does that process naturally expose?

A proof system is not neutral about shape.

Some proof systems are more naturally expressed when a computation is flattened into one large algebraic relation.

Others feel more natural when the computation can be written as:

a sequence of states,
a repeated local rule,
and explicit conditions on how the process begins and ends.

So before asking “can we prove a transformer?”, it helps to ask a better question:

What kind of computation is a transformer performing during decode?

That shift is small in wording, but very large in effect.

Once you stop treating the model as one black box and start looking at the structure of the process, new things become visible:

repetition,
continuity,
carried context,
and meaningful boundaries between one step and the next.

Those are exactly the kinds of things a trace-based proof system wants to see.

What decode actually means

Let us start with the most important word in this essay:

decode.

In this setting, decode means the token-by-token process by which a transformer generates text.

Suppose the model has already read the prompt:

The cat sat on the

The model now has to decide what comes next.

It might assign high probability to words like:

mat,
floor,
chair.

Suppose we pick mat.

Now the text has changed:

The cat sat on the mat

The model is run again, now on this longer prefix, to decide what should come after mat.

That process repeats:

take the current prefix,
produce the next-token distribution,
choose or sample a token,
append it,
and repeat.

That repeated token-by-token process is decode.

If you only look at the output side, decode seems very simple:

read current text,
predict next token,
repeat.

That is not wrong.

But it hides the deeper structure.

The model is not producing a set of unrelated guesses.

Each step is meaningful because earlier steps have already created a context from which the next step can begin.

That is where state enters the story.

What a state means here

When people hear “state,” they sometimes imagine something mysterious.

They imagine an invisible ghost inside the model, or some vague hidden memory.

That is not what we mean here.

In this essay, a state means:

the information that must be available at one step so the next step is well-defined.

That is all.

A state is not magic.

It is a boundary between one step and the next.

Depending on how you choose to describe execution, that state can include things like:

the current token position,
the internal representation of the current prefix,
cached information derived from previous tokens,
and any commitments or summaries that tell the next step where it is allowed to start.

The exact engineering representation can vary.

The conceptual point does not:

the next token step does not begin from nowhere.

It begins from a state that previous work created.

That is what makes decode a process rather than a collection of isolated predictions.

A useful mental model is this:

a decode step takes in a state,
does one more piece of local work,
emits a visible result,
and produces the next state.

That is already the beginning of a state-transition view.

Decode shape A useful proof boundary treats a decode step as state in, local work, and state out. The token is only the visible part.

State does not have to mean KV cache

There is one technical detail worth getting right early.

People often talk about KV cache when they talk about transformer decode.

KV cache is the saved key-value information from previous tokens. It is usually introduced as a speed optimization. Instead of recomputing everything from the beginning every time, the runtime carries useful work forward.

That is true.

But this essay is not saying that KV cache is the only possible state.

A transformer can also be run by recomputing the whole prefix at each step.

So the claim is not:

KV cache is mathematically required.

The claim is narrower and more useful:

If we choose a proof boundary that carries context forward, decode becomes very naturally described as a transition from one frontier to the next.

That boundary is useful because it matches the way efficient decode is actually run, and it gives the proof system a clear object to check.

Decode is better pictured as a loop

People often talk about transformer inference as if the model were one big function:

text goes in, token comes out.

That picture is fine for many purposes.

For this discussion, it is not the best picture.

A better picture is a loop.

At each step in the loop:

there is a current boundary,
there is carried information from earlier steps,
the model applies the same family of local computations again,
it produces the next token or logits,
and it updates the state for the next step.

Now the model is not one monolithic event.

It is a repeated evolution of state.

That shift matters.

Once you see decode this way, words like transition, continuity, and boundary no longer sound like proof jargon imported from somewhere else.

They sound like descriptions of what the model was already doing.

Two ways to tell the same story

The familiar product story says:

the model reads a prompt and predicts the next token.

The execution story says:

the model starts from a current state, applies one more local transition, produces the next token, and writes forward the next state.

Both stories describe the same system.

Only the second one makes the computational structure fully visible.

That is the one we need.

Decode shape Seen as a loop, decode becomes a repeated transition family instead of one giant black-box event.

The token is the visible part

This is one of the most important shifts.

People say:

the model predicts the next token.

True.

But that sentence points at the most visible output, not at the whole computational object.

A more faithful description is:

the step starts from a current state,
it performs one more piece of repeated local work,
it produces logits and maybe a chosen token,
and it updates the state so the next step can begin.

The step is not just:

token appears.

It is:

state comes in, work happens, state goes out.

If you are trying to reason about proving, that second picture is much more important than the first.

Proof systems care deeply about:

what came in,
what rule was applied,
what came out,
and whether the next step really began from the right place.

The token is the thing a user sees.

The state transition is the thing the proof system wants to talk about.

A moving frontier is the right picture

A helpful visual metaphor is a moving frontier.

At any moment during decode, there is a frontier:

a current position in the sequence,
a current state of the execution,
and a current amount of carried context.

When one more token is generated, that frontier moves forward.

So instead of imagining decode as many independent guesses, imagine it as one process that keeps extending its own frontier.

That picture makes continuity visible.

If the computation were just a set of unrelated predictions, a trace would feel artificial.

But decode is not unrelated predictions.

The current step depends on a carried frontier.

The next step depends on the state the current step leaves behind.

That is exactly the kind of continuity a trace is good at recording.

Decode shape The frontier picture makes carried context explicit: later steps inherit a larger surface from earlier steps.

Repetition is the second half of the story

Continuity alone is not enough.

The second property that matters is repetition.

At each decode step, the exact values change:

the current token position changes,
the active context grows,
the precise numbers inside the computation change.

But the kind of work being done is recognizably the same.

That is crucial.

If every step were fundamentally different, then a row-by-row description would not feel very natural.

The whole point of a trace is that a repeated local rule keeps being applied across many states.

Transformer decode has that flavor.

It is not perfectly uniform in every low-level sense.

But it is regular enough that the family resemblance from one step to the next is real and useful.

So by this point we already have the two ingredients that matter most:

continuity across steps,
and repeated local work.

Those two properties make the next question much less strange:

what if we wrote this process down as a trace?

We are not there yet.

First we need to look at what the repeated local work actually contains.

Decode shape A trace is convincing only if the rows connect. Local correctness without continuity is not a coherent execution.

Attention as repeated work over carried context

At this point, a reader can reasonably ask:

Fine, decode is stateful and repeated. But what is the local work actually made of?

This is where attention matters.

Attention is often introduced in a way that is good enough for practical intuition, but not good enough for structural intuition.

People are told:

queries,
keys,
values,
attention scores,
softmax,
weighted mixing.

That is all true.

But it often lands as a branded mechanism: attention as one named magic block.

For this essay, we want something simpler and more structural.

We want to ask:

What repeated work is attention actually doing?

A good first answer is this:

the current step prepares a useful view of itself,
it compares that view against carried context,
it turns those comparisons into weights,
and it uses those weights to update the next state.

That is still simplified.

But it is much more useful than saying “the model attends.”

It reveals structure.

The current step is not doing magic.

It is doing repeated work over explicit carried context.

That is exactly what we needed to find.

First: form useful views

Attention does not operate on raw state in one undifferentiated lump.

It first turns the current information into forms that can play different roles.

The familiar names are query, key, and value.

You do not need to hold onto the jargon too tightly.

The important thing is this:

the computation reorganizes state into structured views so later comparison and update steps can happen in a disciplined way.

That already gives us a decomposable story.

The block is not one indivisible action.

It is a sequence of smaller jobs.

That matters because proof systems care about decomposition.

A computation that naturally breaks into repeated local substeps is very different from a computation that only makes sense as one fused opaque relation.

AttentionThe first move is not mysterious: the current state is reorganized into query, key, and value views.

Second: compare against carried context

Now the current step has to decide what earlier context matters.

That can be described very simply:

the current query is compared against carried keys,
those comparisons produce scores,
and those scores measure how strongly the current step should attend to different parts of the carried context.

The exact formula is not the main point here.

The structural point is:

the current step is doing repeated comparison work against an explicit memory surface.

That memory surface is not metaphorical.

It is the carried context that earlier steps left behind.

The black-box view says:

the model somehow uses context.

The execution view says:

the current step consults explicit carried context using repeated comparison rules.

That second description is much closer to something you could imagine writing down row by row.

AttentionThe current query is compared against carried keys. This is where previous context becomes explicit.

Third: normalize and mix

After the scores are produced, they are normalized.

AttentionRaw scores are turned into weights. This is one of the places where proving pressure appears.

Then the normalized scores are used to mix carried values into an updated representation.

At a high level, this means:

use the right parts of context.

At a more structural level, it means:

compute a weighted summary over carried state and fold it into the next state.

That second phrasing is the one we want.

It makes the attention step look like what it is from an execution perspective:

start from current state plus carried context,
build structured intermediates,
compare them,
normalize them,
mix carried values,
produce the next state.

That is a sequence.

And sequences can be recorded.

AttentionThe weighted values are folded back into the next state, so the loop can continue.

Where proving pressure appears

This is also where the proving discussion gets interesting.

If transformer computation were only dense linear algebra, the structural story would still matter, but it would be flatter.

Attention and transformer blocks include operations that create more distinctive proving pressure:

normalization,
nonlinear activation functions,
range and quantization behavior,
table-like operations,
and repeated reads from carried context.

This does not mean proving becomes impossible.

It means the workload has personality.

Some parts look like regular arithmetic.

Some parts look like lookup or table work.

Some parts ask for careful boundary choices.

And notice where those parts live.

They live inside the repeated stateful process, not outside it.

That matters.

The interesting proving friction is part of the same loop we have already been following.

From repeated decode to trace

At this point, we have done the most important conceptual work.

We have stopped looking at a transformer as one giant black box.

We have stopped describing decode as a sequence of isolated token guesses.

We have started describing it as a process with three visible properties:

it moves forward step by step,
it carries state from one step to the next,
and it keeps reusing the same family of local computations.

Once those three properties are in view, a natural question appears:

How should we write such a process down?

One answer is: as a trace.

That word can sound more technical than it really is.

A trace is not exotic notation.

It is not a trick invented to make proofs sound complicated.

A trace is simply a careful record of a computation as it unfolds over time.

If you prefer, you can think of it as a disciplined execution notebook.

At each moment, you record the state that entered a step.

You record enough of the local computation to check that the step was valid.

And you record the state that leaves the step.

Then you do the same thing again.

That is all a trace is at a high level.

Once decode is already being viewed as repeated state transition, writing the computation this way starts to feel natural rather than artificial.

TraceInstead of one hidden computation, the run is written down one step at a time.

Why a trace fits decode

Suppose someone told you:

This model generated these tokens, and it did so by correctly evolving its internal computation one step at a time.

How would you want that claim presented?

One possibility would be to compress the whole story into one giant final relation and ask you to trust that all the intermediate consistency is hidden somewhere inside it.

Another possibility would be to say:

here is the state before step 0,
here is what step 0 was allowed to do,
here is the state after step 0,
here is the state before step 1,
here is what step 1 was allowed to do,
here is the state after step 1,
and so on.

That second style is exactly what a trace gives you.

It turns continuity into something visible.

Instead of only asking whether the final answer looks right, it asks whether the process moved from one legitimate state to another legitimate state, over and over again, without breaking the rules.

That is why the trace matters here.

Not because all proof systems must use one.

But because the computational object we have described, repeated decode with carried state and repeated local work, is naturally expressed as an ordered sequence of states.

What a trace row actually is

This is where many readers lose the thread if we go too fast.

So let us ask the simple question:

What is one row of the trace supposed to contain?

The answer is not everything.

A useful trace row is not a giant dump of all hidden model internals.

It is a carefully chosen record of the information needed to support the next check.

At a conceptual level, a row wants to do three jobs:

tell us what state entered the local step,
expose enough of the local computation to check that the step was valid,
tell us what state came out and will be handed to the next step.

That is the core pattern:

state in, local work, state out.

TraceA useful row is not everything inside the model. It records the boundary, enough local work, and the next boundary.

The exact engineering granularity can vary.

A row does not have to mean one human-visible token.

Real proving systems may use many rows for one token step, or group operations in a different way.

But that does not change the conceptual point.

A trace row is where the computation stops being a vague event and becomes an explicit object.

That is why row design matters.

It is not clerical.

It is the place where the computational object becomes concrete.

A beginner’s view of one traced decode step

Imagine the model is at decode step t.

At this moment:

there is a current token frontier,
there is carried context from earlier tokens,
and the current step is about to run the next local block of work.

A trace row for that step wants to capture three boundaries:

the incoming boundary: what state was available at the start,
the local transition: what family of work was applied,
the outgoing boundary: what state was produced for the next step.

That means the row is not the token.

The token may be part of the visible output.

But the row is really about the transition.

This is the same shift we already made earlier.

The token is the visible part.

The state transition is the object the proof system wants to check.

Why continuity matters as much as the local step

A row by itself is not enough.

If all you did was verify that each row looked locally reasonable, you would still be missing something essential.

You would not yet know whether the rows formed one coherent execution.

That is where continuity enters.

The output state of one step must really be the input state of the next.

That sounds obvious in plain language, but it is one of the most important ideas in the whole story.

A valid execution is not just:

many locally valid steps.

It is:

many locally valid steps,
connected by genuine continuity.

This is where the phrase carried state stops being conceptual and becomes enforceable.

It is one thing to say:

state is carried forward.

It is another thing to say:

the state claimed to leave row t is exactly the state that row t + 1 claims to begin from.

That is a much sharper statement.

It is exactly the kind of statement a trace makes possible.

TraceThe continuity links turn separate local checks into one coherent execution.

The trace is not just a pretty drawing

At first glance, a trace can look like a different presentation of the same computation.

In one sense, that is true.

The computation itself has not changed just because we chose to write it down differently.

But something important has changed.

The trace makes certain things explicit that would otherwise remain hidden inside one large object.

In the trace view, you can point directly to:

where the execution begins,
what state is present at each step,
what local rule is being applied,
how continuity is maintained,
and what boundary the whole computation is claiming to reach.

That is not cosmetic.

It changes what a proof system can naturally talk about.

The trace does not invent continuity.

It exposes continuity.

And because decode already depends on continuity, that is exactly what we want.

Why STARK-style constraints fit this shape

Once the trace is in view, the next question becomes much easier to ask.

What does a STARK-style proof system want from a computation?

At a high level, it wants three things:

a sequence of states,
local rules explaining valid evolution,
and explicit conditions at the boundaries.

That should now sound familiar.

It is almost exactly the language we have already been using to describe decode.

This is why the structural-fit argument is not mainly about branding or benchmark tribalism.

It is about constraint shape.

A trace-based STARK-style system naturally asks questions like:

did each step evolve according to the right local rule?
did the whole run begin from the right state?
did it end at the state it claims to have reached?

Those are exactly the questions we care about once decode has been described as a repeated stateful process.

Local transition constraints

Let us make local rule more concrete without becoming too formal.

A local transition constraint is a disciplined way of saying:

if this is the state entering the step, and the local step is supposed to follow this transition family, then the next state must look like that.

That is all.

The implementation details can be sophisticated.

But the conceptual object is simple.

A local transition rule checks that one step legitimately follows from the previous one.

In our running story, that means:

the step begins from a valid carried boundary,
the repeated local kernels are applied in the allowed way,
and the next boundary is the one the step was entitled to produce.

This is why repeated structure matters so much.

If every step were fundamentally different, local transition rules would be much less natural.

But transformer decode keeps reusing the same kind of step with different concrete values.

That is exactly the setting in which local transition reasoning becomes believable.

STARK constraintsThe middle of the trace asks the same question again and again: did this row evolve by the allowed rule?

Boundary constraints

A repeated transition story still needs a beginning and an end.

Where does the execution start?

What initial boundary is the first step allowed to assume?

What final state or output boundary is the whole computation claiming to have reached?

Those are boundary questions.

They are not imported from outside the workload.

They are already present in the ordinary execution story:

start from this prompt and this frontier,
run the decode loop,
end with this output and this resulting frontier.

Boundary constraints are the proof-language version of that same story.

They matter because they define what is actually being certified.

Without explicit boundaries, the statement of the proof becomes muddy.

With explicit boundaries, the proof object can remain narrow and meaningful:

this execution started here,
followed these transition rules,
and ended there.

That is cleaner than saying “prove the whole model” as an undifferentiated slogan.

STARK constraintsThe edges define the public claim: this run started here and ended there.

This first essay only established the shape.

The next question is where the proving pressure actually appears once that shape is visible.

Continue to Part 2: Where Transformer Proof Pressure Appears.