TL;DR
- Transformer decode is a repeated state transition over carried context.
- That repeated shape maps naturally to STARK-style traces.
- This does not mean STARKs automatically beat every zkML system; it means the proof-boundary question becomes concrete.
- The next questions are proof pressure and statement validity: where lookup-heavy transformer work accumulates, and what the verified artifact is allowed to mean.
- The current research evidence now lives in the next essays: proof pressure, statement validity, and typed proof boundaries.
Series part 1 of 4
A transformer decoder does not run once.
It takes a state, does local work, writes a new state, and carries context forward.
That is the bridge:
- state in,
- local transition,
- state out,
- and carried context.
Once you see that shape, the STARK connection becomes much easier to draw.
But this essay is not arguing that STARKs automatically win zkML.
The narrower claim is more useful: transformer decode exposes a repeated state-transition surface, and that makes the proof-boundary question concrete.
This is a live research series. Part 1 is the intuition layer. The next essays make the same picture more concrete: where the proof work gets heavy, what a proof is allowed to mean, and how typed boundaries keep proof artifacts from drifting.
The road is simple.
First, see transformer decode as repeated state transition.
Then ask where the proof actually gets heavy.
Then ask what a verified proof is allowed to mean.
Finally, name the typed receipt around the proof artifact.
That order matters. If the trace picture is fuzzy, the proof-boundary argument feels abstract. If the receipt problem is fuzzy, the word Tablero arrives too early.
The bridge
At first, transformers and STARKs look like they belong to different worlds.
A transformer is usually introduced as a model that predicts the next token.
You give it some text. It reads that text. It produces a probability distribution over what should come next. Then, once one token is chosen, the process repeats.
I wrote an older Spanish transformer-intuition notebook in 2021. It explains the usual attention story: query, key, value, multi-head attention, and the model-facing mechanics.
This essay starts one level lower.
It asks what proof shape that computation exposes.
A STARK is usually introduced as a proof system for computation.
You describe a computation as a sequence of states, define rules for how one state may turn into the next, and prove the sequence obeyed those rules.
In STARK language, the written version of that run is called a trace.
A trace is a table of execution. Each row records enough state to check one step of the computation.
The important part is the handoff. The state produced by one row has to be the state used by the next row. When those handoffs line up, the rows describe one continuous run.
A concrete example is Cairo. You write a normal-looking program, but the Cairo VM runs lower-level instructions and produces a row-by-row trace of machine state.
The proof is not just:
“the function returned 5, trust me.”
It is closer to:
“the machine started here, followed valid steps, and ended there.”
That is the shift we need for transformers too. The user sees tokens; the proof system wants the state transitions that produced them.
The core idea of this essay is this:
The short version
Here is the bridge in the smallest possible form: decode extends a prefix one step at a time; a trace checks a computation one row at a time.
The names differ, but the shape is the same: ordered steps, with each step beginning where the previous step ended.
Take step 2. In the decode view, the text before the step is The cat sat. The transformer runs once and produces the next visible token, on. After the step, the visible text has become The cat sat on.
The trace view records the same event with different names. The computation begins from state 2, applies the decode rule, and ends at state 3.
The token on is what the reader sees. The state transition is what the proof system checks.
If we choose the state carefully, the token-by-token process can be written as a row-by-row execution record.
Start with the wrong picture
The wrong picture is this:
A transformer is one giant black box.
A proof system is some wrapper around it.
The only interesting question is whether the wrapper is powerful enough to handle the black box.
That picture is too blunt.
It hides the exact thing that matters most: the internal shape of the computation.
This is why I like Percepta’s “Can LLMs Be Computers?” as a starting point.
Percepta asked whether transformers can be computers. Abdel’s follow-up asks whether those transformer-computers can be proven. This series asks a neighboring question: if transformer computation is trace-shaped, where should STARK-native proof boundaries live?
That framing shifts the reader away from “language model as mysterious oracle” and toward “language model as a process that can execute structured work.”
That is the right first move.
But for this essay, the question is narrower:
if a transformer is a process, what proof surface does that process naturally expose?
A proof system is not neutral about shape.
Some proof systems are more naturally expressed when a computation is flattened into one large algebraic relation.
Others feel more natural when the computation can be written as:
- a sequence of states,
- a repeated local rule,
- and explicit conditions on how the process begins and ends.
So before asking “can we prove a transformer?”, it helps to ask a better question:
What kind of computation is a transformer performing during decode?
That shift is small in wording, but very large in effect.
Once you stop treating the model as one black box and start looking at the structure of the process, new things become visible:
- repetition,
- continuity,
- carried context,
- and meaningful boundaries between one step and the next.
Those are exactly the kinds of things a trace-based proof system wants to see.
What decode actually means
Let us start with the most important word in this essay:
decode.
In this setting, decode means the token-by-token process by which a transformer generates text.
Suppose the model has already read the prompt:
The cat sat on the
The model now has to decide what comes next.
It might assign high probability to words like:
mat,floor,chair.
Suppose we pick mat.
Now the text has changed:
The cat sat on the mat
The model is run again, now on this longer prefix, to decide what should come after mat.
That process repeats:
- take the current prefix,
- produce the next-token distribution,
- choose or sample a token,
- append it,
- and repeat.
That repeated token-by-token process is decode.
If you only look at the output side, decode seems very simple:
- read current text,
- predict next token,
- repeat.
That is not wrong.
But it hides the deeper structure.
The model is not producing a set of unrelated guesses.
Each step is meaningful because earlier steps have already created a context from which the next step can begin.
That is where state enters the story.
What a state means here
In this essay, a state means:
the information that must be available at one step so the next step is well-defined.
It is a boundary between one step and the next.
Depending on how you choose to describe execution, that state can include things like:
- the current token position,
- the internal representation of the current prefix,
- cached information derived from previous tokens,
- and any commitments or summaries that tell the next step where it is allowed to start.
The next token step does not begin from nowhere.
It begins from a state that previous work created.
That is what makes decode a process rather than a collection of isolated predictions.
A useful mental model is this:
- a decode step takes in a state,
- does one more piece of local work,
- emits a visible result,
- and produces the next state.
That is already the beginning of a state-transition view.
State does not have to mean KV cache
People often talk about KV cache when they talk about transformer decode.
KV cache is the saved key-value information from previous tokens. It is usually introduced as a speed optimization. Instead of recomputing everything from the beginning every time, the runtime carries useful work forward.
But this essay is not saying that KV cache is the only possible state.
A transformer can also be run by recomputing the whole prefix at each step.
So the claim is not:
KV cache is mathematically required.
The claim is narrower and more useful:
If we choose a proof boundary that carries context forward, decode becomes very naturally described as a transition from one frontier to the next.
That boundary is useful because it matches the way efficient decode is actually run, and it gives the proof system a clear object to check.
Decode is better pictured as a loop
People often talk about transformer inference as if the model were one big function:
text goes in, token comes out.
That picture is fine for many purposes.
For this discussion, a better picture is a loop.
At each step in the loop:
- there is a current boundary,
- there is carried information from earlier steps,
- the model applies the same family of local computations again,
- it produces the next token or logits,
- and it updates the state for the next step.
Now the model is not one monolithic event. It is a repeated evolution of state.
Two ways to tell the same story
The familiar product story says:
the model reads a prompt and predicts the next token.
The execution story says:
the model starts from a current state, applies one more local transition, produces the next token, and writes forward the next state.
Both stories describe the same system. Only the second one makes the computational structure visible.
The token is the visible part
People say:
the model predicts the next token.
True.
But that sentence points at the most visible output, not at the whole computational object.
A more faithful description is:
- the step starts from a current state,
- it performs one more piece of repeated local work,
- it produces logits and maybe a chosen token,
- and it updates the state so the next step can begin.
The step is not just:
token appears.
It is:
state comes in, work happens, state goes out.
The token is the thing a user sees. The state transition is the thing the proof system wants to check.
A moving frontier is the right picture
A helpful visual metaphor is a moving frontier.
At any moment during decode, there is a frontier:
- a current position in the sequence,
- a current state of the execution,
- and a current amount of carried context.
When one more token is generated, that frontier moves forward.
So instead of imagining decode as many independent guesses, imagine one process extending its own frontier.
The current step depends on a carried frontier. The next step depends on the state the current step leaves behind. That is exactly the continuity a trace records.
Repetition is the second half of the story
Continuity alone is not enough. The second property is repetition.
At each decode step, the exact values change:
- the current token position changes,
- the active context grows,
- the precise numbers inside the computation change.
But the kind of work being done is recognizably the same.
If every step were fundamentally different, a row-by-row description would feel artificial. A trace is useful when a repeated local rule keeps being applied across many states.
Transformer decode has that flavor.
It is not perfectly uniform in every low-level sense.
But it is regular enough that the family resemblance from one step to the next is real and useful.
So by this point we have the two ingredients that matter most:
- continuity across steps,
- and repeated local work.
Those two properties make the next question natural:
what if we wrote this process down as a trace?
First, we need to look at what the repeated local work contains.
Attention as repeated work over carried context
Now ask:
What repeated work is attention actually doing?
A good first answer is this:
- the current step prepares a useful view of itself,
- it compares that view against carried context,
- it turns those comparisons into weights,
- and it uses those weights to update the next state.
That is simplified, but it reveals the structure: attention is repeated work over explicit carried context.
First: form useful views
Attention first turns current information into forms that can play different roles. The familiar names are query, key, and value.
The useful point is this:
the computation reorganizes state into structured views so later comparison and update steps can happen in a disciplined way.
That gives us a decomposable story. The block is not one indivisible action; it is a sequence of smaller jobs.
Second: compare against carried context
Now the current step decides what earlier context matters:
- the current query is compared against carried keys,
- those comparisons produce scores,
- and those scores measure how strongly the current step should attend to different parts of the carried context.
The structural point is:
the current step is doing repeated comparison work against an explicit memory surface.
That memory surface is the carried context earlier steps left behind. This is no longer “the model somehow uses context”; it is a repeated comparison rule over a table of carried keys.
Third: normalize and mix
After the scores are produced, they are normalized into weights.
Then the weights select value rows. Their weighted sum becomes a context vector that is folded into the next state.
From an execution perspective, the attention step is:
- start from current state plus carried context,
- build structured intermediates,
- compare them,
- normalize them,
- mix carried values into a context vector,
- carry that context into the next state.
Where proving pressure appears
If transformer computation were only dense linear algebra, the structural story would be flatter.
Attention and transformer blocks include operations that create more distinctive proving pressure:
- normalization,
- nonlinear activation functions,
- range and quantization behavior,
- table-like operations,
- and repeated reads from carried context.
This does not mean proving becomes impossible. It means the workload has shape: arithmetic, lookup-like work, range behavior, and boundary choices all live inside the same repeated loop.
Why the proof boundary matters
Here is where the current research starts, at the beginner level.
Imagine one attention step has two related jobs:
- check the arithmetic that produced the scores and mixed the values;
- check that the table-like work used the allowed rows, ranges, and policy.
One design can prove those jobs as separate objects. That can be valid, but it may repeat the surrounding machinery: commit to a trace, open values, and convince a verifier that each object is the one being claimed.
Another design can put the related work inside one proof boundary.
The lookup work does not disappear. The arithmetic does not disappear. The question is whether the wrapper around that work has to be repeated.
That is the direction we are testing now: not just whether transformer pieces can be proved, but where the proof boundary should live.
This is still not a benchmark claim. It is why the trace framing matters. Once the computation has shape, you can move the boundary and ask what becomes simpler.
From repeated decode to trace
By now, decode is no longer a black box or a sequence of isolated guesses. It is a process with three visible properties:
- it moves forward step by step,
- it carries state from one step to the next,
- and it keeps reusing the same family of local computations.
Once those properties are visible, the natural representation is a trace.
A trace records the state entering a step, enough local work to check the step, and the state leaving the step. Then it repeats that record for the next step.
Why a trace fits decode
Suppose someone claims:
This model generated these tokens by correctly evolving its internal computation one step at a time.
The trace-style version says:
- here is the state before step 0,
- here is what step 0 was allowed to do,
- here is the state after step 0,
- here is the state before step 1,
- here is what step 1 was allowed to do,
- here is the state after step 1,
- and so on.
That is what a trace gives you: not only the final answer, but the checked path that got there.
What a trace row actually is
One trace row should not dump every hidden model internal. It should record the information needed to support the next check.
At a conceptual level, a row wants to do three jobs:
- tell us what state entered the local step,
- expose enough of the local computation to check that the step was valid,
- tell us what state came out and will be handed to the next step.
That is the core pattern:
state in, local work, state out.
That pattern only becomes useful when it stops being a slogan and turns into columns the verifier can read.
A row does not have to mean one human-visible token. Real proving systems may use many rows for one token step, or group operations differently. The point is that row design is where the computational object becomes concrete.
A beginner’s view of one traced decode step
At decode step t, a useful row captures three boundaries:
- the incoming boundary: what state was available at the start,
- the local transition: what family of work was applied,
- the outgoing boundary: what state was produced for the next step.
The row is not the token. The token is visible output; the row is the transition the proof system checks.
Why continuity matters as much as the local step
A row by itself is not enough. The output state of one step must really be the input state of the next.
This is where carried state becomes enforceable:
the state claimed to leave row
tis exactly the state that rowt + 1claims to begin from.
Why STARK-style constraints fit this shape
At a high level, a STARK-style proof wants three things:
- a sequence of states,
- local rules explaining valid evolution,
- and explicit conditions at the boundaries.
That is the same language we have been using for decode. The structural-fit argument is about constraint shape, not branding.
A trace-based STARK-style system naturally asks questions like:
- did each step evolve according to the right local rule?
- did the whole run begin from the right state?
- did it end at the state it claims to have reached?
Those are exactly the questions we care about once decode has been described as a repeated stateful process.
Local transition constraints
A local transition constraint is a disciplined way of saying:
if this is the state entering the step, and the local step is supposed to follow this transition family, then the next state must look like that.
A local transition rule checks that one step legitimately follows from the previous one.
In our running story, that means:
- the step begins from a valid carried boundary,
- the repeated local kernels are applied in the allowed way,
- and the next boundary is the one the step was entitled to produce.
This is why repeated structure matters: transformer decode keeps reusing the same kind of step with different concrete values.
Boundary constraints
Boundary constraints answer the public-claim questions:
- start from this prompt and this frontier,
- run the decode loop,
- end with this output and this resulting frontier.
Without explicit boundaries, the statement of the proof becomes muddy. With explicit boundaries, the proof object can remain narrow and meaningful:
- this execution started here,
- followed these transition rules,
- and ended there.
That is cleaner than saying “prove the whole model” as an undifferentiated slogan. It is also the lesson behind the receipt work in this series: the proof has to bind the computation and the statement it is supposed to certify.
Put together, the claim is not just “there is a table” or “each row looks valid.” The claim needs all three checks at once.
The public statement pins the beginning and end. The local constraints check each row. The continuity constraints make neighboring rows one run instead of valid-looking fragments.
This is not just a wording preference.
In later essays, this boundary language becomes testable.
In the current proof-pressure experiments, the most interesting signal is not “we proved a full model.”
We have not.
The signal is about boundary placement. Some lookup-heavy attention work appears to fit better when the arithmetic and the table checks are accepted as one related proof object instead of separate objects with separate wrappers.
That is the direction Part 2 studies.
In a separate set of small adapter experiments, raw proof verification and application-level statement validity separated cleanly.
The raw proof path could answer a narrow question:
do these proof files verify against this verifier surface?
But the user-facing AI claim needs more:
this model, on this input, under this policy, produced this output, in this verifier domain.
When fields like model label, input label, output label, policy label, verifier domain, or artifact identity lived outside the accepted statement, relabeling stayed possible at the product boundary. When those fields were wrapped into a statement envelope, the same mutations became visible.
That is what the next essays make concrete.
Where the field is moving
This framing is also closer to where zkML research is moving in 2026.
NANOZK pushes layerwise transformer proofs instead of treating the model as one opaque circuit.
Jolt Atlas treats ONNX tensor traces as the proving surface instead of forcing everything through a generic CPU trace.
zkLLM and zkAttn specialize the attention mechanism and tensor lookup work.
DeepProve-1 frames production zkML around graph structure, autoregressive state, quantization, Softmax, LayerNorm, GELU, embeddings, and QKV reuse.
The common direction is not “wrap a model and hope.”
It is proof-system-aware transformer architecture: choose boundaries around the work the proof system actually sees.
This series takes the STARK-native side of that question.
It is not yet a benchmark victory over systems like NANOZK, Jolt Atlas, or DeepProve.
It is a smaller claim:
if transformer work is trace-shaped, then STARK-native proof boundaries may be able to share commitment, opening, lookup, and statement plumbing that generic wrappers duplicate.
Next
This first essay established the shape.
Part 2 asks where the proof work actually concentrates.
The rest of the series
Read the next pieces in order. They turn the trace picture into checked proof-boundary evidence and then into statement-validity discipline.
- 02
Proof Pressure Is Not Just Matrix Multiply
Where lookup-heavy work changes the proof shape.
- 03
Proof Validity Is Not Statement Validity
The claim has to be bound to the proof.
- 04
What a Proof Is Allowed to Mean
The typed receipt around a proof artifact.