Essay

Why Transformers Fit STARKs

Transformers already look trace-shaped if you look at decode the right way: repeated local work, carried state, and boundaries that a proof system has to choose carefully.

Transformer decode is a repeated state transition over carried context. That shape does not make STARKs win automatically, but it makes the proof-boundary question natural.

Omar U. Espejel (Starknet Foundation) and Abdel Stark (StarkWare)
  • Omar U. Espejel: Former machine learning engineer at Hugging Face; now working on crypto and AI at Starknet Foundation.
  • Abdel Stark: Head of Applied AI and Verifiable Intelligence at StarkWare, working on auditable agent systems and proof-backed AI.
  • transformers
  • starks
  • zkml
  • trace systems

TL;DR

  • Transformer decode is a repeated state transition over carried context.
  • That repeated shape maps naturally to STARK-style traces.
  • This does not mean STARKs automatically beat every zkML system; it means the proof-boundary question becomes concrete.
  • The next questions are proof pressure and statement validity: where lookup-heavy transformer work accumulates, and what the verified artifact is allowed to mean.
  • The current research evidence now lives in the next essays: proof pressure, statement validity, and typed proof boundaries.

Series part 1 of 4

Transformers fit STARK-style thinking because decode already looks like a trace: one stateful step after another.

A transformer decoder does not run once.

It takes a state, does local work, writes a new state, and carries context forward.

That is the bridge:

  • state in,
  • local transition,
  • state out,
  • and carried context.

Once you see that shape, the STARK connection becomes much easier to draw.

But this essay is not arguing that STARKs automatically win zkML.

The narrower claim is more useful: transformer decode exposes a repeated state-transition surface, and that makes the proof-boundary question concrete.

This is a live research series. Part 1 is the intuition layer. The next essays make the same picture more concrete: where the proof work gets heavy, what a proof is allowed to mean, and how typed boundaries keep proof artifacts from drifting.

The road is simple.

First, see transformer decode as repeated state transition.

Then ask where the proof actually gets heavy.

Then ask what a verified proof is allowed to mean.

Finally, name the typed receipt around the proof artifact.

That order matters. If the trace picture is fuzzy, the proof-boundary argument feels abstract. If the receipt problem is fuzzy, the word Tablero arrives too early.

The bridge

At first, transformers and STARKs look like they belong to different worlds.

A transformer is usually introduced as a model that predicts the next token.

You give it some text. It reads that text. It produces a probability distribution over what should come next. Then, once one token is chosen, the process repeats.

I wrote an older Spanish transformer-intuition notebook in 2021. It explains the usual attention story: query, key, value, multi-head attention, and the model-facing mechanics.

This essay starts one level lower.

It asks what proof shape that computation exposes.

Transformer and STARK execution shapes A simple static diagram explaining one part of the transformer to STARK execution-record bridge. The transformer story During decode, the same transformer is run again and again. current text same transformer next token one transformer, reused The transformer cat step 0 The cat transformer computes step 1 The cat sat transformer on step 2 The cat sat on transformer the step 3 The cat sat on the transformer mat step 4
Transformer view This is the familiar product picture repeated over time: current text goes in, the same transformer runs, and one more token comes out.

A STARK is usually introduced as a proof system for computation.

You describe a computation as a sequence of states, define rules for how one state may turn into the next, and prove the sequence obeyed those rules.

In STARK language, the written version of that run is called a trace.

A trace is a table of execution. Each row records enough state to check one step of the computation.

The important part is the handoff. The state produced by one row has to be the state used by the next row. When those handoffs line up, the rows describe one continuous run.

A concrete example is Cairo. You write a normal-looking program, but the Cairo VM runs lower-level instructions and produces a row-by-row trace of machine state.

The proof is not just:

“the function returned 5, trust me.”

It is closer to:

“the machine started here, followed valid steps, and ended there.”

That is the shift we need for transformers too. The user sees tokens; the proof system wants the state transitions that produced them.

Transformer and STARK execution shapes A simple static diagram explaining one part of the transformer to STARK execution-record bridge. The STARK story A trace is a table. Each row checks one small move in the run. row state before rule checked state after row 0 state 0 local rule state 1 row 1 state 1 local rule state 2 row 2 state 2 local rule state 3 row 3 state 3 local rule state 4 row 4 state 4 local rule state 5 same local rule, reused new states each row, same check shape handoff state after becomes the next state before
Trace view This is the proof-system picture: the run is many rows, and each row checks that one state is allowed to become the next.

The core idea of this essay is this:

The short version

Transformer decode already has the kind of shape that a trace-based proof system likes: repeated local work, carried context, and clear boundaries between one step and the next.

Here is the bridge in the smallest possible form: decode extends a prefix one step at a time; a trace checks a computation one row at a time.

The names differ, but the shape is the same: ordered steps, with each step beginning where the previous step ended.

Transformer and STARK execution shapes A simple static diagram explaining one part of the transformer to STARK execution-record bridge. The bridge The repeated decode loop can be written down as repeated trace rows. decode step trace row five decode steps five trace rows same order, different notation step 0 The -> cat row 0 state 0 -> state 1 step 1 The cat -> sat row 1 state 1 -> state 2 step 2 The cat sat -> on row 2 state 2 -> state 3 step 3 The cat sat on -> the row 3 state 3 -> state 4 step 4 The cat sat on the -> mat row 4 state 4 -> state 5 write as write as write as write as write as
Bridge view The bridge is row-by-row: each decode step can be recorded as a checkable trace row.

Take step 2. In the decode view, the text before the step is The cat sat. The transformer runs once and produces the next visible token, on. After the step, the visible text has become The cat sat on.

The trace view records the same event with different names. The computation begins from state 2, applies the decode rule, and ends at state 3.

The token on is what the reader sees. The state transition is what the proof system checks.

One decode step becomes one trace row One line of the bridge, zoomed in: the decode step is the human view, and the trace row is the prover view of the same transition. One line of the bridge, zoomed in The token story and the trace story are two views of the same transition. decode view before the step The cat sat transformer runs once after the step The cat sat on visible token: on trace view row 2 state 2 decode rule state 3 includes the before boundary checks the step was valid becomes next row's input
Bridge close-upOne line of the bridge, zoomed in: the decode step is the human view, and the trace row is the prover view of the same transition.

If we choose the state carefully, the token-by-token process can be written as a row-by-row execution record.

Start with the wrong picture

The wrong picture is this:

A transformer is one giant black box.

A proof system is some wrapper around it.

The only interesting question is whether the wrapper is powerful enough to handle the black box.

That picture is too blunt.

It hides the exact thing that matters most: the internal shape of the computation.

This is why I like Percepta’s “Can LLMs Be Computers?” as a starting point.

Percepta asked whether transformers can be computers. Abdel’s follow-up asks whether those transformer-computers can be proven. This series asks a neighboring question: if transformer computation is trace-shaped, where should STARK-native proof boundaries live?

That framing shifts the reader away from “language model as mysterious oracle” and toward “language model as a process that can execute structured work.”

That is the right first move.

But for this essay, the question is narrower:

if a transformer is a process, what proof surface does that process naturally expose?

A proof system is not neutral about shape.

Black-box wrapper versus execution surface A diagram contrasting the black-box proof-wrapper picture with the trace-facing execution-surface picture. The better question is shape Do not start with "can a proof wrapper swallow the model?" Start with the execution surface. less useful picture one giant black box question: can the wrapper handle it? more useful picture state in local rule state out repeat this shape row after row question: what does execution expose? The structural fit comes from the second picture, not from proof-system branding.
Proof surface The useful framing is not whether a wrapper can hide a model. It is what state-transition surface the computation exposes.

Some proof systems are more naturally expressed when a computation is flattened into one large algebraic relation.

Others feel more natural when the computation can be written as:

  • a sequence of states,
  • a repeated local rule,
  • and explicit conditions on how the process begins and ends.

So before asking “can we prove a transformer?”, it helps to ask a better question:

What kind of computation is a transformer performing during decode?

That shift is small in wording, but very large in effect.

Once you stop treating the model as one black box and start looking at the structure of the process, new things become visible:

  • repetition,
  • continuity,
  • carried context,
  • and meaningful boundaries between one step and the next.

Those are exactly the kinds of things a trace-based proof system wants to see.

What decode actually means

Let us start with the most important word in this essay:

decode.

In this setting, decode means the token-by-token process by which a transformer generates text.

Suppose the model has already read the prompt:

The cat sat on the

The model now has to decide what comes next.

It might assign high probability to words like:

  • mat,
  • floor,
  • chair.

Suppose we pick mat.

Now the text has changed:

The cat sat on the mat

The model is run again, now on this longer prefix, to decide what should come after mat.

That process repeats:

  • take the current prefix,
  • produce the next-token distribution,
  • choose or sample a token,
  • append it,
  • and repeat.

That repeated token-by-token process is decode.

If you only look at the output side, decode seems very simple:

  1. read current text,
  2. predict next token,
  3. repeat.

That is not wrong.

But it hides the deeper structure.

The model is not producing a set of unrelated guesses.

Each step is meaningful because earlier steps have already created a context from which the next step can begin.

That is where state enters the story.

What a state means here

In this essay, a state means:

the information that must be available at one step so the next step is well-defined.

It is a boundary between one step and the next.

Depending on how you choose to describe execution, that state can include things like:

  • the current token position,
  • the internal representation of the current prefix,
  • cached information derived from previous tokens,
  • and any commitments or summaries that tell the next step where it is allowed to start.

The next token step does not begin from nowhere.

It begins from a state that previous work created.

That is what makes decode a process rather than a collection of isolated predictions.

A useful mental model is this:

  • a decode step takes in a state,
  • does one more piece of local work,
  • emits a visible result,
  • and produces the next state.

That is already the beginning of a state-transition view.

Decode as repeated state transition A static diagram describing transformer decode as state in, repeated local work, token output, and state out. One decode step The token is visible. The row is the object a proof has to reason about. row state in local transition visible token state out row t one step frontier t prefix + KV cache attention → MLP normalization, ranges "on" frontier t + 1 what the table makes clear The token is just one cell. The proof-facing object is the whole row. binds input boundary checks local work emits next boundary
Decode shape A useful proof boundary treats a decode step as state in, local work, and state out. The token is only the visible part.

State does not have to mean KV cache

People often talk about KV cache when they talk about transformer decode.

KV cache is the saved key-value information from previous tokens. It is usually introduced as a speed optimization. Instead of recomputing everything from the beginning every time, the runtime carries useful work forward.

But this essay is not saying that KV cache is the only possible state.

A transformer can also be run by recomputing the whole prefix at each step.

So the claim is not:

KV cache is mathematically required.

The claim is narrower and more useful:

If we choose a proof boundary that carries context forward, decode becomes very naturally described as a transition from one frontier to the next.

That boundary is useful because it matches the way efficient decode is actually run, and it gives the proof system a clear object to check.

Decode as repeated state transition A static diagram describing transformer decode as state in, repeated local work, token output, and state out. State is a boundary choice The essay does not depend on KV cache. It depends on making the inherited frontier explicit. valid but heavier boundary full prefix tokens 0..t recompute context works, but repeats more next prefix tokens 0..t+1 carried-frontier boundary frontier t position + carried summary extend boundary checks inherited state frontier t+1 next allowed start The proof boundary must say what gets inherited. KV cache is one implementation of that idea, not the idea itself.
Decode shape State is the inherited boundary between steps. KV cache is a useful example, not the only possible boundary.

Decode is better pictured as a loop

People often talk about transformer inference as if the model were one big function:

text goes in, token comes out.

That picture is fine for many purposes.

For this discussion, a better picture is a loop.

At each step in the loop:

  1. there is a current boundary,
  2. there is carried information from earlier steps,
  3. the model applies the same family of local computations again,
  4. it produces the next token or logits,
  5. and it updates the state for the next step.

Now the model is not one monolithic event. It is a repeated evolution of state.

Two ways to tell the same story

The familiar product story says:

the model reads a prompt and predicts the next token.

The execution story says:

the model starts from a current state, applies one more local transition, produces the next token, and writes forward the next state.

Both stories describe the same system. Only the second one makes the computational structure visible.

Decode as repeated state transition A static diagram describing transformer decode as state in, repeated local work, token output, and state out. Decode is a loop, not one event The prefix changes, but the same transition shape is reused at every token step. step state before same decode work visible token state after step 0 state 0 transformer cat state 1 step 1 state 1 transformer sat state 2 step 2 state 2 transformer on state 3 step 3 state 3 transformer the state 4 step 4 state 4 transformer mat state 5 same transition family, reused
Decode shape Seen as a loop, decode becomes a repeated transition family instead of one giant black-box event.

The token is the visible part

People say:

the model predicts the next token.

True.

But that sentence points at the most visible output, not at the whole computational object.

A more faithful description is:

  • the step starts from a current state,
  • it performs one more piece of repeated local work,
  • it produces logits and maybe a chosen token,
  • and it updates the state so the next step can begin.

The step is not just:

token appears.

It is:

state comes in, work happens, state goes out.

The token is the thing a user sees. The state transition is the thing the proof system wants to check.

A moving frontier is the right picture

A helpful visual metaphor is a moving frontier.

At any moment during decode, there is a frontier:

  • a current position in the sequence,
  • a current state of the execution,
  • and a current amount of carried context.

When one more token is generated, that frontier moves forward.

So instead of imagining decode as many independent guesses, imagine one process extending its own frontier.

The current step depends on a carried frontier. The next step depends on the state the current step leaves behind. That is exactly the continuity a trace records.

Decode as repeated state transition A static diagram describing transformer decode as state in, repeated local work, token output, and state out. The frontier moves forward The visible sequence grows, and the carried surface grows with it. t0 t1 t2 t3 next carried context larger frontier
Decode shape The frontier picture makes carried context explicit: later steps inherit a larger surface from earlier steps.

Repetition is the second half of the story

Continuity alone is not enough. The second property is repetition.

At each decode step, the exact values change:

  • the current token position changes,
  • the active context grows,
  • the precise numbers inside the computation change.

But the kind of work being done is recognizably the same.

If every step were fundamentally different, a row-by-row description would feel artificial. A trace is useful when a repeated local rule keeps being applied across many states.

Transformer decode has that flavor.

It is not perfectly uniform in every low-level sense.

But it is regular enough that the family resemblance from one step to the next is real and useful.

So by this point we have the two ingredients that matter most:

  • continuity across steps,
  • and repeated local work.

Those two properties make the next question natural:

what if we wrote this process down as a trace?

First, we need to look at what the repeated local work contains.

Decode as repeated state transition A static diagram describing transformer decode as state in, repeated local work, token output, and state out. Continuity is the claim The next row must really begin from the state the previous row produced. row state in local rule state out continuity check row t state A decode state B starts the chain row t+1 state B decode state C B = B row t+2 state C decode state D C = C state out in one row must equal state in in the next row
Decode shape A trace is convincing only if the rows connect. Local correctness without continuity is not a coherent execution.

Attention as repeated work over carried context

Now ask:

What repeated work is attention actually doing?

A good first answer is this:

  • the current step prepares a useful view of itself,
  • it compares that view against carried context,
  • it turns those comparisons into weights,
  • and it uses those weights to update the next state.

That is simplified, but it reveals the structure: attention is repeated work over explicit carried context.

First: form useful views

Attention first turns current information into forms that can play different roles. The familiar names are query, key, and value.

The useful point is this:

the computation reorganizes state into structured views so later comparison and update steps can happen in a disciplined way.

That gives us a decomposable story. The block is not one indivisible action; it is a sequence of smaller jobs.

Form useful views The first move is not mysterious: the current state is reorganized into query, key, and value views. Form useful views Attention starts by giving the current state different roles. current state query key value same state, three roles query asks key is compared value is mixed later
AttentionThe first move is not mysterious: the current state is reorganized into query, key, and value views.

Second: compare against carried context

Now the current step decides what earlier context matters:

  • the current query is compared against carried keys,
  • those comparisons produce scores,
  • and those scores measure how strongly the current step should attend to different parts of the carried context.

The structural point is:

the current step is doing repeated comparison work against an explicit memory surface.

That memory surface is the carried context earlier steps left behind. This is no longer “the model somehow uses context”; it is a repeated comparison rule over a table of carried keys.

Compare with carried context The current query is compared against carried keys. This is where previous context becomes explicit. Compare with carried context The current query is compared with several keys carried from earlier tokens. query now carried keys key from The key from cat key from sat key from on scores one score per carried key
AttentionThe current query is compared against carried keys. This is where previous context becomes explicit.

Third: normalize and mix

After the scores are produced, they are normalized into weights.

Normalize scores Raw scores are turned into weights. This is one of the places where proving pressure appears. Normalize scores Raw comparisons become weights over the carried context. raw scores normalize weights same positions, now weighted
AttentionRaw scores are turned into weights. This is one of the places where proving pressure appears.

Then the weights select value rows. Their weighted sum becomes a context vector that is folded into the next state.

From an execution perspective, the attention step is:

  • start from current state plus carried context,
  • build structured intermediates,
  • compare them,
  • normalize them,
  • mix carried values into a context vector,
  • carry that context into the next state.
Weighted values become carried state Attention is a table-shaped read: weights select value rows, their sum becomes context, and that context is carried forward. Weighted values become carried state The weights from the last step choose value rows. Their sum becomes the context carried forward. carried value table token weight value row contribution The 0.08 small cat 0.14 small sat 0.31 medium on 0.47 largest sum context vector cₜ weighted summary of the carried value rows next state carries cₜ plus the updated KV cache proof-facing relation: cₜ = Σᵢ weightᵢ · valueᵢ
AttentionAttention is a table-shaped read: weights select value rows, their sum becomes context, and that context is carried forward.

Where proving pressure appears

If transformer computation were only dense linear algebra, the structural story would be flatter.

Attention and transformer blocks include operations that create more distinctive proving pressure:

  • normalization,
  • nonlinear activation functions,
  • range and quantization behavior,
  • table-like operations,
  • and repeated reads from carried context.

This does not mean proving becomes impossible. It means the workload has shape: arithmetic, lookup-like work, range behavior, and boundary choices all live inside the same repeated loop.

Why the proof boundary matters

Here is where the current research starts, at the beginner level.

Imagine one attention step has two related jobs:

  • check the arithmetic that produced the scores and mixed the values;
  • check that the table-like work used the allowed rows, ranges, and policy.

One design can prove those jobs as separate objects. That can be valid, but it may repeat the surrounding machinery: commit to a trace, open values, and convince a verifier that each object is the one being claimed.

Another design can put the related work inside one proof boundary.

The lookup work does not disappear. The arithmetic does not disappear. The question is whether the wrapper around that work has to be repeated.

That is the direction we are testing now: not just whether transformer pieces can be proved, but where the proof boundary should live.

Choosing where the proof boundary lives A visual comparison of proving attention arithmetic and table membership as two separate proof objects or as one shared proof boundary. Same work. Different proof boundary. The research question is where related transformer work should be packaged. separate proof objects one shared boundary attention arithmetic scores, weights, value mix table membership ranges, rows, policy two wrappers carry similar machinery one proof object attention arithmetic table membership checked inside one accepted boundary one shared wrapper The lookup work is still there. The question is whether the surrounding proof machinery has to be repeated.
Boundary choice The beginner version of the current research: keep the work, move the boundary, then see what proof machinery can be shared.

This is still not a benchmark claim. It is why the trace framing matters. Once the computation has shape, you can move the boundary and ask what becomes simpler.

From repeated decode to trace

By now, decode is no longer a black box or a sequence of isolated guesses. It is a process with three visible properties:

  • it moves forward step by step,
  • it carries state from one step to the next,
  • and it keeps reusing the same family of local computations.

Once those properties are visible, the natural representation is a trace.

A trace records the state entering a step, enough local work to check the step, and the state leaving the step. Then it repeats that record for the next step.

A trace is an execution notebook Instead of one hidden computation, the run is written down one step at a time. A trace is an execution notebook The run is written down one transition at a time. row state before rule checked state after row 0 state 0 rule state 1 row 1 state 1 rule state 2 row 2 state 2 rule state 3 row 3 state 3 rule state 4 row 4 state 4 rule state 5 same rule shape, checked row after row
TraceInstead of one hidden computation, the run is written down one step at a time.

Why a trace fits decode

Suppose someone claims:

This model generated these tokens by correctly evolving its internal computation one step at a time.

The trace-style version says:

  • here is the state before step 0,
  • here is what step 0 was allowed to do,
  • here is the state after step 0,
  • here is the state before step 1,
  • here is what step 1 was allowed to do,
  • here is the state after step 1,
  • and so on.

That is what a trace gives you: not only the final answer, but the checked path that got there.

What a trace row actually is

One trace row should not dump every hidden model internal. It should record the information needed to support the next check.

At a conceptual level, a row wants to do three jobs:

  1. tell us what state entered the local step,
  2. expose enough of the local computation to check that the step was valid,
  3. tell us what state came out and will be handed to the next step.

That is the core pattern:

state in, local work, state out.

That pattern only becomes useful when it stops being a slogan and turns into columns the verifier can read.

One trace row is a contract A useful row names concrete columns: incoming frontier, checked work, emitted token evidence, and outgoing frontier. One trace row is a contract The row is useful only when the verifier can point at columns and constraints. row incoming frontier what this step is allowed to start from checked local work columns that make attention checkable outgoing frontier the boundary handed to the next row t pos=t, h_t, kv_root_t QKV attention scores pos=t+1, h_t+1, kv_root_t+1 same row prompt/model boundary range checks softmax table rows token/logits evidence verifier reads state columns not hidden prose constraint columns what must be checked next-row inputs what must match later the row becomes concrete when each word maps to columns
TraceA useful row names concrete columns: incoming frontier, checked work, emitted token evidence, and outgoing frontier.

A row does not have to mean one human-visible token. Real proving systems may use many rows for one token step, or group operations differently. The point is that row design is where the computational object becomes concrete.

A beginner’s view of one traced decode step

At decode step t, a useful row captures three boundaries:

  • the incoming boundary: what state was available at the start,
  • the local transition: what family of work was applied,
  • the outgoing boundary: what state was produced for the next step.

The row is not the token. The token is visible output; the row is the transition the proof system checks.

Why continuity matters as much as the local step

A row by itself is not enough. The output state of one step must really be the input state of the next.

This is where carried state becomes enforceable:

the state claimed to leave row t is exactly the state that row t + 1 claims to begin from.

Local checks are not enough The continuity constraint is the equality check between one row output and the next row input. Local checks are not enough A row can look locally valid while still not continuing the row before it. bad: valid pieces, broken handoff row t out: B row t+1 in: X local rules OK but not one run B != X good: adjacent rows agree on the handoff row state in local rule state out continuity check t A rule OK B B = B t+1 B rule OK C C = C continuity is the adjacent-row equality: out(t) = in(t+1)
TraceThe continuity constraint is the equality check between one row output and the next row input.

Why STARK-style constraints fit this shape

At a high level, a STARK-style proof wants three things:

  • a sequence of states,
  • local rules explaining valid evolution,
  • and explicit conditions at the boundaries.

That is the same language we have been using for decode. The structural-fit argument is about constraint shape, not branding.

A trace-based STARK-style system naturally asks questions like:

  • did each step evolve according to the right local rule?
  • did the whole run begin from the right state?
  • did it end at the state it claims to have reached?

Those are exactly the questions we care about once decode has been described as a repeated stateful process.

Local transition constraints

A local transition constraint is a disciplined way of saying:

if this is the state entering the step, and the local step is supposed to follow this transition family, then the next state must look like that.

A local transition rule checks that one step legitimately follows from the previous one.

In our running story, that means:

  • the step begins from a valid carried boundary,
  • the repeated local kernels are applied in the allowed way,
  • and the next boundary is the one the step was entitled to produce.

This is why repeated structure matters: transformer decode keeps reusing the same kind of step with different concrete values.

Local transition constraints The middle of the trace asks the same question again and again: did this row evolve by the allowed rule? Local transition checks Every middle row asks the same question: did this state evolve by the allowed rule? row state in constraint check state out row 0 state 0 rule holds? state 1 row 1 state 1 rule holds? state 2 row 2 state 2 rule holds? state 3 row 3 state 3 rule holds? state 4 row 4 state 4 rule holds? state 5 same constraint shape, checked repeatedly
STARK constraintsThe middle of the trace asks the same question again and again: did this row evolve by the allowed rule?

Boundary constraints

Boundary constraints answer the public-claim questions:

  • start from this prompt and this frontier,
  • run the decode loop,
  • end with this output and this resulting frontier.

Without explicit boundaries, the statement of the proof becomes muddy. With explicit boundaries, the proof object can remain narrow and meaningful:

  • this execution started here,
  • followed these transition rules,
  • and ended there.

That is cleaner than saying “prove the whole model” as an undifferentiated slogan. It is also the lesson behind the receipt work in this series: the proof has to bind the computation and the statement it is supposed to certify.

Boundary constraints Boundary constraints pin the private trace to the public claim: this start, this program boundary, and this claimed output. Boundary constraints pin the trace Local rules prove the rows connect. Boundary rules say this is the claimed run. private execution trace row state entering row transition check state leaving row 0 prompt frontier H₀ decode rule state 1 1 state 1 same rule state 2 ... carried state same rule carried state T state T output rule output hash Hₜ public claim initial pin row 0 must equal prompt frontier H₀ terminal pin row T must equal claimed output Hₜ why it matters: transition checks prove a valid path; boundary checks prove it is this path
STARK constraintsBoundary constraints pin the private trace to the public claim: this start, this program boundary, and this claimed output.

Put together, the claim is not just “there is a table” or “each row looks valid.” The claim needs all three checks at once.

The public statement pins the beginning and end. The local constraints check each row. The continuity constraints make neighboring rows one run instead of valid-looking fragments.

A claimed run needs pins, rules, and links The statement is credible only when public pins, per-row rules, and adjacent-row equality checks are bound together. A valid row table is still not the public claim The verifier has to bind the table to pins, rules, and row-to-row equality. public statement prompt root P model/policy M output root Y final frontier F_T these are the facts the proof is supposed to certify row state in local rule state out continuity 0 P, F_0 pinned start rule_0 OK F_1 F_1 = F_1 1 F_1 rule_1 OK F_2 F_2 = F_2 T F_T-1 rule_T OK Y, F_T pinned end all links OK one verifier-facing statement public pins + row rules + equality links = one claimed execution
TraceThe statement is credible only when public pins, per-row rules, and adjacent-row equality checks are bound together.

This is not just a wording preference.

In later essays, this boundary language becomes testable.

In the current proof-pressure experiments, the most interesting signal is not “we proved a full model.”

We have not.

The signal is about boundary placement. Some lookup-heavy attention work appears to fit better when the arithmetic and the table checks are accepted as one related proof object instead of separate objects with separate wrappers.

That is the direction Part 2 studies.

In a separate set of small adapter experiments, raw proof verification and application-level statement validity separated cleanly.

The raw proof path could answer a narrow question:

do these proof files verify against this verifier surface?

But the user-facing AI claim needs more:

this model, on this input, under this policy, produced this output, in this verifier domain.

When fields like model label, input label, output label, policy label, verifier domain, or artifact identity lived outside the accepted statement, relabeling stayed possible at the product boundary. When those fields were wrapped into a statement envelope, the same mutations became visible.

That is what the next essays make concrete.

Where the field is moving

This framing is also closer to where zkML research is moving in 2026.

NANOZK pushes layerwise transformer proofs instead of treating the model as one opaque circuit.

Jolt Atlas treats ONNX tensor traces as the proving surface instead of forcing everything through a generic CPU trace.

zkLLM and zkAttn specialize the attention mechanism and tensor lookup work.

DeepProve-1 frames production zkML around graph structure, autoregressive state, quantization, Softmax, LayerNorm, GELU, embeddings, and QKV reuse.

The common direction is not “wrap a model and hope.”

It is proof-system-aware transformer architecture: choose boundaries around the work the proof system actually sees.

This series takes the STARK-native side of that question.

It is not yet a benchmark victory over systems like NANOZK, Jolt Atlas, or DeepProve.

It is a smaller claim:

if transformer work is trace-shaped, then STARK-native proof boundaries may be able to share commitment, opening, lookup, and statement plumbing that generic wrappers duplicate.

Next

This first essay established the shape.

Part 2 asks where the proof work actually concentrates.

The rest of the series

Read the next pieces in order. They turn the trace picture into checked proof-boundary evidence and then into statement-validity discipline.

  1. 02

    Proof Pressure Is Not Just Matrix Multiply

    Where lookup-heavy work changes the proof shape.

  2. 03

    Proof Validity Is Not Statement Validity

    The claim has to be bound to the proof.

  3. 04

    What a Proof Is Allowed to Mean

    The typed receipt around a proof artifact.