Thesis
When people explain transformers, they usually explain them as prediction machines.
A prompt goes in. A probability distribution over the next token comes out. Repeat.
When people explain STARKs, they usually explain them as proof systems over traces.
A computation is written as a sequence of states. Local rules explain how one state becomes the next. Boundary constraints pin down where the computation starts and where it ends.
Both of those explanations are fine on their own. The problem is that they are usually taught in different rooms.
So the bridge between them ends up sounding much more mysterious than it really is.
This essay is trying to make that bridge visible.
Not by making a giant claim. Not by saying that everything about large models is suddenly solved. Not by pretending that one neat visual analogy proves a whole research program.
Just by asking a more useful question.
What kind of proof architecture matches the shape of transformer decode most naturally?
That wording matters.
It is more precise than asking “can transformers be proved?” and much more useful than asking “are STARKs better than SNARKs?” in the abstract.
Those broader questions are not meaningless. They are just too coarse to help us see the computational object clearly.
If a workload is mostly one giant static relation, we should expect one kind of proof representation to feel natural.
If a workload is a long repeated sequence of related states, with explicit continuity from one step to the next, we should expect something else to feel more natural.
Transformer decode is much closer to the second picture.
That is the core claim of this page.
The short version
If you cut transformer execution at the right boundary, decode already has the shape that a trace-based proof system wants: repeated local work, explicit carried state, and meaningful boundaries between one step and the next.
That sentence is doing a lot of work. So instead of moving on quickly, let us unpack it carefully.
If you are new to both sides of this topic, there is a simple way to read the page.
On the transformer side, we are asking what the model keeps doing during decode.
On the STARK side, we are asking what kind of computation is easiest to describe as a sequence of states with small local checks.
The whole essay lives in the overlap between those two questions.
We are not trying to force a proof system onto a model by pure willpower. We are asking whether the model’s own execution shape already points toward one style of proof object more than another.
That is also why this page keeps returning to “shape.” Shape is the beginner-friendly version of the question. Before we worry about exact proof costs, exact rows, or exact artifacts, we first ask: what kind of thing is this computation? Is it a one-shot object? Is it a loop? Does it carry state? Does it reuse the same kind of step? Those questions are easier to answer, and they lead naturally into the harder ones.
Start with the wrong mental picture
The wrong mental picture is this:
- a transformer is one giant black box,
- a proof system is some wrapper placed around it,
- and the only real question is whether the wrapper is powerful enough.
That picture is too blunt.
It hides the exact thing that matters most: internal shape.
A proof system is not neutral about structure. Different proving architectures are comfortable with different kinds of computational objects.
Some are happiest when the whole workload is presented as one large relation.
Others become more compelling when the computation can be written as:
- a sequence of states,
- a repeated family of local transition rules,
- and explicit boundary conditions.
That is the first shift this essay wants to make.
Do not start with “proof system around model.” Start with “what is the actual computational object during decode?”
Now replace the black box with a loop
A transformer used autoregressively is not really behaving like one giant one-shot function.
It is behaving like a loop.
At each step, something like this happens:
- there is a current execution boundary,
- there is carried context from previous steps,
- the model applies the same family of local computations again,
- it produces a new result for the current step,
- and it updates the carried state for the next step.
Even before we say anything about proofs, that is already a very different picture.
Now the model is not a single monolithic event. It is a repeated stateful process.
That is the picture we need.
Because once you see decode that way, terms like trace, transition, and boundary stop sounding like proof jargon imported from somewhere else.
They start sounding like faithful descriptions of what the computation was already doing.
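The loop described above can be sketched in a few lines. Everything here is illustrative: `model_step` is a hypothetical stand-in for one full forward pass at one position, and the greedy pick is just one sampling choice.

```python
# A minimal sketch of decode as a stateful loop. `model_step` is a
# hypothetical stand-in for one forward pass at one position: it reads
# the carried state and returns logits plus an updated state.
def decode(model_step, prompt_state, max_new_tokens):
    state = prompt_state  # carried boundary built from the prompt
    tokens = []
    for _ in range(max_new_tokens):
        # the same family of local work, applied again at the new frontier
        logits, state = model_step(state)
        token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(token)
        # `state` is now the boundary the next iteration inherits
    return tokens, state
```

The point of the sketch is the shape, not the contents: one repeated step family, one carried state threaded through every iteration.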
Why this is worth slowing down for
It is easy to underestimate how much this changes.
If you think of a transformer only as “a network that outputs the next token,” then a trace-based proof representation can feel artificial.
Why rows? Why boundaries? Why talk about carried state so much?
But if you think of decode as a repeated evolution of state over time, those same concepts stop feeling artificial.
They become the obvious bookkeeping language.
The job of the rest of this essay is simply to make that obviousness visible.
The path will be:
- first make decode look like a state transition process,
- then make attention look like repeated local work instead of magic,
- then show how those steps naturally become a trace,
- then show why local constraints and boundary constraints line up with that trace,
- and only after that connect the picture to the artifact surfaces in the repo and paper.
That order matters.
If we jump to proof artifacts too early, the whole thing sounds like a product pitch. If we stay with the computational shape long enough, the later proof surfaces start to feel earned.
Decode as state transition
Let us begin with the part that is easiest to say and hardest to really internalize.
Autoregressive inference is a loop.
You know this already at a high level. But it is worth spelling out the loop slowly, because most later confusion starts when this step is rushed.
A prompt is not simply “consumed” and then forgotten. The system does not begin fresh at every generated token. What happened before matters, and it matters in a structured way.
At step t, the model begins from a meaningful frontier:
- there is a current token position,
- there is carried context from earlier steps,
- and there is a current state from which the next local computation starts.
Then the step runs. Then the frontier moves. Then the next step begins from that new frontier.
That is state transition language already.
For beginners, it helps to compare two ways of telling the same story.
The short product story is:
- the model reads a prompt,
- predicts one token,
- then predicts the next one.
The execution story is:
- the model starts from a current boundary,
- reads carried context,
- runs one more local transition,
- and produces the next boundary.
Both stories describe the same system. The second one is the one that matters for proving.
Decode structure
Repeated state transition, one inherited boundary at a time
1. A step begins from inherited state
Before the next token is produced, the machine already has a prefix, a position, and a cache frontier. Decode begins from an inherited state boundary, not from a blank page.
The token is the visible part, not the whole object
One reason this gets muddled is that the token is the most visible output.
People say: the model predicts the next token. That is true.
But if we only say that, we lose the deeper structure.
The more faithful description is:
- the step starts from a carried execution boundary,
- it performs a repeated family of local computations,
- it produces logits and possibly a selected token,
- and it updates the carried state that makes the next step meaningful.
The token matters. The logits matter. But for proving, the most important object is often the transition itself.
What came in? What local work happened? What state came out? Why is the next step allowed to begin from that output?
That is the object we need to keep in view.
It helps to make this concrete with a tiny example.
Suppose the prompt is “The cat sat on the”.
At the surface level, we say the model is choosing a next token such as “mat”.

At the execution level, something richer is happening:
- the step inherits everything built from the earlier words,
- the step reads the carried context,
- the step computes logits for the next position,
- and the step writes forward a new boundary that the following step will inherit.
The visible token is only the user-facing part of that step. The full computational object is the transition that produced it.
Think in terms of a moving frontier
A good way to picture decode is not as a chain of disconnected predictions, but as a moving frontier.
Each step advances that frontier.
At the beginning of the step, the frontier tells you what state is available. At the end of the step, the frontier has moved forward.
That is a better mental model than “one prompt plus one prediction.”
Why?
Because a moving frontier makes continuity visible.
And continuity is one of the main reasons trace-based proving starts to feel natural here.
If the only thing that mattered were a single isolated output, then the natural proof representation might be very different.
But decode is not isolated output. It is continuity under repeated local work.
Another way to say this is that decode behaves less like a calculator that answers one question and more like a process that keeps moving its own frontier forward.
What exactly is being carried?
Now let us get slightly more concrete.
When we say “state” here, we do not mean some mystical hidden essence of the model. We mean the information that must survive from one step to the next so that the next step is well-defined.
Depending on the execution boundary, that carried information can include things like:
- the current token index,
- hidden-state information relevant to the next local block,
- the cache frontier built from previous tokens,
- and any commitments or summaries that define where the next step begins.
The exact engineering boundary can vary. The conceptual point does not.
A decode step does not begin from zero. It begins from a frontier that earlier steps created.
That is one of the most important facts in this essay.
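One way to make “carried state” less abstract is to name its parts. The grouping below is an illustrative sketch, not a fixed specification; real runtimes draw this boundary differently and the field names are hypothetical.

```python
from dataclasses import dataclass, field

# An illustrative grouping of what one decode step inherits.
# The field names are hypothetical; real runtimes draw this
# boundary differently.
@dataclass
class Frontier:
    position: int                                  # current token index
    kv_cache: list = field(default_factory=list)   # keys/values from earlier steps
    hidden: list = field(default_factory=list)     # state entering the next block

    def advance(self, new_kv, new_hidden):
        # the next step begins from the frontier this step writes forward
        return Frontier(self.position + 1,
                        self.kv_cache + [new_kv],
                        new_hidden)
```

Notice that `advance` returns a new frontier rather than mutating the old one: that is the “state in, state out” discipline the later sections depend on.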
State continuity
Decode is a moving frontier, not isolated token guesses
1. Prefix already exists
Decode begins from an existing state surface. The prompt has already created a cache frontier, and the next step inherits it rather than inventing a new machine from scratch.
What stays stable
The transition family: read boundary, compute local work, write boundary.
What changes
The concrete token, the concrete cache contents, and the exact numerical values inside the frontier.
Why the visual matters
The proof question is easier once you see decode as continuity over rails, not as disconnected guesses.
Frontier growth
More decode means a larger carried frontier and more rows to account for
1. The frontier does not reset
Each decode step begins from the frontier left behind by the previous one. The carried object keeps growing rather than starting over.
Bars
How many decode rows have been executed so far.
Blue step line
How large the carried frontier has become by that point.
Why carried state is more important than people expect
It is tempting to treat carried state as mere implementation detail.
That is a mistake.
Carried state is not just a performance optimization. It is not just there to make generation faster. It is the thing that turns a series of repeated local computations into a coherent execution.
If we removed carried continuity entirely and forced every step to begin from nowhere, we would not just slow the model down. We would change the computational object.
That is why proof language starts talking about:
- state in,
- state out,
- and continuity across steps.
Those are not formal decorations added by proof people. Those are faithful descriptions of the execution that the runtime already performs.
This is also why the cache should not be treated as an afterthought in a discussion like this. For users, the cache often looks like a speed trick. For proving, it is more helpful to read it as part of the state boundary itself.
Why repeated work matters
Now add the second key property.
Not only does the system carry state. It also repeats the same family of local work over and over.
The exact values change at each position. The active context grows. The specific comparisons change.
But the kind of work is recognizably the same.
That matters because proof systems love reuse.
If each decode step were fundamentally different, the argument for a trace would be much weaker. A trace works best when the computation is long, regular, and locally constrained by reusable rules.
Transformer decode is not perfectly uniform in every low-level detail. But it is structured enough that the family resemblance between one step and the next is doing real work.
So by the end of this section we already have two ingredients:
- continuity across steps,
- and repetition of local computation.
Those two ingredients are enough to make a row-oriented execution view start to feel plausible.
We are not at the proof yet. But we are already much closer to the right object.
If you only keep one line from this section, keep this one: the model is not doing a fresh act of prediction from nowhere each time. It is continuing a structured execution.
Attention as repeated work
The next thing we need to fix is attention.
Attention is often taught well enough for practical intuition, but badly for structural intuition.
The familiar explanation says something like:
- queries look at keys,
- scores are computed,
- softmax turns scores into weights,
- values are mixed,
- and the result becomes part of the next hidden state.
That explanation is not wrong. But people often hear it as a kind of branded miracle.
So for our purposes we want an even simpler question:
What repeated work is the block actually doing?
Once we answer that, attention becomes much easier to align with a trace-based execution view.
This is the place where beginner intuition often breaks. People are told that the model “attends” to earlier tokens, which is true, but vague. For this essay, vague is not good enough.
So we will use a simpler reading:
- the current step forms a query,
- the query is compared against carried keys,
- those comparisons become weights,
- and those weights decide how carried values contribute to the next state.
That is already enough structure to work with.
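The four bullets above can be written out directly. This is a toy single-query sketch over a carried key/value cache, not a production kernel: no heads, no masking, plain Python lists.

```python
import math

def attend(query, keys, values):
    """One decode step of attention: compare the current query against
    carried keys, normalize the scores, and mix carried values.
    Toy single-head sketch over plain lists."""
    d = len(query)
    # repeated local work: one score per carried key
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # numerically stable softmax turns scores into weights
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted mix of carried values becomes part of the next state
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

Every line is ordinary repeated work over explicit carried state, which is exactly the structural claim of this section.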
Attention as a kernel
Unpack attention into repeated views, scores, and weighted state updates
1. Project the current state into useful views
Attention begins with a structured split. The current state and carried context are projected into queries, keys, and values so that later comparison has something disciplined to work with.
Why this figure exists
Attention becomes easier to reason about once it is presented as a repeated score-and-mix procedure over explicit state, not as a mysterious branded block.
Why it matters for STARKs
Once the block is decomposed into repeated local work, it becomes more plausible to describe it as a trace with reusable transition logic.
First: create useful views of the current state
An attention block does not operate on the current state in a completely raw way.
It first reorganizes that state into forms that can play different roles. The classic names are query, key, and value.
You do not need to become attached to those names for the structural point to matter.
What matters is this:
The block begins by taking current information and preparing it for a disciplined family of comparisons and updates.
That is already a decomposable story. It means the block is not one indivisible blob. It is a sequence of sub-operations with distinct jobs.
Why does that matter?
Because proof design cares about decomposition. A computation that naturally exposes smaller recurring sub-kernels is very different from a computation that only makes sense as one fused opaque relation.
Second: compare the current step against carried context
Now the current step has to decide what parts of the earlier context matter.
Again, we can say this without mystifying it.
The current query is compared against carried keys. Those comparisons produce scores. Those scores tell us something about how strongly the current step should attend to different parts of the carried context.
The important point is not the exact formula. The important point is the structure.
The current step is doing repeated local work against an explicit memory surface.
That surface is not metaphorical. It is concrete carried state.
This is one of the places where the execution view becomes much cleaner than the black-box view.
The black-box view says: the model somehow uses context.
The execution view says: the current step is consulting an explicit carried surface using a repeated comparison rule.
That second description is exactly the kind of thing we know how to record step by step.
At beginner level, you can think of one query as asking a practical question: which earlier pieces of context matter most right now?
Third: normalize the scores
This is where the proof conversation starts to get more interesting.
If transformers were only dense linear algebra, the proving story would be important but flatter. A lot of the interesting pressure appears because attention is not only projection and comparison. It also contains operations like normalization that are more awkward from a proving point of view.
This is where it helps to be precise.
The point is not that these operations make proving impossible. The point is that they create meaningful differences between proof architectures.
They are part of the reason it is not enough to treat the whole model as one giant arithmetic slab.
There are places where the execution is smooth and dense. There are places where the execution is more naturally discussed in terms of lookup handling, normalization strategy, nonlinear approximation, or other implementation-sensitive machinery.
That pressure is real. But notice where it appears.
It appears inside a repeated stateful process. Not outside it.
That matters.
This is one reason transformer proving is more interesting than the slogan “neural net equals arithmetic” suggests. Yes, there is lots of arithmetic. But there are also pieces that create more friction, and they appear inside one repeated execution story.
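A small illustration of why this friction is real. A linear step can be rechecked by a polynomial identity over its inputs, while softmax involves `exp`, which trace-based systems typically approximate or handle with lookup-style machinery. Both checks below are illustrative, and the softmax check only covers necessary invariants, not the full operation.

```python
# Re-checking a linear step is a polynomial identity in the trace values.
def check_linear(W, x, out, tol=1e-9):
    recomputed = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return all(abs(a - b) <= tol for a, b in zip(recomputed, out))

# Softmax is not a polynomial identity: exp must be approximated or
# handled with lookups. Here we only check necessary invariants
# (nonnegative weights that sum to 1), which is far from the whole story.
def check_softmax_invariants(weights, tol=1e-9):
    return all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) <= tol
```

The asymmetry between the two checkers is the friction in miniature: one recomputes the operation exactly, the other can only circle it with partial facts.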
Attention matrix
The score matrix is a controlled lookup over carried context
1. One query compares itself to carried keys
The current step does not look back at a vague memory. It compares itself against explicit carried keys from earlier positions.
What to notice
The same few shapes keep reappearing: compare, weight, then mix prior state into the next boundary.
Why this helps
It makes attention look like repeated structured work instead of a single opaque block with a famous name.
Fourth: use those weights to mix values into the next state
Once the scores are normalized, they are used to mix carried values into the new hidden state.
Again, you can read this at two levels.
At a high level, it says: use the right parts of context.
At a more structural level, it says: compute a weighted summary from the explicit carried memory surface and fold it into the next state.
That second phrasing is more useful for us.
Why?
Because it makes the step look like a real computational transition instead of an act of model magic.
The block started from current state plus carried context. It built structured intermediates. It compared them. It normalized those comparisons. It mixed carried values into a new state.
That is a sequence. And sequences can be recorded.
Why a good visual of attention matters so much
Good attention diagrams do more than explain one model component. They remove the feeling that transformer computation is fundamentally opaque.
Once the reader can see:
- a current step,
- a carried memory surface,
- repeated comparisons,
- normalized weights,
- and an updated state,
then it becomes much easier to accept the next move.
Because the next move is simply:
what if we recorded those transitions systematically?
That question sounds strange only if attention still feels like magic.
Once attention looks like a recurring procedure over explicit memory, the move to traces becomes much easier. We are no longer trying to prove a cloud. We are trying to record a repeated family of transitions.
This is why “just matrix multiplication” is not enough
One common simplification is to say that a transformer is basically big matrix operations.
That is not useless, but it flattens too much.
The structure that matters here is more like:
- projection,
- comparison,
- normalization,
- weighted mixing,
- residual update,
- and repeated continuation into the next step.
The reason to preserve that structure is not aesthetic. It is because different proof architectures feel pressure in different places.
If we erase the distinctions too early, we also erase the reason this workload has a particular proving personality.
This essay is trying not to erase that personality.
That is why these figures spend time on intermediate structure. They are not decoration. They are there to stop the computation from collapsing back into one giant unnamed block.
From decode to trace
Now we are ready for the step that often sounds the most technical but is actually the most natural.
Once you have a repeated stateful computation, the obvious next question is:
How should we write it down?
A trace is one answer.
A trace is not magic notation. It is not a special object invented to make proofs feel intimidating.
It is a disciplined record of a computation as a sequence of states and local transitions.
Once decode is already being viewed as repeated state transition, the trace stops feeling like a foreign imposition. It becomes the natural ledger.
If the word “trace” feels heavy, replace it for a moment with something simpler: a careful execution log.
Not a casual debug log, but a disciplined record of the right states at the right moments so later checks can talk about continuity.
Step to trace
One decode step can be written as one row, and that row hands the next step its boundary
1. Start with one concrete decode step
Begin from one step, not the whole model. A token arrives, a carried boundary is inherited, and one local transition runs.
Why the row matters
The row is not extra ceremony. It is the explicit written form of one decode transition.
Why the boundary matters
The next step is only meaningful because the previous row produced a concrete boundary it can inherit.
Trace mapping
Once decode repeats, writing it down as rows stops looking artificial
1. Start with repeated decode steps
At the top, each card is one concrete decode step. The token changes, but the family of work stays recognizable.
What changes
The token and the exact state values change at every row.
What stays stable
The step shape stays stable enough that the same kind of row can be reused.
Think of the trace as a careful notebook
Suppose you had to convince someone that a long iterative computation really evolved the way you claimed.
One way would be to hand them one enormous final relation and ask them to trust that all intermediate consistency is encoded inside it.
Another way would be to show them a careful notebook:
- here is the state at step 0,
- here is what the rule says must happen,
- here is the state at step 1,
- here is the same rule applied again,
- and so on.
That notebook picture is much closer to what a trace is doing.
A trace makes continuity first-class.
This is also why the trace helps as a teaching device before it helps as a proof object. It makes the execution explicit.
Instead of hiding the computation inside one opaque object, it says:
- these are the states,
- these are the transitions,
- and these are the edges.
That is exactly the kind of bookkeeping we were gradually preparing for in the previous sections.
What goes into a row?
A row does not have to follow a naive “one token equals one row” layout. The exact granularity is an engineering decision.
But conceptually, a useful row-oriented representation wants to expose enough information to support three jobs:
- describe the incoming state for a local transition,
- expose whatever local intermediate structure must remain consistent,
- and describe the outgoing state that the next row depends on.
That is why row design matters.
It is not clerical work. It is the place where the computational object becomes explicit.
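Those three jobs can be made concrete with a toy trace builder. As before, `model_step` is a hypothetical stand-in for one forward pass, and the dictionary layout is illustrative, not a real trace format.

```python
def build_trace(model_step, prompt_state, n_steps):
    """Record decode as rows. Each row exposes the incoming state,
    the local result, and the outgoing state the next row depends on.
    Granularity is a design choice; one step per row is the simplest."""
    trace, state = [], prompt_state
    for _ in range(n_steps):
        state_in = state
        logits, state = model_step(state)
        token = max(range(len(logits)), key=logits.__getitem__)
        trace.append({"state_in": state_in, "token": token, "state_out": state})
    return trace
```

Nothing about the computation changed; we only wrote down what each step inherited and what it handed forward.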
Still, the reason row design is even believable here is that the workload is not arbitrary.
Because decode already has:
- repeated local structure,
- explicit carried state,
- and meaningful continuity,
it is plausible to imagine a row or bundle of rows carrying exactly the right information.
If those ingredients were absent, the row view would feel forced. Here it does not.
For a beginner, the key point is not the exact row layout. The key point is that there is a sensible row layout to be found. That already tells you something important about the workload.
Why the trace is more than a pretty diagram
At first glance, it can seem as though the trace is just a different presentation of the same underlying computation.
That is true in one sense and false in another.
It is true that the computation is not changed merely because we wrote it down differently.
But it is false that nothing important changes.
The trace changes what becomes explicit.
In the trace view, the following are no longer hidden inside one big object:
- where the computation begins,
- which local state is present at each step,
- what rule links one step to the next,
- and what public boundary the whole execution is claiming.
That is not cosmetic. That is an actual shift in what the proof system is allowed to talk about directly.
Why continuity matters as much as width
Transformers are wide models, so it is natural to focus on the sheer amount of arithmetic work.
That arithmetic matters.
But under autoregressive decode, continuity matters too.
A convincing execution story is not only “each local step was valid in isolation.” It is also “the next step really began from the state the previous one claimed to produce.”
That is why row-to-row continuity is so important.
It is the place where the abstract phrase “carried state” stops being conceptual and starts being enforceable.
This is another reason trace-native reasoning feels at home here.
The workload itself cares about continuity. So the proof representation caring about continuity is not a mismatch. It is a good sign.
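The continuity requirement is simple enough to state as a check over rows. The row layout here (dictionaries with `state_in` and `state_out` fields) is an illustrative choice, not a fixed format.

```python
def check_continuity(trace):
    # each row must begin from the state the previous row claimed to produce
    return all(trace[i + 1]["state_in"] == trace[i]["state_out"]
               for i in range(len(trace) - 1))
```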
Why the trace can be easier to understand than the circuit
People sometimes assume that traces are the more exotic viewpoint and circuits are the more intuitive one.
For some workloads, that may be true.
For decode, I think the opposite is often true.
If you start from the real execution story:
- repeated steps,
- explicit carried context,
- state moving forward one boundary at a time,
then a trace is simply the faithful way to write that story down.
What feels exotic is not the trace. What feels exotic is pretending the whole process should be understood only as one giant relation with all local continuity flattened away.
The trace is the bridge between two statements:
- the model keeps decoding,
- and the proof system can check a long repeated computation.
A beginner walkthrough of one traced step
At this point, it is useful to walk through one decode step slowly and say what the trace is really trying to capture.
Imagine one step in plain language:
- the model arrives at a new token position,
- it already has context from everything that came before,
- it runs the next local block of work,
- it produces logits for the current position,
- and it writes forward the next state.
Now say the same thing in trace language:
- row t contains the state that enters the step,
- the local rule explains which computations are allowed,
- row t + 1 contains the state that leaves the step,
- and continuity checks make sure the next row really starts from the right place.

Nothing mystical happened there.
We just took a familiar execution story and wrote it down more carefully.
That is why the trace matters so much in this essay. It is the place where the model stops being explained only as a predictor and starts being described as a machine with state, transitions, and boundaries.
Once you can tell that story for one step, telling it for many steps is no longer conceptually difficult. It is just repetition.
Why the constraint shape fits
Once the trace is in view, the next question becomes straightforward.
What does a STARK want from such a trace?
At a high level, it wants three things:
- a sequence of states,
- local rules explaining valid evolution,
- and explicit conditions at the edges.
That is why the fit argument is about constraint shape, not just raw size or brand-name comparisons.
At a beginner level, the constraint story can be reduced to two simple questions:
- did each step follow the rule it was supposed to follow?
- did the whole run start and end where it claimed to start and end?
The first question is the local one. The second question is the boundary one.
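Both questions can be phrased as one checker over a trace. `transition_ok` stands in for whatever local rule the rows are supposed to satisfy, and the row layout (`state_in`/`state_out` dictionaries) is illustrative.

```python
def verify_trace(trace, transition_ok, start_state, end_state):
    """Local question: did each step follow its rule and chain onto the
    previous row? Boundary question: did the run start and end where it
    claimed? `transition_ok` is a stand-in for the local rule."""
    if not trace:
        return start_state == end_state
    # boundary constraints at the edges
    if trace[0]["state_in"] != start_state or trace[-1]["state_out"] != end_state:
        return False
    # local transition constraints in the middle
    for i, row in enumerate(trace):
        if not transition_ok(row):
            return False
        if i > 0 and row["state_in"] != trace[i - 1]["state_out"]:
            return False
    return True
```

A real STARK replaces this direct re-execution with polynomial machinery, but the two questions it answers are the same two questions.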
Constraint anatomy
Local rules in the middle, boundary rules at the edges
This is the structural fit argument in one frame. STARK-style systems want a repeated local rule over many rows, plus explicit boundary conditions. Transformer decode already wants repeated local computation with explicit carried state.
Good fit
Long repeated computation, state continuity, explicit start/end conditions.
Not the claim
This does not prove every aspect of LLM behavior. It argues that a useful execution boundary has the right shape for trace-based proof systems.
Local transition constraints
The middle of the trace is where repeated local logic lives.
This is the part that says:
- if row t has this structure,
- and the local transition is valid,
- then row t + 1 must look like that.
This should sound familiar by now.
It is just a formal restatement of how we have already been talking about decode.
The same family of local work keeps being applied. The specific values change. The rule family stays recognizable.
That is exactly the kind of environment where reusable local constraints start to make sense.
Notice what we are not saying.
We are not saying every transformer implementation automatically becomes an elegant AIR by magic. We are not saying all engineering difficulties disappear.
We are saying something narrower and more defensible:
the computational object already has the sort of repeated local structure that makes local-rule reasoning plausible rather than forced.
That is the real claim.
This point matters because it is easy to oversell. We are not claiming that every transformer kernel drops into a final polished proving system with no design work. We are claiming that the workload gives us a promising starting shape.
Boundary constraints
Now consider the edges.
A repeated transition story needs a beginning and an end.
Where does the execution start? What prompt-conditioned state or frontier is the first step allowed to assume? What output boundary is the final step claiming to have reached?
These are boundary questions.
And again, they are not imported from outside the computation. They are already part of how we talk about execution informally.
We naturally say things like:
- start from this prompt and this current frontier,
- run the decode loop,
- end with this resulting output and this resulting frontier.
Boundary constraints are simply the disciplined version of that story.
They matter because they say exactly what the proof object is certifying.
Without clear boundaries, the statement itself becomes muddy. With clear boundaries, the proof object can remain narrow and meaningful.
Why the awkward parts matter
If transformer workloads were only giant dense arithmetic kernels, the structural story would still be interesting, but it would be easier to flatten.
The reason the discussion becomes richer is that the real workload contains pressure points:
- normalization-heavy operations,
- nonlinearities,
- lookup-sensitive pieces,
- and implementation decisions that actually change how natural different proof representations feel.
That matters because it reminds us that this is not a generic neural-net slogan.
The proving personality of the workload comes from a combination of things:
- repeated local steps,
- explicit continuity,
- dense arithmetic kernels,
- and harder sub-operations that must still be integrated into the same execution story.
That combination is the point.
If you are a beginner, keep your footing here. You do not need to memorize every hard sub-operation. You only need to notice that the model mixes easy-to-describe repeated work with harder pieces that still have to fit into one execution story.
Why we should see the shape before the counts
The paper goes further than this essay and supports the structural claim with symbolic comparisons and decomposition arguments. That matters.
But the order still matters.
You should not have to memorize a ratio before the computational object itself makes sense.
If the shape is wrong, no ratio can rescue the story. If the shape is right, the ratios become supporting evidence rather than the whole explanation.
Why this page stays visual
The paper can carry the counts, caveats, and symbolic decomposition. The essay has a different job: make the underlying computational shape clear enough that those later numbers stop feeling arbitrary.
So the refrain of this page keeps returning to the same few features:
- repeated steps,
- carried state,
- explicit boundaries,
- local continuity,
- and a trace that makes all of that visible.
That is the structural argument. The paper can then quantify it more formally.
From structure to artifacts
So far we have stayed at the level of computational shape. That is exactly where we should begin.
But eventually a fair question shows up:
Does this way of looking at execution produce real objects, or only nice explanations?
That is where the repository matters.
The reason this line of work is interesting is not only that the structural story is aesthetically satisfying. It is that the same story can be used to define runtime boundaries, execution traces, and proof-carrying artifacts.
This is the point where a fair beginner question shows up: why should I care about the repo if the visual argument already feels convincing?
The answer is that a visual argument becomes serious only when it starts producing objects that can be executed, checked, and discussed precisely.
The repo matters because it forces the argument to leave the world of analogy.
From structure to artifact
The idea becomes valuable when it turns into an actual proof surface.

1. Program: .tvm or another deterministic execution surface
2. Runtime: step-by-step decode with explicit carried state
3. Trace: rows and local checks over state transitions
4. Artifact: a claimable object with a public boundary
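The four stages can be sketched end to end. Every name here is a hypothetical stand-in, not the repository's actual API; the point is only the shape: a deterministic program, a runtime that carries state step by step, a trace of transitions, and an artifact that exposes only the public boundary.

```python
def run(program, prompt, n_steps):
    """Runtime: step-by-step decode with explicit carried state."""
    state, rows = prompt, []
    for _ in range(n_steps):
        next_state = program(state)       # Program: one deterministic step
        rows.append((state, next_state))  # Trace: record the transition
        state = next_state
    # Artifact: the claimable object with its public boundary.
    return {"trace": rows, "boundary": (prompt, state)}

toy_program = lambda s: s + (len(s),)     # deterministic toy "model"
artifact = run(toy_program, ("<s>",), 3)
print(artifact["boundary"])
```

The toy program is trivial on purpose: the structure of the artifact, not the model inside it, is what the four stages are about.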
Why this is not just analogy
A good analogy helps the reader see a shape. It does not automatically give you a system.
The reason to care about this work is that the execution view is not being left at the level of metaphor. It is being used to decide what the claimable object should be.
That is a much stronger standard.
Instead of saying:
- “transformers are kind of like a machine,”
we can ask:
- can one decode step be represented as a deterministic transition with a meaningful state boundary?
- can the boundary between steps be made explicit enough to carry forward?
- can later artifacts preserve that meaning instead of erasing it?
Those are system questions, not branding questions.
That distinction matters because AI writing often becomes vague exactly where it should become concrete. The better question is never “can we describe the model in a clever way?” The better question is “what boundary can we actually implement, record, and later certify?”
What a step-level object has to say
If you want a decode step to become a real artifact, it has to make a few things explicit.
At minimum, a useful step-level object wants to say:
- what state entered the step,
- what transition family was applied,
- what state left the step,
- and what carried boundary information the next step is allowed to inherit.
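One possible shape for such an object, mirroring the four items above, is sketched below. The field names are illustrative, not the repository's schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepRecord:
    state_in: tuple   # what state entered the step
    rule_id: str      # which transition family was applied
    state_out: tuple  # what state left the step
    carried: tuple    # boundary information the next step may inherit

step = StepRecord(
    state_in=("<s>", "the"),
    rule_id="decode_step",
    state_out=("<s>", "the", "cat"),
    carried=("<s>", "the", "cat"),
)
print(step.rule_id)  # decode_step
```

The record is deliberately frozen: a step-level object is a claim about what happened, and claims should not be mutable after the fact.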
Now notice how much of the essay was quietly preparing for exactly that.
We spent so much time on:
- state in,
- state out,
- carried frontier,
- repeated local work,
- and boundary discipline,
because those are the ingredients that a later artifact surface must preserve.
If the boundary is muddy, any later packaging layer will also be muddy. If the boundary is sharp, later layers have a stable meaning to inherit.
Why paper 1 should stay narrow
This is also why the current paper should remain disciplined.
The right goal for paper 1 is not to claim every future proof surface at once. The right goal is to show that the base execution object has the right shape.
If one step can be expressed as a meaningful deterministic transition with an explicit carried boundary, then later work has something real to build on.
That is a healthy research posture.
It avoids two common mistakes:
- saying too much too early,
- or refusing to say anything until every future layer is complete.
The right middle ground is:
- choose a narrow statement,
- make its boundary explicit,
- make the step object believable,
- and let later composition inherit that object rather than replacing it.
For beginners, this may be the cleanest summary of the engineering posture: start small enough that the statement is clear, but not so small that it becomes meaningless.
Why later composition depends on present discipline
Once a step has a reusable boundary, you can begin to imagine richer packaging.
You can imagine chains of steps. You can imagine segments. You can imagine larger aggregation structures that preserve the meaning of the base object rather than flattening it away.
That does not give you full recursive compression for free. It does not mean every stronger accumulation surface is already solved.
What it does give you is something more important:
it gives later work a credible starting point.
If the base object were chosen badly, later layers would only hide the conceptual problem. If the base object is chosen well, later layers have a real foundation.
That is why the structural fit argument matters. It is not the end of the story. It is what makes the rest of the story worth attempting.
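Why boundary discipline enables composition can be shown in a few lines. In this illustrative sketch, a chain of step objects is coherent only if each step's carried boundary is exactly the state the next step starts from; the field names are hypothetical.

```python
def chain_ok(steps):
    """A chain composes only if each carried boundary is inherited
    verbatim as the next step's input state."""
    return all(
        prev["carried"] == nxt["state_in"]
        for prev, nxt in zip(steps, steps[1:])
    )

steps = [
    {"state_in": ("a",), "carried": ("a", "b")},
    {"state_in": ("a", "b"), "carried": ("a", "b", "c")},
]
print(chain_ok(steps))  # True
```

If the base object's boundary were muddy, this check would have nothing precise to compare, which is the code-level version of the point above: later layers inherit whatever discipline the base object has, and nothing more.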
What a beginner should now be able to say
If the essay has done its job, a beginner should now be able to say something like this:
“A transformer under decode is not just spitting out isolated tokens. It is carrying state forward step by step. Attention is part of that repeated local work. A trace is a careful way of writing those steps down. A STARK is interesting here because it likes long repeated computations with local checks and clear boundaries.”
That is already a real understanding.
It is not the full paper. It is not the full engineering story. But it is enough to keep the rest of the work from sounding arbitrary.
And that matters, because a lot of confusion in this area comes from trying to jump straight from a large model to a large proof without ever stopping to ask what the computation looks like in the middle.
This page has tried to stay in the middle long enough for that structure to become visible.
What this essay does and does not claim
This page should end by staying narrow.
That is not a weakness. It is a discipline.
The fastest way to make this whole topic useless is to jump from one clean structural observation to a giant slogan about proving all of AI.
That is not what this essay is trying to do.
If anything, the value of this page is that it tries to make a smaller claim feel solid instead of making a larger claim feel exciting.
What it does claim
It does claim that:
- transformer decode has repeated state-transition structure,
- attention can be understood as repeated local work over explicit carried context,
- a trace-based representation is a plausible and natural way to record that execution,
- and STARK-style reasoning becomes interesting precisely because it talks in terms of local transitions, boundaries, and continuity.
Those are meaningful claims.
They are strong enough to motivate engineering work. They are narrow enough to defend.
What it does not claim
It does not claim that:
- every property of model behavior is suddenly proved,
- every implementation difficulty is already solved,
- arbitrary nondeterministic behavior disappears,
- or one visual story settles all questions about practical proving performance.
Those distinctions matter.
The value of this line of work is not that it lets us say bigger things. The value is that it lets us say smaller things more clearly.
That is also why the beginner version of the takeaway is simple:
- read decode as a repeated process,
- read attention as structured local work,
- read the trace as the record of that process,
- and read the proof question as a question about whether those transitions and boundaries were valid.
The takeaway
If you remember only one idea from this page, let it be this.
Transformers are not interesting for proving merely because they are large neural networks. They are interesting because, under decode, they expose a specific computational shape.
They:
- repeat structure,
- carry state,
- evolve a frontier step by step,
- admit meaningful boundaries,
- and combine dense local kernels with harder sub-operations inside one coherent execution story.
When a proof system’s native objects line up with those features, the fit begins to feel less imposed and more natural.
That is the paper-1 argument in its most visual form.
The paper gives the argument formal support. The repository turns it into engineering artifacts. This essay is doing an earlier job:
make the shape clear enough that the rest of the work stops sounding arbitrary.