blog.espejel.lol

Where Transformer Proof Pressure Appears

The interesting zkML question is not only matrix multiplication. It is where repeated non-arithmetic work appears.

Dense arithmetic is not the whole zkML story. The interesting pressure appears in repeated softmax, normalization, range, and table-like work.

Omar Espejel / 2026-04-30 / Updated 2026-04-30 / 14 min / transformers / zkml / proof systems / proof pressure

Part 2

Once decode is visible as a trace, the next question is where the proof work concentrates.

Part 1 argued that transformer decode has the right shape for trace-based proving.

But shape is only the first step.

A proof system still has to pay for the work inside each step.

Where the pressure appears

At this point, it is tempting to say:

so transformers are just repeated arithmetic.

That is not quite right.

The workload is more interesting than that.

There is a lot of dense arithmetic, yes.

But there are also awkward pieces:

  • normalization,
  • lookup-sensitive work,
  • nonlinearities,
  • range and quantization behavior,
  • and implementation choices that change how naturally different proof architectures fit the execution.

That awkwardness is not a side issue.

It is part of the workload’s personality.

And it appears inside the same repeated stateful execution we have been following all along.

The proving story is not:

  • easy arithmetic here, weird stuff somewhere else.

It is:

  • one coherent execution story containing both easy repeated structure and harder local sub-operations.

That is what makes the problem interesting.

The symbolic counts and decompositions matter later.

But the counts only make sense once the shape is visible.

Why the shape has to come before the numbers

At some point, of course, we want numbers.

We want to know:

  • how many symbolic rows or constraints appear,
  • how the gap behaves as context grows,
  • where softmax-like work starts to matter more,
  • and how sparse schedules change the picture.

That is formal work.

But those numbers are only useful after the computational object has become clear.

If the shape is wrong, no ratio will rescue the argument.

If the shape is right, the ratios become evidence instead of mystery.

That is why this essay stays structural for so long.

The point is not to avoid quantitative work.

The point is to make the quantitative work belong to something the reader can already see.

For example, the worked symbolic model behind this essay gives GPT-2 small about 157.8B symbolic SNARK constraints versus 106.5B symbolic STARK rows across 12 layers:

\frac{157.8\,\mathrm{B}}{106.5\,\mathrm{B}} \approx 1.48\times

That is about 1.48× more symbolic work on the SNARK side under those assumptions.

That number is not the essay.

It is a checkpoint that only becomes meaningful after the shape is clear.

What the numbers now add

Now that the shape is visible, we can add the quantitative part without making it feel like magic.

These numbers do not count all of machine learning.

The model does something narrower.

It takes a standard transformer block and separates the work into two buckets.

The first bucket is dense arithmetic:

  • projections,
  • attention score products,
  • value aggregation,
  • output projection,
  • and the feedforward network.

Those are the parts that already look like arithmetic.

Both SNARK-style and STARK-style systems have to deal with them.

So the model does not pretend that dense matrix multiplication disappears.

It keeps that arithmetic base shared.

The second bucket is the awkward part:

  • softmax,
  • normalization,
  • activation functions,
  • range and table-like behavior.

Those are the parts where proof systems start to differ more sharply.

That is why the numbers matter.

They are not trying to say:

this exact prover will run this exact model in this exact number of seconds.

They are trying to say:

if dense arithmetic is the shared base, where does the proof pressure move?

That is a better question for this essay.

It connects directly to the structure we have already been building.

The same repeated decode loop that carries state also keeps reusing non-arithmetic machinery.

If that machinery can be handled through lookup-friendly, trace-friendly surfaces, then the repeated structure starts to pay off, because the same machinery is amortized across every decode step.

This is where nearby zkML systems are useful context.

zkLLM treats attention and non-arithmetic tensor operations as first-class proving problems.

Jolt Atlas comes from the SNARK side and also leans into lookup-centric tensor operations.

Those examples are useful because they point at the same pressure zone from different proving traditions: dense arithmetic is not the whole story.

[Figure: Proof pressure inside repeated decode. Each decode step has a shared dense-arithmetic part everyone pays (blue) and a repeated non-arithmetic part — softmax, norm, range, tables — where proof design matters (orange). Use the numbers as a pressure map, not as a runtime benchmark.]

Proof pressure: Use this as a map. Blue is shared arithmetic; orange is repeated non-arithmetic work where the proving architecture starts to matter.

Keep that picture in mind.

The equations below are only one way of putting numbers on that picture.

The worked example

The clean worked example is GPT-2 small.

The parameters are:

  • model dimension d = 768,
  • context length T = 1024,
  • heads H = 12,
  • layers L = 12.

The stylized non-arithmetic constants are:

  • C_{\mathrm{exp}} = 300 for softmax-like exponential work,
  • C_{\mathrm{norm}} = 30 for normalization,
  • C_{\mathrm{nonlin}} = 150 for GELU-like nonlinear work.

Those constants are not measured runtime constants.

They are knobs in a symbolic model.

The point is to expose sensitivity.

The model keeps the arithmetic part exact:

A(T,d) = 12Td^2 + 2T^2d + 2Td

Then it separates the non-arithmetic part:

N(T,d,H) = T^2H + 6Td

On the SNARK side, the non-arithmetic terms are multiplied by the symbolic constants:

\begin{aligned} C_{\mathrm{SNARK}} ={}& 12Td^2 + 2T^2d + 2Td \\ &+ T^2H\,C_{\mathrm{exp}} + 2Td\,C_{\mathrm{norm}} + 4Td\,C_{\mathrm{nonlin}} \end{aligned}

On the STARK side, the model keeps the lookup-like terms as rows:

L_{\mathrm{STARK}} = 12Td^2 + 2T^2d + T^2H + 8Td
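Taking these formulas at face value, a short script can reproduce the per-layer counts. This is a sketch of the essay's symbolic model; the function names are mine, not from any particular prover.

```python
# Symbolic per-layer counts for the GPT-2-small setting above.
# These are symbolic constraint/row counts, not runtime estimates.

def snark_constraints(T, d, H, C_exp=300, C_norm=30, C_nonlin=150):
    # Shared dense arithmetic plus non-arithmetic terms multiplied
    # by the stylized constants.
    arithmetic = 12 * T * d**2 + 2 * T**2 * d + 2 * T * d
    non_arithmetic = (T**2 * H * C_exp
                      + 2 * T * d * C_norm
                      + 4 * T * d * C_nonlin)
    return arithmetic + non_arithmetic

def stark_rows(T, d, H):
    # Lookup-like terms stay as rows instead of being multiplied out.
    return 12 * T * d**2 + 2 * T**2 * d + T**2 * H + 8 * T * d

T, d, H, L = 1024, 768, 12, 12
c = snark_constraints(T, d, H)
r = stark_rows(T, d, H)
print(f"per layer: {c/1e9:.2f}B constraints vs {r/1e9:.2f}B rows")
print(f"{L} layers: {L*c/1e9:.1f}B vs {L*r/1e9:.1f}B, ratio {c/r:.2f}x")
```

Running it recovers the 13.15B versus 8.88B per-layer split and the 1.48× headline ratio.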

That gives the headline checkpoint:

Quantity                 Symbolic SNARK side    Symbolic STARK side
Total per layer          13.15B constraints     8.88B rows
Total across 12 layers   157.8B constraints     106.5B rows
Ratio                    1.48×                  baseline
[Figure: Symbolic model results for GPT-2 small. The dense arithmetic base is shared; the pressure appears in repeated non-arithmetic work.]

component           symbolic SNARK side   symbolic STARK side   meaning
dense arithmetic    8.86B / layer         8.86B / layer         everyone pays this
softmax-like work   3.77B / layer         12.6M / layer         87.9% of modeled SNARK overhead
norm + activation   519.0M / layer        4.7M / layer          lookup-shaped pressure
total, 12 layers    157.8B constraints    106.5B rows (1.48×)   same arithmetic base, different non-arithmetic pressure

These are symbolic counts, not wall-clock speed claims.

Symbolic result: The worked example is not a runtime benchmark. It shows where the proof pressure moves once dense arithmetic is no longer the only story.

Read that carefully.

It does not say every STARK prover is automatically 1.48× faster.

It says that under this symbolic model, the SNARK-side relation is doing about 1.48× more symbolic work for this GPT-2-small setting.

That is a structural pressure signal, not a wall-clock benchmark.

Why softmax matters so much

The most important detail is not only the final ratio.

The important detail is where the extra work comes from.

In this model, softmax alone contributes about 87.9% of the SNARK-side non-arithmetic overhead.

That is why the essay spent time on attention.

The attention block is where the model compares the current state against carried context.

Then it normalizes those scores.

That normalization is not an isolated nuisance.

It sits inside the repeated stateful loop.

So if a proof system has a cleaner way to represent repeated lookup-like work, the advantage is not a one-off trick.

It can compound with the shape of decode.

This is also why the result should not be read as insensitive to assumptions.

It is sensitive to how softmax-like work is represented.

If a SNARK-side system uses a very good specialized lookup or approximation strategy, the gap narrows.

That is not a problem for the argument.

It is one of the main points:

the proof architecture should be judged by how naturally it handles the repeated non-arithmetic machinery, not only by how it handles dense linear algebra.

What happens as context grows

The GPT-2-small example at T = 1024 gives 1.48×.

If the same model is scaled to T = 4096, the symbolic ratio rises to about 2.13×.

That makes intuitive sense.

Longer context means more attention work.

More attention work means more places where softmax-like handling matters.

But the ratio does not grow forever.

The dense GPT-style ratio R(T) = C_{\mathrm{SNARK}} / L_{\mathrm{STARK}} approaches a finite ceiling, because the T^2 terms dominate on both sides:

\begin{aligned} \lim_{T \to \infty} R(T) &= \frac{2d + H C_{\mathrm{exp}}}{2d + H} \\ &= \frac{2d_h + C_{\mathrm{exp}}}{2d_h + 1} \end{aligned}

For GPT-2 small, the per-head dimension is d_h = d/H = 64.

Under C_{\mathrm{exp}} = 300, that ceiling is about 3.32×.
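The saturation is easy to see by evaluating the same per-layer formulas over a range of context lengths. This is a sketch of the symbolic model's behavior, not a prover measurement.

```python
# Symbolic SNARK/STARK ratio as a function of context length T,
# using the same per-layer formulas as the worked example.

def ratio(T, d=768, H=12, C_exp=300, C_norm=30, C_nonlin=150):
    snark = (12 * T * d**2 + 2 * T**2 * d + 2 * T * d
             + T**2 * H * C_exp + 2 * T * d * C_norm + 4 * T * d * C_nonlin)
    stark = 12 * T * d**2 + 2 * T**2 * d + T**2 * H + 8 * T * d
    return snark / stark

for T in (1024, 4096, 16384, 131072):
    print(T, round(ratio(T), 2))

# Only the T^2 terms survive in the limit, giving a finite ceiling:
d, H, C_exp = 768, 12, 300
ceiling = (2 * d + H * C_exp) / (2 * d + H)
print("ceiling:", round(ceiling, 2))  # about 3.32
```

The printed values rise from 1.48 at T = 1024 through 2.13 at T = 4096 and then flatten toward the 3.32 ceiling.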

That is a useful result because it prevents the argument from becoming a vague “longer is always infinitely better” story.

The claim is more disciplined:

the ratio rises over practical context ranges and then saturates at a finite architecture-dependent ceiling.

That is exactly the kind of claim we can defend without pretending the number grows forever.

What changes for wider or sparse models

GPT-2 small is a clean teaching example.

It is not the whole world.

Wider models change the balance.

In a wider Llama-style dense reference, the dense arithmetic base is much larger.

That can keep the symbolic ratio closer to parity at shorter contexts, especially under lower softmax constants.

That matters because it prevents cherry-picking.

The stronger version of the argument does not hide that case.

It says:

wide production-style dense models can remain near parity at shorter contexts under lower softmax constants, while still widening materially at longer windows.

Sparse long-context models also change the picture.

A Gemma-style local/global sparse schedule reduces how much attention work grows at some positions.

That tempers the gap.

But if there is still a positive global-attention fraction, the same kind of finite long-context ceiling remains.

The sparse schedule delays the approach to the ceiling.

It does not erase the structural point.
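One minimal way to sketch the sparse case is to replace the full T^2 attention terms with a global fraction g of positions plus a local window W. Both g and W are my illustrative knobs, not Gemma's actual schedule, and the split is a simplification of how real local/global layers interleave.

```python
# Hedged sketch: a local/global sparse schedule replaces the full T^2
# attention "area" with g*T^2 (global fraction) + (1-g)*T*W (local window).
# g and W are illustrative knobs, not any production model's schedule.

def sparse_ratio(T, d=768, H=12, C_exp=300, C_norm=30, C_nonlin=150,
                 g=0.25, W=512):
    attn = g * T**2 + (1 - g) * T * W  # effective attention area
    snark = (12 * T * d**2 + 2 * attn * d + 2 * T * d
             + attn * H * C_exp + 2 * T * d * C_norm + 4 * T * d * C_nonlin)
    stark = 12 * T * d**2 + 2 * attn * d + attn * H + 8 * T * d
    return snark / stark

# Sparse attention tempers the gap at each T, but for any g > 0 the
# g*T^2 terms eventually dominate, so the same finite ceiling remains.
for T in (1024, 4096, 16384):
    print(T, round(sparse_ratio(T), 2))
```

Under these knobs the ratio at each T sits below the dense case, but it still climbs with context and still saturates below the same architecture-dependent ceiling.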

This is the right level of honesty.

The point is not:

STARKs win every setting by a huge constant.

The point is:

transformer workloads concentrate important proving pressure in repeated non-arithmetic machinery, and trace-native systems are structurally well positioned for that pressure.

Next

The pressure map explains where proving work appears.

But a proof that verifies is still not enough if the surrounding claim can be relabeled.

Continue to Part 3: Proof Validity Is Not Statement Validity.