blog.espejel.lol

Where Transformer Proof Pressure Appears

The interesting zkML question is not only matrix multiplication. It is where repeated non-arithmetic work appears.

Dense arithmetic is not the whole zkML story. The interesting pressure appears in repeated softmax, normalization, range, and table-like work.

Omar Espejel / 2026-04-30 / Updated 2026-04-30 / 14 min / transformers / zkml / proof systems / proof pressure

Part 2

Once decode is visible as a trace, the next question is where the proof work concentrates.

Part 1 argued that transformer decode has the right shape for trace-based proving.

But shape is only the first step.

A proof system still has to pay for the work inside each step.

Where the pressure appears

At this point, it is tempting to say:

so transformers are just repeated arithmetic.

That is not quite right.

The workload is more interesting than that.

There is a lot of dense arithmetic, yes.

But there are also awkward pieces:

  • normalization,
  • lookup-sensitive work,
  • nonlinearities,
  • range and quantization behavior,
  • and implementation choices that change how naturally different proof architectures fit the execution.

That awkwardness is not a side issue.

It is part of the workload’s personality.

And it appears inside the same repeated stateful execution we have been following all along.

The proving story is not:

  • easy arithmetic here, weird stuff somewhere else.

It is:

  • one coherent execution story containing both easy repeated structure and harder local sub-operations.

That is what makes the problem interesting.

The symbolic counts and decompositions matter later.

But the counts only make sense once the shape is visible.

Why the shape has to come before the numbers

At some point, of course, we want numbers.

We want to know:

  • how many symbolic rows or constraints appear,
  • how the gap behaves as context grows,
  • where softmax-like work starts to matter more,
  • and how sparse schedules change the picture.

That is formal work.

But those numbers are only useful after the computational object has become clear.

If the shape is wrong, no ratio will rescue the argument.

If the shape is right, the ratios become evidence instead of mystery.

That is why this essay stays structural for so long.

The point is not to avoid quantitative work.

The point is to make the quantitative work belong to something the reader can already see.

For example, the worked symbolic model behind this essay gives GPT-2 small about 157.8B symbolic SNARK constraints versus 106.5B symbolic STARK rows across 12 layers:

\frac{157.8\,\mathrm{B}}{106.5\,\mathrm{B}} \approx 1.48\times

That is about 1.48× more symbolic work on the SNARK side under those assumptions.

That number is not the essay.

It is a checkpoint that only becomes meaningful after the shape is clear.

What the numbers now add

Now that the shape is visible, we can add the quantitative part without making it feel like magic.

These numbers do not count all of machine learning.

The model does something narrower.

It takes a standard transformer block and separates the work into two buckets.

The first bucket is dense arithmetic:

  • projections,
  • attention score products,
  • value aggregation,
  • output projection,
  • and the feedforward network.

Those are the parts that already look like arithmetic.

Both SNARK-style and STARK-style systems have to deal with them.

So the model does not pretend that dense matrix multiplication disappears.

It keeps that arithmetic base shared.

The second bucket is the awkward part:

  • softmax,
  • normalization,
  • activation functions,
  • range and table-like behavior.

Those are the parts where proof systems start to differ more sharply.

That is why the numbers matter.

They are not trying to say:

this exact prover will run this exact model in this exact number of seconds.

They are trying to say:

if dense arithmetic is the shared base, where does the proof pressure move?

That is a better question for this essay.

It connects directly to the structure we have already been building.

The same repeated decode loop that carries state also keeps reusing non-arithmetic machinery.

If that machinery can be handled through lookup-friendly, trace-friendly surfaces, then the repeated structure starts to pay off, because the same machinery is amortized across every decode step.

This is where nearby zkML systems are useful context.

zkLLM treats attention and non-arithmetic tensor operations as first-class proving problems.

Jolt Atlas comes from the SNARK side and also leans into lookup-centric tensor operations.

Those examples are useful because they point at the same pressure zone from different proving traditions: dense arithmetic is not the whole story.

[Figure: Proof pressure inside repeated decode. Each decode step has a shared dense-arithmetic part everyone pays (blue) and a repeated non-arithmetic part — softmax, norm, range, tables — where proof design matters (orange). Use the numbers as a pressure map, not as a runtime benchmark.]

Proof pressure: Use this as a map. Blue is shared arithmetic; orange is repeated non-arithmetic work where the proving architecture starts to matter.

Keep that picture in mind.

The equations below are only one way of putting numbers on that picture.

The worked example

The clean worked example is GPT-2 small.

The parameters are:

  • model dimension d = 768,
  • context length T = 1024,
  • heads H = 12,
  • layers L = 12.

The stylized non-arithmetic constants are:

  • C_{\mathrm{exp}} = 300 for softmax-like exponential work,
  • C_{\mathrm{norm}} = 30 for normalization,
  • C_{\mathrm{nonlin}} = 150 for GELU-like nonlinear work.

Those constants are not measured runtime constants.

They are knobs in a symbolic model.

The point is to expose sensitivity.

The model keeps the arithmetic part exact:

A(T,d) = 12Td^2 + 2T^2d + 2Td

Then it separates the non-arithmetic part:

N(T,d,H) = T^2H + 6Td

On the SNARK side, the non-arithmetic terms are multiplied by the symbolic constants:

\begin{aligned} C_{\mathrm{SNARK}} ={}& 12Td^2 + 2T^2d + 2Td \\ &+ T^2H\,C_{\mathrm{exp}} + 2Td\,C_{\mathrm{norm}} + 4Td\,C_{\mathrm{nonlin}} \end{aligned}

On the STARK side, the model keeps the lookup-like terms as rows:

L_{\mathrm{STARK}} = 12Td^2 + 2T^2d + T^2H + 8Td
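Taking these formulas at face value, a short script can reproduce the per-layer counts. This is a sketch of the essay's symbolic model; the function names are mine, not from any particular prover.

```python
# Symbolic per-layer counts for the GPT-2-small setting above.
# These are symbolic constraint/row counts, not runtime estimates.

def snark_constraints(T, d, H, C_exp=300, C_norm=30, C_nonlin=150):
    # Shared dense arithmetic plus non-arithmetic terms multiplied
    # by the stylized constants.
    arithmetic = 12 * T * d**2 + 2 * T**2 * d + 2 * T * d
    non_arithmetic = (T**2 * H * C_exp
                      + 2 * T * d * C_norm
                      + 4 * T * d * C_nonlin)
    return arithmetic + non_arithmetic

def stark_rows(T, d, H):
    # Lookup-like terms stay as rows instead of being multiplied out.
    return 12 * T * d**2 + 2 * T**2 * d + T**2 * H + 8 * T * d

T, d, H, L = 1024, 768, 12, 12
c = snark_constraints(T, d, H)
r = stark_rows(T, d, H)
print(f"per layer: {c/1e9:.2f}B constraints vs {r/1e9:.2f}B rows")
print(f"{L} layers: {L*c/1e9:.1f}B vs {L*r/1e9:.1f}B, ratio {c/r:.2f}x")
```

Running it recovers the 13.15B versus 8.88B per-layer split and the 1.48× headline ratio.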

That gives the headline checkpoint:

Quantity                 Symbolic SNARK side    Symbolic STARK side
Total per layer          13.15B constraints     8.88B rows
Total across 12 layers   157.8B constraints     106.5B rows
Ratio                    1.48×                  baseline
[Figure: Symbolic model results for GPT-2 small. The dense arithmetic base is shared; the pressure appears in repeated non-arithmetic work.]

component           symbolic SNARK side   symbolic STARK side   meaning
dense arithmetic    8.86B / layer         8.86B / layer         everyone pays this
softmax-like work   3.77B / layer         12.6M / layer         87.9% of modeled SNARK overhead
norm + activation   519.0M / layer        4.7M / layer          lookup-shaped pressure
total, 12 layers    157.8B constraints    106.5B rows (1.48×)   same arithmetic base, different non-arithmetic pressure

These are symbolic counts, not wall-clock speed claims.

Symbolic result: The worked example is not a runtime benchmark. It shows where the proof pressure moves once dense arithmetic is no longer the only story.

Read that carefully.

It does not say every STARK prover is automatically 1.48× faster.

It says that under this symbolic model, the SNARK-side relation is doing about 1.48× more symbolic work for this GPT-2-small setting.

That is a structural pressure signal, not a wall-clock benchmark.

Why softmax matters so much

The most important detail is not only the final ratio.

The important detail is where the extra work comes from.

In this model, softmax alone contributes about 87.9% of the SNARK-side non-arithmetic overhead.

That is why the essay spent time on attention.

The attention block is where the model compares the current state against carried context.

Then it normalizes those scores.

That normalization is not an isolated nuisance.

It sits inside the repeated stateful loop.

So if a proof system has a cleaner way to represent repeated lookup-like work, the advantage is not a one-off trick.

It can compound with the shape of decode.

This is also why the result should not be read as insensitive to assumptions.

It is sensitive to how softmax-like work is represented.

If a SNARK-side system uses a very good specialized lookup or approximation strategy, the gap narrows.

That is not a problem for the argument.

It is one of the main points:

the proof architecture should be judged by how naturally it handles the repeated non-arithmetic machinery, not only by how it handles dense linear algebra.

What happens as context grows

The GPT-2-small example at T = 1024 gives 1.48×.

If the same model is scaled to T = 4096, the symbolic ratio rises to about 2.13×.

That makes intuitive sense.

Longer context means more attention work.

More attention work means more places where softmax-like handling matters.

But the ratio does not grow forever.

The dense GPT-style ratio R(T) = C_{\mathrm{SNARK}} / L_{\mathrm{STARK}} approaches a finite ceiling, because the T^2 terms dominate on both sides:

\begin{aligned} \lim_{T \to \infty} R(T) &= \frac{2d + H C_{\mathrm{exp}}}{2d + H} \\ &= \frac{2d_h + C_{\mathrm{exp}}}{2d_h + 1} \end{aligned}

For GPT-2 small, the per-head dimension is d_h = d/H = 64.

Under C_{\mathrm{exp}} = 300, that ceiling is about 3.32×.
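The saturation is easy to see by evaluating the same per-layer formulas over a range of context lengths. This is a sketch of the symbolic model's behavior, not a prover measurement.

```python
# Symbolic SNARK/STARK ratio as a function of context length T,
# using the same per-layer formulas as the worked example.

def ratio(T, d=768, H=12, C_exp=300, C_norm=30, C_nonlin=150):
    snark = (12 * T * d**2 + 2 * T**2 * d + 2 * T * d
             + T**2 * H * C_exp + 2 * T * d * C_norm + 4 * T * d * C_nonlin)
    stark = 12 * T * d**2 + 2 * T**2 * d + T**2 * H + 8 * T * d
    return snark / stark

for T in (1024, 4096, 16384, 131072):
    print(T, round(ratio(T), 2))

# Only the T^2 terms survive in the limit, giving a finite ceiling:
d, H, C_exp = 768, 12, 300
ceiling = (2 * d + H * C_exp) / (2 * d + H)
print("ceiling:", round(ceiling, 2))  # about 3.32
```

The printed values rise from 1.48 at T = 1024 through 2.13 at T = 4096 and then flatten toward the 3.32 ceiling.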

That is a useful result because it prevents the argument from becoming a vague “longer is always infinitely better” story.

The claim is more disciplined:

the ratio rises over practical context ranges and then saturates at a finite architecture-dependent ceiling.

That is exactly the kind of claim we can defend without pretending the number grows forever.

What changes for wider or sparse models

GPT-2 small is a clean teaching example.

It is not the whole world.

Wider models change the balance.

In a wider Llama-style dense reference, the dense arithmetic base is much larger.

That can keep the symbolic ratio closer to parity at shorter contexts, especially under lower softmax constants.

That matters because it prevents cherry-picking.

The stronger version of the argument does not hide that case.

It says:

wide production-style dense models can remain near parity at shorter contexts under lower softmax constants, while still widening materially at longer windows.

Sparse long-context models also change the picture.

A Gemma-style local/global sparse schedule reduces how much attention work grows at some positions.

That tempers the gap.

But if there is still a positive global-attention fraction, the same kind of finite long-context ceiling remains.

The sparse schedule delays the approach to the ceiling.

It does not erase the structural point.
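One minimal way to sketch the sparse case is to replace the full T^2 attention terms with a global fraction g of positions plus a local window W. Both g and W are my illustrative knobs, not Gemma's actual schedule, and the split is a simplification of how real local/global layers interleave.

```python
# Hedged sketch: a local/global sparse schedule replaces the full T^2
# attention "area" with g*T^2 (global fraction) + (1-g)*T*W (local window).
# g and W are illustrative knobs, not any production model's schedule.

def sparse_ratio(T, d=768, H=12, C_exp=300, C_norm=30, C_nonlin=150,
                 g=0.25, W=512):
    attn = g * T**2 + (1 - g) * T * W  # effective attention area
    snark = (12 * T * d**2 + 2 * attn * d + 2 * T * d
             + attn * H * C_exp + 2 * T * d * C_norm + 4 * T * d * C_nonlin)
    stark = 12 * T * d**2 + 2 * attn * d + attn * H + 8 * T * d
    return snark / stark

# Sparse attention tempers the gap at each T, but for any g > 0 the
# g*T^2 terms eventually dominate, so the same finite ceiling remains.
for T in (1024, 4096, 16384):
    print(T, round(sparse_ratio(T), 2))
```

Under these knobs the ratio at each T sits below the dense case, but it still climbs with context and still saturates below the same architecture-dependent ceiling.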

This is the right level of honesty.

The point is not:

STARKs win every setting by a huge constant.

The point is:

transformer workloads concentrate important proving pressure in repeated non-arithmetic machinery, and trace-native systems are structurally well positioned for that pressure.

Next

The pressure map explains where proving work appears.

But a proof that verifies is still not enough if the surrounding claim can be relabeled.

Continue to Part 3: Proof Validity Is Not Statement Validity.