Zilkworm in the ERE Benchmark

The ERE zkEVM benchmark workload offers a neutral, multi-client framework for measuring the correctness and performance of zkEVM implementations. We recently added support for Zilkworm and ran it head-to-head against Reth on a subset of the ERE benchmark suite.

This post presents the results. It covers what we integrated, how the comparison measurements were performed, and — without cherry-picking — where Zilkworm wins, where it falls short, and how you can reproduce the comparison yourself.

Background: a C++ EVM inside a zkVM

Most zkEVM projects are built around Rust execution engines. Zilkworm takes a different approach: the guest program that runs inside the zkVM is evmone, a battle-tested production-grade C++ EVM interpreter, cross-compiled to a bare-metal RISC-V ELF. Rust is still involved, but only as an orchestration layer around proving and verification.

Implementing a zkEVM involves a wide range of correctness challenges. We previously wrote about some of the harder parts, particularly around correctly verifying the execution witnesses.

Getting all the functional details right while maintaining strong performance is critical, but it's difficult to communicate how many trade-offs are at play. Benchmarking helps make them more visible and easier to reason about.

What the ERE benchmark measures

The ERE workload takes Ethereum Execution Spec Test (EEST) fixtures — self-contained blocks with their state witnesses — and runs each one through a stateless validation guest program inside a zkVM. Each execution client (Reth, Ethrex, and now Zilkworm) provides its own guest; the framework feeds all of them the same fixtures and records the total cycle count.

Two properties make this a fair comparison:

Same inputs, same fork. Every client validates identical blocks at the same fork, so any differences come from the guest, not the workload.
Cycles, not wall-clock. For a given guest and input, zkVM cycle counts are deterministic and host-independent. They ultimately determine proving cost, and they don't care whether you ran on a datacenter Linux box or, as we did, an emulated container on a laptop.

The integration

Adding Zilkworm to the workload took a small, self-contained change:

a new Zilkworm variant of the stateless-validation execution client;
a host-side adapter that re-encodes each fixture's stateless input into Zilkworm's unified-RLP bundle format, the wire format consumed by the C++ guest;
a download step that fetches the guest ELF and its verifying key directly from a pinned Zilkworm GitHub release tag.

That last point matters: the benchmark doesn't build Zilkworm from source, it consumes released artifacts. A successful run therefore validates the entire pipeline — the Zilkworm guest, its host adapter, and the release plumbing — in a single pass.
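To give a feel for what the host-side adapter's re-encoding step involves, here is a minimal sketch of standard Ethereum RLP encoding. The encoder follows the public RLP rules, but the bundle layout (`[block, [witness nodes]]`) and the `encode_bundle` helper are hypothetical; Zilkworm's actual unified-RLP schema lives in its repository and is not reproduced here.

```python
from typing import Sequence


def _length_prefix(length: int, offset: int) -> bytes:
    """Standard RLP prefix: short form below 56 bytes, long form otherwise."""
    if length < 56:
        return bytes([offset + length])
    length_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(length_bytes)]) + length_bytes


def rlp_encode(item) -> bytes:
    """Encode a byte string or a (nested) list of byte strings as RLP."""
    if isinstance(item, (bytes, bytearray)):
        data = bytes(item)
        # A single byte below 0x80 is its own encoding.
        if len(data) == 1 and data[0] < 0x80:
            return data
        return _length_prefix(len(data), 0x80) + data
    if isinstance(item, (list, tuple)):
        payload = b"".join(rlp_encode(child) for child in item)
        return _length_prefix(len(payload), 0xC0) + payload
    raise TypeError(f"cannot RLP-encode {type(item).__name__}")


def encode_bundle(block_rlp: bytes, witness_nodes: Sequence[bytes]) -> bytes:
    # Hypothetical layout for illustration only: [block, [node, node, ...]].
    return rlp_encode([block_rlp, list(witness_nodes)])


if __name__ == "__main__":
    print(rlp_encode([b"cat", b"dog"]).hex())
    print(encode_bundle(b"\x01", [b"ab"]).hex())
```

The point of a single bundle is that the guest reads one byte string and decodes it with one parser, rather than receiving several separately framed inputs.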

The run

Zilkworm currently consumes only the legacy ERE fixture format; support for the new EEST format will be added soon. To keep ERE benchmark runs tractable, we filtered the full fixture set with --include 10M. We then executed the resulting 1077-fixture set (tests-benchmark@v0.0.9, 10M gas, Osaka) in SP1 execute mode, which runs the guest and counts cycles without generating a proof, for Zilkworm v0.1.0-alpha.2-ere and Reth v2.1.0, on an Apple M1 Max under OrbStack. Full environment details and step-by-step instructions are in our ERE integration doc.

Correctness first

Client	Completed	Outputs matched	Mismatches
Zilkworm `v0.1.0-alpha.2-ere`	1077 / 1077	1077	0
Reth `v2.1.0`	1077 / 1077	1077	0

Both clients complete every fixture and produce byte-identical public outputs. Before comparing performance, this is the baseline — and it holds.
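A minimal sketch of the check behind this table, assuming each client's public outputs have been collected into a dictionary keyed by fixture name. The data layout and the `compare_outputs` helper are illustrative assumptions, not the harness's actual API, and the sample bytes are made up.

```python
def compare_outputs(outputs_a: dict, outputs_b: dict) -> dict:
    """Compare per-fixture public output bytes from two clients."""
    fixtures = sorted(set(outputs_a) | set(outputs_b))
    # A fixture counts as completed only if both clients produced an output.
    completed = [f for f in fixtures if f in outputs_a and f in outputs_b]
    mismatches = [f for f in completed if outputs_a[f] != outputs_b[f]]
    return {
        "total": len(fixtures),
        "completed_both": len(completed),
        "matched": len(completed) - len(mismatches),
        "mismatches": mismatches,
    }


if __name__ == "__main__":
    # Made-up sample data: two fixtures agree, one differs.
    zilkworm = {"fx_a": b"\x01\x02", "fx_b": b"\x03", "fx_c": b"\x04"}
    reth = {"fx_a": b"\x01\x02", "fx_b": b"\x03", "fx_c": b"\xff"}
    print(compare_outputs(zilkworm, reth))
```

Comparing raw bytes rather than decoded fields keeps the check strict: any divergence in the committed public values shows up as a mismatch.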

Cycle comparison

Metric	Value
Total cycles — Reth	301,880,882,267
Total cycles — Zilkworm	159,156,117,864
Ratio (Zilkworm / Reth)	0.527 — Reth uses 1.90× as many cycles
Median per-fixture ratio	0.654
Fixtures where Zilkworm is cheaper	906 / 1077 (84%)

Across the suite, Zilkworm executes the same blocks in just over half the cycles and is cheaper on the vast majority of fixtures.
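The derived figures in the table follow directly from the two totals. The short script below recomputes them from the reported numbers, then uses a made-up three-fixture example (the `toy` list and the `aggregate_and_median` helper are illustrative only) to show why the aggregate ratio and the median per-fixture ratio can differ: the aggregate is weighted by cycle count, so a few heavy fixtures dominate it.

```python
from statistics import median

# Totals reported in the table above.
RETH_TOTAL = 301_880_882_267
ZILKWORM_TOTAL = 159_156_117_864
CHEAPER_FIXTURES, ALL_FIXTURES = 906, 1077

aggregate_ratio = ZILKWORM_TOTAL / RETH_TOTAL
reth_multiple = RETH_TOTAL / ZILKWORM_TOTAL
cheaper_share = CHEAPER_FIXTURES / ALL_FIXTURES

print(f"aggregate ratio: {aggregate_ratio:.3f}")  # 0.527
print(f"Reth multiple:   {reth_multiple:.2f}x")   # 1.90x
print(f"cheaper share:   {cheaper_share:.0%}")    # 84%


def aggregate_and_median(pairs):
    """pairs: list of (reth_cycles, zilkworm_cycles), one entry per fixture."""
    total_reth = sum(r for r, _ in pairs)
    total_zilk = sum(z for _, z in pairs)
    return total_zilk / total_reth, median(z / r for r, z in pairs)


# Made-up numbers: one heavy fixture with a large win, two light ones
# with small wins. The heavy fixture pulls the aggregate well below the median.
toy = [(1000, 200), (100, 70), (100, 90)]
toy_aggregate, toy_median = aggregate_and_median(toy)
print(f"toy aggregate: {toy_aggregate:.2f}, toy median: {toy_median:.2f}")
```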

Where the difference comes from

The aggregate hides a more interesting story. Grouping by test family (ratio < 1 means Zilkworm is cheaper):

modexp                0.15      ← u256x2048 multiply syscall
bls12_381             0.23
alt_bn128             0.47
sha256                0.54
arithmetic            0.60
keccak                0.84
call_context          0.85
memory                0.98
-------------------------------- Reth cheaper below
block_context         1.97
ripemd160             3.06
p256verify           16.13      ← no dedicated syscall (yet)
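The grouping above can be sketched as follows. This assumes each family's ratio is the sum of Zilkworm cycles divided by the sum of Reth cycles over that family's fixtures; the post does not spell out the exact aggregation, so treat that as an assumption. The `family_ratios` helper and the sample rows are hypothetical toy values, not benchmark data.

```python
from collections import defaultdict


def family_ratios(rows):
    """rows: iterable of (family, reth_cycles, zilkworm_cycles).

    Returns {family: zilkworm/reth ratio}, sorted from most to least
    favorable for Zilkworm (ratio < 1 means Zilkworm is cheaper).
    """
    totals = defaultdict(lambda: [0, 0])
    for family, reth, zilk in rows:
        totals[family][0] += reth
        totals[family][1] += zilk
    ratios = {fam: zilk / reth for fam, (reth, zilk) in totals.items()}
    return dict(sorted(ratios.items(), key=lambda kv: kv[1]))


if __name__ == "__main__":
    # Toy cycle counts for illustration only.
    rows = [
        ("modexp", 1000, 200),
        ("modexp", 3000, 600),
        ("keccak", 500, 400),
        ("p256verify", 100, 1000),
    ]
    for family, ratio in family_ratios(rows).items():
        side = "Zilkworm cheaper" if ratio < 1 else "Reth cheaper"
        print(f"{family:<12} {ratio:>6.2f}  {side}")
```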

The largest gains come from precompiles Zilkworm accelerates via SP1 syscalls. The most extreme case is modexp, where the syscall-accelerated 256×2048-bit multiply turns a ~1.2-billion-cycle Reth execution into ~26 million cycles, roughly a 48× reduction.

The losses are just as informative. p256verify (RIP-7212 secp256r1), which has no dedicated syscall yet, costs Zilkworm ~16× more than Reth; RIPEMD-160 shows the same pattern at ~3×. A handful of very lightweight blocks (block_context opcodes) also favor Reth, whose smaller fixed per-block overhead wins on workloads that do little else.
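The fixed-overhead effect is easy to see in a simple linear cost model, sketched below. The model (cycles = fixed overhead + per-unit cost × work) and every constant in it are hypothetical, chosen only to illustrate the shape of the effect; they are not measured values for either client.

```python
def cycle_ratio(work, fixed_a, per_unit_a, fixed_b, per_unit_b):
    """Ratio of client A's cycles to client B's under a linear cost model."""
    return (fixed_a + per_unit_a * work) / (fixed_b + per_unit_b * work)


def break_even_work(fixed_a, per_unit_a, fixed_b, per_unit_b):
    """Amount of work at which both clients cost the same number of cycles."""
    return (fixed_a - fixed_b) / (per_unit_b - per_unit_a)


if __name__ == "__main__":
    # Hypothetical constants: client A has the larger fixed per-block overhead
    # but the cheaper per-unit cost; client B is the other way round.
    a = dict(fixed_a=2_000_000, per_unit_a=50)
    b = dict(fixed_b=1_000_000, per_unit_b=100)

    for work in (0, 20_000, 1_000_000):
        print(f"work={work:>9}: ratio A/B = {cycle_ratio(work, **a, **b):.3f}")
    print("break-even work:", break_even_work(**a, **b))
```

With these toy numbers, client A is twice as expensive on an empty block, breaks even at 20,000 units of work, and approaches a ratio of 0.5 as the block gets heavier. That is the qualitative pattern the block_context fixtures show.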

We are reporting these openly because they directly shape the performance roadmap: the benchmark now tells us exactly what we should tackle next.

Reproduce it yourself

Everything here is reproducible from the Zilkworm repository. Two make targets drive the workload harness (cloning the fork, generating fixtures, running the SP1 guests):

# Validate Zilkworm only: 1077/1077 success + output match
make ere-validate

# Run both Zilkworm and Reth and print the cycle comparison
make ere-compare

make ere-compare prints the summary, per-family ratios, and best/worst fixtures. The detailed walkthrough — exact versions, fixture generation flags, and how to run memory-heavy fixtures on resource-constrained machines — is in our ERE integration doc.

Takeaways

Zilkworm validates the same 1077 EEST blocks as Reth with identical outputs, using about 53% of the SP1 cycles overall (Reth needs ~1.9× as many).
The advantage is broad (906 of 1077 fixtures) but largest in precompile-heavy execution.
The remaining gaps are specific, measured, and actionable.

The results are reproducible and speak for themselves.

Note: Fixture witnesses are regenerated on every run by the harness's witness-generator-cli, and that step is not bitwise deterministic — the accessed trie nodes are serialized in unordered-set order. Every variant is a valid witness, so absolute cycle totals may shift by ~0.01% run-to-run while the ratio and per-family breakdown stay stable.
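A small sketch of why set-ordered serialization produces different bytes for the same witness. The length-prefixed format here is a stand-in chosen for illustration, not the witness generator's real encoding, and the helper names are hypothetical. Sorting is shown only to demonstrate that the variants carry identical content; it is not something the harness is known to do.

```python
import struct


def serialize(nodes) -> bytes:
    """Concatenate nodes, each prefixed with a 4-byte big-endian length."""
    out = bytearray()
    for node in nodes:
        out += struct.pack(">I", len(node)) + node
    return bytes(out)


def deserialize(blob: bytes) -> list:
    """Inverse of serialize: recover the list of nodes in stored order."""
    nodes, pos = [], 0
    while pos < len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        nodes.append(blob[pos:pos + length])
        pos += length
    return nodes


def relative_diff(a: int, b: int) -> float:
    """Relative difference between two cycle totals."""
    return abs(a - b) / max(a, b)


if __name__ == "__main__":
    nodes = [b"\xaa", b"\xbb\xcc", b"\xdd"]
    run1 = serialize(nodes)
    run2 = serialize(list(reversed(nodes)))  # same nodes, different order

    print("bytes identical:", run1 == run2)
    print("same node set:  ", set(deserialize(run1)) == set(deserialize(run2)))
    print("canonical equal:",
          serialize(sorted(deserialize(run1))) == serialize(sorted(deserialize(run2))))
```

When comparing totals across runs, a tolerance check such as `relative_diff(total_a, total_b) <= 1e-4` (that is, 0.01%) is more appropriate than exact equality.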

Zilkworm is open source at github.com/erigontech/zilkworm.

Support for Zilkworm in the ERE benchmark was introduced in PR #291.
