Dagon’s single-batch match latency has a floor. Its per-batch amortized cost does not.

The property

FHE operations are O(1) per ciphertext regardless of how many of the ring’s slots carry meaningful data. A bootstrap on a ciphertext whose internal ring supports thousands of slots takes the same wall-clock time whether all slots are populated or only a handful are. At typical batch sizes, that leaves a large unused slice of ring capacity, which we can fill by interleaving independent matches into the same ciphertexts. Every interleaved match rides the same compute pass.
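
To see the property concretely, here is a toy analogy in plain numpy (numpy arrays standing in for ring slots; this is an analogy only, not real FHE): a dense vector operation costs the same wall-clock whether one slot or every slot carries live data.

```python
import numpy as np
import time

# Toy analogy for the O(1)-per-ciphertext property: a dense vector op
# costs the same whether 1 slot or all slots carry live data. Plain
# numpy stands in for FHE ring slots; nothing here is real FHE.

RING_SLOTS = 1 << 15

one_live = np.zeros(RING_SLOTS)
one_live[0] = 1.0                       # a single populated slot
all_live = np.random.rand(RING_SLOTS)   # every slot populated

for name, vec in [("1 slot live", one_live), ("all slots live", all_live)]:
    t0 = time.perf_counter()
    for _ in range(10_000):
        vec = vec * 1.000001 + 0.5      # stand-in for one per-ciphertext op
    print(f"{name}: {time.perf_counter() - t0:.3f} s")  # near-identical
```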

What we see in practice

On a consumer-class GPU, the measured wall-clock times for a single-batch match and a fully packed match (many concurrent batches interleaved into the same ring) differ by only a few percent. Everything else divides: the amortized cost per batch at full-ring pack is two to three orders of magnitude lower than the single-batch number. The GPU is already kernel-saturated on a single batch; memory and compute throughput sit near 100% of device capacity during match. The remaining ring slots aren’t idle for lack of optimization; they’re idle because no batch is riding them. Interleaving fixes that.
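
The division is worth writing out. In the sketch below, only the 7.77 s H100 anchor quoted later on this page is taken from the source; the 3% packing overhead and the pack factor of 128 are illustrative assumptions.

```python
# Amortization arithmetic for the behavior described above. Only the
# 7.77 s single-batch anchor is quoted from the page; the overhead and
# pack factor are illustrative assumptions.

single_batch_s = 7.77      # measured H100 N=32 anchor (from the text)
pack_overhead = 1.03       # assumption: fully packed run is ~3% slower
k = 128                    # assumption: batches interleaved at full pack

packed_wall_clock_s = single_batch_s * pack_overhead
amortized_per_batch_ms = packed_wall_clock_s / k * 1000

print(f"packed run total:    {packed_wall_clock_s:.2f} s")      # ~8.00 s
print(f"amortized per batch: {amortized_per_batch_ms:.0f} ms")  # ~63 ms
```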

Two packing modes

Stride-aware packing

Interleaves independent batches at regular slot positions inside one ciphertext. Rotation keys are provisioned per stride. Amortization is nearly free up to the ring’s capacity: each additional interleaved batch adds almost no wall-clock.
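
A minimal plaintext sketch of what stride interleaving means for slot layout (the layout, names, and toy sizes here are illustrative assumptions, not Dagon’s actual format):

```python
# Sketch of a stride-aware slot layout. Batch i's j-th element lands at
# slot j * STRIDE + i, so up to STRIDE independent batches interleave in
# one ciphertext. In a real scheme, rotation keys at stride multiples
# would let the match circuit step through a batch's slots.

RING_SLOTS = 16   # toy ring; real rings support thousands of slots
STRIDE = 4        # up to 4 batches interleaved in this toy

def pack(batches):
    assert len(batches) <= STRIDE
    slots = [0] * RING_SLOTS
    for i, batch in enumerate(batches):
        for j, v in enumerate(batch):
            slots[j * STRIDE + i] = v
    return slots

def unpack(slots, i):
    # batch i occupies every STRIDE-th slot, starting at offset i
    return slots[i::STRIDE]

packed = pack([[1, 2, 3, 4], [5, 6, 7, 8]])
print(packed)             # [1, 5, 0, 0, 2, 6, 0, 0, 3, 7, 0, 0, 4, 8, 0, 0]
print(unpack(packed, 1))  # [5, 6, 7, 8]
```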

Multi-ciphertext loop

Runs the full graph repeatedly on separate ciphertexts. Useful only when concurrency needs exceed what stride-packing can fit in a single ring. Net-negative on a saturated GPU, because per-graph setup cost dominates.

For realistic concurrent match volumes, the stride-aware path covers the whole working range with no meaningful wall-clock penalty.
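
The implied dispatch rule, sketched under assumed capacity numbers (the ring size, slots-per-batch, and the function itself are hypothetical):

```python
# Hypothetical dispatch rule implied by the two modes (a sketch, not
# production logic): stay on stride packing until concurrency exceeds
# what one ring can hold, then spill into extra ciphertexts.

RING_SLOTS = 4096                            # assumed ring capacity
SLOTS_PER_BATCH = 32                         # assumed slots per match batch
MAX_PACKED = RING_SLOTS // SLOTS_PER_BATCH   # 128 batches per ring

def plan(concurrent_batches):
    """Return (ciphertexts_needed, batches_in_last_ring)."""
    full, rem = divmod(concurrent_batches, MAX_PACKED)
    return full + (1 if rem else 0), rem or MAX_PACKED

print(plan(100))  # (1, 100): stride packing alone covers it
print(plan(300))  # (3, 44):  only now does the multi-ciphertext loop engage
```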

Market-scale throughput

Two different numbers answer two different questions:
  • Single-batch latency — the UX number. How long does a trader wait for a batch to clear? Seconds. The measured H100 N=32 anchor is 7.77 s (every output decrypted and asserted).
  • Amortized-per-batch cost — the throughput number. How many match batches can the engine clear per unit of time at full packing? Milliseconds.

On a saturated datacenter GPU, the amortized cost can cross into sub-Solana-slot territory at full-ring pack: the engine clears a batch in less wall-clock time than a block takes to produce. Network I/O and on-chain settlement become the scaling bottleneck at that point, not FHE.
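
The sub-slot comparison is plain division. In the sketch below, the 7.77 s anchor is the measured number above, the pack factor is an assumption, and 400 ms is Solana’s nominal slot time.

```python
# Arithmetic behind the sub-Solana-slot claim. Only the 7.77 s anchor is
# quoted from the text; the pack factor is an assumption, and 400 ms is
# Solana's nominal target slot time.

SOLANA_SLOT_MS = 400
single_batch_s = 7.77     # measured H100 N=32 anchor
pack_factor = 128         # assumption: full-ring pack

amortized_ms = single_batch_s / pack_factor * 1000
print(f"amortized per batch:    {amortized_ms:.0f} ms")             # ~61 ms
print(f"clears within one slot: {amortized_ms < SOLANA_SLOT_MS}")   # True
```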

A timing measurement, not yet a production guarantee

The amortization measurements are cost-correct: real kernels executing on real hardware, producing real wall-clock times. Semantic correctness of every packing variant against the plaintext reference is an ongoing engineering gate; some packing modes are still being hardened for production use. The throughput numbers are what the infrastructure can do; production rollout follows when the full correctness matrix is green.
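
Conceptually, the gate is a matrix check: every packing variant, on every test input, must agree with the plaintext reference. A stub-level sketch (every name here is a hypothetical placeholder, not Dagon’s API):

```python
# Stub-level sketch of the correctness matrix described above. All names
# are hypothetical; the stubs stand in for the real encrypt -> packed
# match -> decrypt pipeline and its plaintext reference.

def plaintext_match(orders):
    return sorted(orders)          # reference semantics (stub)

def fhe_match(orders, packing_mode):
    return sorted(orders)          # packed FHE path (stub)

def correctness_matrix_green(order_sets, packing_modes):
    """Green only if every (mode, input) cell matches the reference."""
    return all(
        fhe_match(orders, mode) == plaintext_match(orders)
        for mode in packing_modes
        for orders in order_sets
    )

print(correctness_matrix_green([[3, 1, 2], [9, 7]], ["stride", "multi_ct"]))
```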

Benchmarks

End-to-end wall-clock times on consumer and datacenter GPUs, and what drives them.

Interactive dashboard ↗

Toggle stride, scheme, and sign complexity to see the amortization curve shift against measured anchors.