CONFIDENTIAL & PROPRIETARY © 2026 Inkwell Finance, Inc. All Rights Reserved. This document is for informational purposes only and does not constitute legal, tax, or investment advice, nor an offer to sell or a solicitation to buy any security or other financial instrument. Any examples, structures, or flows described here are design intent only and may change.

Dagon’s single-batch match latency has a floor. Its per-batch amortized cost does not.
The property
FHE operations are O(1) per ciphertext regardless of how many of the ring’s slots carry meaningful data. A bootstrap on a ciphertext whose internal ring supports thousands of slots takes the same wall-clock whether all slots are populated or only a handful are. That leaves a large unused slice of ring capacity at the typical batch size — which we can fill by interleaving independent matches into the same ciphertexts. Every interleaved match rides the same compute pass.

What we see in practice
On a consumer-class GPU, the measured wall-clocks for a single-batch match and a fully packed match (many concurrent batches interleaved into the same ring) differ by only a few percent. Everything else divides: the amortized cost per batch at full-ring pack is two to three orders of magnitude lower than the single-batch number. The GPU is already kernel-saturated on a single batch. Memory and compute throughput sit near 100% of device capacity during match. The remaining ring slots aren’t idle for lack of optimization — they’re idle because no batch is riding them. Interleaving fixes that.

Two packing modes
Stride-aware packing
Interleaves independent batches at regular slot positions inside
one ciphertext. Rotation keys provisioned per stride. Nearly
free amortization up to the ring’s capacity.
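Stride-aware packing is easiest to picture with a plaintext sketch. The Python below is illustrative only — `RING_SLOTS`, `BATCH_SIZE`, `pack`, and `unpack` are assumed names, not Dagon's API; real rings hold thousands of slots and the data stays encrypted end to end. The point it shows is just the layout: batch b occupies slots b, b + stride, b + 2·stride, …, so one pass over the slot vector touches every interleaved batch.

```python
# Plaintext sketch of stride-aware slot interleaving. Toy sizes; a real
# FHE ring has thousands of slots and one rotation key per stride.

RING_SLOTS = 16   # toy ring capacity (illustrative)
BATCH_SIZE = 4    # slots one match batch consumes (illustrative)

def pack(batches):
    """Interleave batches so batch b sits at slots b, b+stride, b+2*stride, ..."""
    stride = len(batches)
    assert stride * BATCH_SIZE <= RING_SLOTS, "exceeds ring capacity"
    slots = [0] * RING_SLOTS
    for b, batch in enumerate(batches):
        for i, value in enumerate(batch):
            slots[b + i * stride] = value
    return slots, stride

def unpack(slots, stride, batch_size=BATCH_SIZE):
    """Recover each batch by striding back through the slot vector."""
    return [[slots[b + i * stride] for i in range(batch_size)]
            for b in range(stride)]

batches = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
slots, stride = pack(batches)
assert unpack(slots, stride) == batches  # layout round-trips losslessly
```

Because every batch rides the same slot vector, the single compute pass that would have served one batch now serves all of them, which is where the near-free amortization comes from.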
Multi-ciphertext loop
Runs the full graph repeatedly on separate ciphertexts. Useful only
when concurrency needs exceed what stride-packing can fit in a
single ring. Net-negative on a saturated GPU because per-graph
setup cost dominates.
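Why the loop mode loses on a saturated GPU can be seen in a toy cost model. This is a sketch under stated assumptions: `MATCH_S` reuses the 7.77 s single-batch anchor from the benchmarks, while `PACK_OVERHEAD`, `LOOP_FRACTION`, and `CAPACITY` are illustrative constants, not measured Dagon numbers.

```python
# Toy cost model for the two packing modes. Assumptions (not measured):
# a fully packed pass costs only a few percent more than a single batch,
# while the multi-ciphertext loop re-pays most of the per-graph setup
# on every extra ciphertext.

MATCH_S = 7.77        # single-batch wall-clock anchor (seconds)
PACK_OVERHEAD = 1.03  # "a few percent" extra when the ring is full (assumed)
LOOP_FRACTION = 0.95  # share of single-batch cost each loop pass re-pays (assumed)
CAPACITY = 512        # concurrent batches one ring can hold (assumed)

def stride_amortized_s(n_batches):
    # One saturated compute pass clears every interleaved batch.
    packed = min(n_batches, CAPACITY)
    return MATCH_S * PACK_OVERHEAD / packed

def loop_amortized_s(n_batches):
    # Each extra ciphertext reruns the graph; per-batch cost stays flat.
    return MATCH_S * LOOP_FRACTION

print(f"stride @ {CAPACITY}: {stride_amortized_s(CAPACITY) * 1e3:.1f} ms/batch")
print(f"loop   @ {CAPACITY}: {loop_amortized_s(CAPACITY):.2f} s/batch")
```

Under these assumptions the stride mode divides the pass cost across every packed batch, while the loop mode's per-batch cost barely moves — which is why the loop is worth running only past the ring's capacity.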
Market-scale throughput
Two different numbers answer two different questions:

- Single-batch latency — the UX number. How long does a trader wait for a batch to clear? Seconds. The measured H100 N=32 anchor is 7.77 s (every output decrypted and asserted).
- Amortized-per-batch cost — the throughput number. How many match batches can the engine clear per unit of time at full packing? Milliseconds.
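The arithmetic separating the two numbers is short enough to write down. Only the 7.77 s anchor is a measured figure; the packed-batch count and the few-percent packing overhead below are assumed for illustration.

```python
# Latency vs. throughput from the same fully packed run.
# LATENCY_S is the measured H100 N=32 anchor; PACKED_BATCHES and the
# 3% packing overhead are illustrative assumptions, not measurements.

LATENCY_S = 7.77                    # what one trader waits; packing doesn't change it
PACKED_BATCHES = 500                # assumed concurrent batches in the ring
PACKED_WALL_S = LATENCY_S * 1.03    # packed pass: "only a few percent" slower

amortized_ms = PACKED_WALL_S * 1e3 / PACKED_BATCHES
throughput = PACKED_BATCHES / PACKED_WALL_S  # batches cleared per second

print(f"latency:    {LATENCY_S:.2f} s per batch (UX number)")
print(f"amortized:  {amortized_ms:.1f} ms per batch (throughput number)")
print(f"throughput: {throughput:.1f} batches/s")
```

The trader still waits seconds; the engine still clears a batch every few milliseconds. Both statements describe the same run.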
A timing measurement, not yet a production guarantee
The amortization measurements are cost-correct: real kernels executing on real hardware, producing real wall-clocks. Semantic correctness of every packing variant against the plaintext reference is an ongoing engineering gate; some packing modes are still being hardened for production use. The throughput numbers are what the infrastructure can do — production rollout follows when the full correctness matrix is green.

Related
Benchmarks
End-to-end wall-clock on consumer + datacenter GPU, and what
drives it.
Interactive dashboard ↗
Toggle stride, scheme, and sign complexity to see the amortization
curve shift against measured anchors.