I accidentally made the fastest event system in the world
This is the story of rt-events, a ~200-LOC Rust crate I wrote because I wanted a nicely-typed in-process pub/sub. I did not set out to make it fast. I benchmarked it against ten other languages out of idle curiosity and it came out ahead of a monomorphized C++ game-dev library by roughly 4× at sub-fanout dispatch. The hot loop is six instructions. I cannot figure out how to make it go any faster. If you can, the puzzle is at the bottom.
Everything here is synchronous, single-threaded, same-process. No async brokers, no Kafka, no Redis. Fairness protocol: METHODOLOGY.md. Every number in this post cites a row in results/all.csv; if you disagree, the row is the source of truth.
What I actually wanted
In most languages, “publish/subscribe” means strings:
```js
emitter.on("hit", (payload) => { /* was the field .damage? .Damage? .DAMAGE? */ })
```
The compiler has no idea what shape payload is. If you rename a field in the emitter, the callbacks silently break. If you typo the topic name, the handler never fires and nothing tells you. This is fine for ad-hoc scripts and it is what every event emitter I had ever used looked like.
I wanted the type system to do the work:
```rust
bus.on::<Hit>(|e: &Hit| { // e is definitely &Hit — compiler says so.
    println!("took {} damage", e.damage);
});

bus.emit(Hit { damage: 7 }); // payload shape is Hit, or this is a type error.
```
That was the whole pitch. Type-checked dispatch. If the handler signature doesn’t match the emit site, the program doesn’t compile. Performance was not a goal — it was assumed to be “fine for an interactive simulation,” whatever that meant.
I wrote the obvious implementation in an afternoon:
```rust
HashMap<TypeId, Vec<Box<dyn Fn(&dyn Any)>>>
```
TypeId::of::<E>() as the map key, handlers as Box<dyn Fn(&dyn Any)> in a Vec, downcast at the handler boundary. Two hundred lines. Zero dependencies. Shipped.
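For concreteness, here is a minimal self-contained sketch of that afternoon implementation — the shape described above, not rt-events' exact source (the names `EventBus`, `on`, and `emit` are illustrative):

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

// Minimal sketch of the naive design: one bucket of type-erased handlers
// per event TypeId, with a downcast at the handler boundary.
#[derive(Default)]
struct EventBus {
    subscribers: HashMap<TypeId, Vec<Box<dyn Fn(&dyn Any)>>>,
}

impl EventBus {
    fn on<E: 'static>(&mut self, handler: impl Fn(&E) + 'static) {
        self.subscribers
            .entry(TypeId::of::<E>())
            .or_default()
            .push(Box::new(move |any: &dyn Any| {
                // Every handler in this bucket was registered for E,
                // so this downcast always succeeds.
                if let Some(e) = any.downcast_ref::<E>() {
                    handler(e);
                }
            }));
    }

    fn emit<E: 'static>(&self, event: E) {
        // Miss path: an unknown TypeId is a single failed map lookup.
        if let Some(handlers) = self.subscribers.get(&TypeId::of::<E>()) {
            for h in handlers {
                h(&event);
            }
        }
    }
}

struct Hit { damage: u32 }

fn main() {
    let mut bus = EventBus::default();
    bus.on::<Hit>(|e| println!("took {} damage", e.damage));
    bus.emit(Hit { damage: 7 }); // type-checked: the payload shape is Hit
}
```

The type-erasing wrapper closure is exactly the `Box<dyn Fn(&dyn Any)>` indirection the post's later PR removes.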
Then I benchmarked it, for fun
I wrote a Criterion bench just to see if it was “fast enough.” The numbers were… surprisingly good. Tens of nanoseconds per single-subscriber dispatch. Single-digit nanoseconds for the no-subscriber miss path. I assumed that was what a competent compiler did for any small library and moved on.
Then the question crept in: is this actually good, or just good relative to my expectations? The only way to answer it was to run the same benches against every in-process event library I could get my hands on.
So I built rt-event-benches: 11 languages, one 14-scenario suite (spec: SPEC.md). Idiomatic libraries only — the thing an engineer in that ecosystem would actually reach for. C++ entt::dispatcher, Go asaskevich/EventBus, Node eventemitter3, CPython blinker, Ruby wisper, plus Java Guava, C# MediatR, Kotlin SharedFlow, Swift Combine, Elixir Registry.
Post-PR rt-events vs. the competitors (N=1 unless noted, same machine, same session, performance governor, taskset -c 2):
| Scenario | rt-events | C++ entt | Go EventBus | Node ee3 | CPython blinker | Ruby Wisper |
|---|---|---|---|---|---|---|
| emit_zst/1 | 16 ns | 32 ns | 838 ns | 3,188 ns | 4,522 ns | 52,392 ns |
| emit_zst/1000 | 1,289 ns | 5,724 ns | — | — | — | — |
| emit_no_subs | 1 ns | 12 ns | 201 ns | 960 ns | 489 ns | 21,755 ns |
| throughput_zst/1M (ops/sec) | 34.7 M | 18.2 M | 114 k | 1.8 M | n/a | n/a |
That was not the expected outcome. rt-events beats a monomorphized template-heavy C++ library on every dispatch scenario at N ≥ 10, and it beats every dynamic-language competitor by two-to-four orders of magnitude.
Five other entries (Java/Guava, C#/MediatR, Kotlin/SharedFlow, Swift/Combine, Elixir/Registry) are scaffolded and run clean, but the bench host lacked the runtime or the harness timed out; their rows appear in all.csv with unit=n/a. A future run on a warmer host will fill them in.
It took me two bugs to believe the C++ numbers
The first version of my C++ bench came back with emit_zst/1000 at 18 ns — less than the 1-subscriber number. That’s not physics, it’s a bug. Two bugs, actually, stacked:
- entt’s sink deduplicates on `(candidate, payload)`. Look at sigh.hpp:409 — `connect<Candidate>(payload...)` first calls `disconnect<Candidate>(payload...)`. Registering the same free function 1000 times yields exactly one subscriber. I was measuring N=1 dispatch the whole time and calling it N=1000. Fix: `connect<&handler_p>(payloads[i])` with a distinct payload address per registration gives entt N genuinely distinct delegates.
- My throughput normalizer double-counted iterations. google/benchmark reports `real_time` per state iteration (not total); I was multiplying by iteration count on top of that, inflating reported ops/sec by ~15,000×. Fix: use google/benchmark’s own `items_per_second`, which it computes correctly from `state.SetItemsProcessed`.
Both bugs were in my harness, not in entt. entt’s numbers scale exactly how you’d expect from a well-written monomorphized library. And rt-events still wins.
The lesson isn’t “I had bugs” — it’s that a cross-language bench is really a consistency check on your own harness. If library A scales linearly and library B’s numbers do something impossible, your B harness is the one lying to you.
Where rt-events wins
Sub-fanout dispatch at every N ≥ 10.
| N | rt-events | entt | ratio |
|---|---|---|---|
| 1 | 16 ns | 32 ns | rt-events 2.0× |
| 10 | 23 ns | 91 ns | rt-events 4.0× |
| 100 | 152 ns | 630 ns | rt-events 4.1× |
| 1000 | 1,289 ns | 5,724 ns | rt-events 4.4× |
entt’s hot loop is a tight call *0x8(%rax) through its delegate table. rt-events’ is call *(%r15) through the trampoline pair (details below). Same shape. The gap is in the per-subscriber overhead: entt’s delegate has a short prologue that conditionally dispatches on whether it holds a free fn, free fn + payload, or member fn — a branch on every call. rt-events’ trampoline is one shape always.
Empty-path dispatch (emit_no_subs). rt-events returns in 1 ns when nothing is subscribed for the event type. entt takes 12 ns; Go 201 ns, Node 960 ns, Ruby ~22 µs. This matters for “high-frequency telemetry no one happens to be listening to,” which is more common than people admit.
Tail latency. rt-events’ stddev stays within small single digits of the median across every dispatch scenario. Node’s p99 for emit_zst/1 is 15 ms against a 3.2 µs median — a 4,700× gap, explained entirely by V8’s stop-the-world GC landing on the bench path. A GC-free runtime doesn’t have that failure mode to begin with.
Where rt-events loses
entt wins exactly two scenarios, and they point at the same thing:
| scenario | rt-events | entt | winner |
|---|---|---|---|
| emit_type_miss (100 wrong-type subs, emit 1) | 13 ns | 9 ns | entt 1.4× |
| emit_with_10_types (10 types registered, emit 1) | 24 ns | 9 ns | entt 2.7× |
entt indexes its sighs in a type-parameterized container — dispatcher.trigger<T>() resolves at compile time to a bucket lookup. rt-events hashes TypeId::of::<E>() into a HashMap. At a single registered type the two look identical; once the bus holds 10 types, the hash work shows up. A Rust implementation keyed by a const-eval’d array index would close this gap, at the cost of some API awkwardness around const generics.
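To make the trade-off concrete, here is a hypothetical sketch of that slot-indexed alternative — each event type carries a compile-time slot, so emit does an array index instead of a TypeId hash. This is not rt-events' design, and for brevity it uses an associated const rather than const generics (a macro or const generics would assign slots automatically); the buckets just count emits instead of holding real subscriber lists:

```rust
// Hypothetical slot-indexed bus: dispatch is a constant-index array access,
// not a HashMap<TypeId, _> lookup.
trait Slotted {
    const SLOT: usize;
}

struct Hit;
impl Slotted for Hit { const SLOT: usize = 0; }

struct Tick;
impl Slotted for Tick { const SLOT: usize = 1; }

const NUM_SLOTS: usize = 2;

struct SlotBus {
    // Stand-in buckets: a real bus would hold type-erased subscriber
    // lists here; counting emits keeps the sketch small.
    counts: [u64; NUM_SLOTS],
}

impl SlotBus {
    fn new() -> Self {
        SlotBus { counts: [0; NUM_SLOTS] }
    }

    fn emit<E: Slotted>(&mut self) {
        // E::SLOT is a compile-time constant: no hashing on the hot path.
        self.counts[E::SLOT] += 1;
    }
}

fn main() {
    let mut bus = SlotBus::new();
    bus.emit::<Hit>();
    bus.emit::<Hit>();
    bus.emit::<Tick>();
    println!("{:?}", bus.counts);
}
```

The awkwardness the post mentions is visible even here: every event type must be enrolled in the slot scheme ahead of time, which is exactly what the open `HashMap`-keyed design avoids.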
CPython. blinker under CPython 3.14 dispatches a single-sub ZST emit in ~4.5 µs — roughly 300× slower than rt-events. That’s the language, not the library. PyPy would close most of this gap; PyPy wasn’t on the bench host for this run.
Where the comparison is unfair, and I say so
Stringly-typed dispatchers pay hash-string cost I don’t. rt-events’ TypeId::of::<E>() is one instruction — a mov of a compile-time constant. eventemitter3, asaskevich/EventBus, Wisper, and blinker all hash the topic string on every publish. That’s not a bug in those libraries; it’s their contract. A fair like-for-like would be to a Rust crate with stringly-typed events; none of the popular ones fit.
Empty handler bodies can still be partially DCE’d. Both rt-events and entt compile handler bodies in a separate translation unit (Criterion does this implicitly; the C++ build uses a separate handlers.cpp with no LTO), and both use black_box / DoNotOptimize on observable payload fields. That stops cross-TU inlining. The emit_zst numbers are directionally right but shouldn’t be read as absolute per-instruction costs. emit_small (4 B payload) and emit_large (~80 B heap) are the discriminating scenarios.
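For reference, the `black_box` side of that countermeasure looks like this — a generic sketch of the pattern, not the benches' actual handler code:

```rust
use std::hint::black_box;

// black_box is an identity function the optimizer must treat as opaque:
// it cannot prove the value is unused, so a handler that routes its
// payload through it cannot be deleted as dead code, and neither can
// the dispatch feeding it.
fn handler(damage: u32) -> u32 {
    black_box(damage)
}

fn main() {
    let mut total = 0u32;
    for i in 0..10u32 {
        // black_box on the input also stops the compiler from
        // constant-folding the whole loop away.
        total += handler(black_box(i));
    }
    println!("{total}");
}
```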
Elixir Registry routes every dispatch through a process mailbox. It’s measured in microseconds, not nanoseconds, because it’s doing a different thing — cross-process pub/sub with isolation. Not apples to apples.
Kotlin MutableSharedFlow wraps every emit in coroutine scheduling. For a sync-single-threaded bus comparison, you’re paying for a feature you don’t use.
The PR that bought ~2×
Running my own suite surfaced that rt-events had a problem of its own. emit_zst/1000 was 6.87 µs pre-PR. perf annotate pinned the indirect call through the Box<dyn Fn> vtable at ~48% of the hot-path time.
The original dispatch path:
```text
emit::<E>
  → for each Box<dyn Fn(&dyn Any)> in subscribers[TypeId::of::<E>()]:
      → call through Box vtable (indirect call #1)
      → inside the closure, any.downcast_ref::<E>()
          → Any::type_id() via vtable (indirect call #2)
          → compare with TypeId::of::<E>()
          → reinterpret the pointer
      → user callback runs
```
Two indirect calls per subscriber, plus a runtime type check I already knew the answer to — the outer HashMap is keyed by TypeId, so every handler at subscribers[TypeId::of::<E>()] is known-for-E by construction. perf record -F 4999 --call-graph fp on emit_zst/1000 broke the time down as:
| % self | symbol | what |
|---|---|---|
| 43.7 % | EventBus::emit::<Tick> | dispatch loop |
| 41.1 % | on::{closure} wrapper | the Box<dyn Fn> body, running downcast_ref::<E>() |
| 14.4 % | <Tick as Any>::type_id | vtable call inside downcast_ref |
PR #1 replaces the boxed dyn Fn with a (data, fn_ptr) trampoline pair:
```rust
struct Subscriber {
    data: *const (),                        // Box::<F>::into_raw
    call: unsafe fn(*const (), *const ()),  // call_trampoline::<E, F>
    id: SubscriptionId,
    drop: unsafe fn(*const ()),             // drop_trampoline::<F>
}

unsafe fn call_trampoline<E: 'static, F: Fn(&E)>(data: *const (), event: *const ()) {
    let f = unsafe { &*(data as *const F) };
    let e = unsafe { &*(event as *const E) };
    f(e);
}
```
Each trampoline is monomorphized at subscribe time for the concrete closure type F and event type E. At dispatch, we already know the type (vec index), so the downcast_ref is provably redundant — gone. Safety is discharged by a single invariant on the subscriber vec, proven in docs/internal/trampoline.md.
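To see the whole pattern in one place, here is a stripped-down, self-contained version of the trampoline dispatch: a single event type, no id or drop bookkeeping (the closure is deliberately leaked here, where the real crate pairs registration with a drop trampoline), and illustrative names throughout:

```rust
use std::marker::PhantomData;

struct Subscriber {
    data: *const (),
    call: unsafe fn(*const (), *const ()),
}

// Monomorphized once per (event type, closure type) pair at subscribe time.
unsafe fn call_trampoline<E: 'static, F: Fn(&E)>(data: *const (), event: *const ()) {
    let f = unsafe { &*(data as *const F) };
    let e = unsafe { &*(event as *const E) };
    f(e);
}

struct TypedBus<E: 'static> {
    subs: Vec<Subscriber>,
    _marker: PhantomData<E>,
}

impl<E: 'static> TypedBus<E> {
    fn new() -> Self {
        TypedBus { subs: Vec::new(), _marker: PhantomData }
    }

    fn on<F: Fn(&E) + 'static>(&mut self, f: F) {
        // Leak the closure behind a raw pointer; the paired trampoline
        // remembers F's concrete type so no downcast is ever needed.
        let data = Box::into_raw(Box::new(f)) as *const ();
        self.subs.push(Subscriber { data, call: call_trampoline::<E, F> });
    }

    fn emit(&self, event: &E) {
        // One indirect call per subscriber and no runtime type check:
        // every entry in `subs` was built for this E by construction.
        for s in &self.subs {
            unsafe { (s.call)(s.data, event as *const E as *const ()) };
        }
    }
}

fn main() {
    let mut bus = TypedBus::<u32>::new();
    bus.on(|e: &u32| println!("got {e}"));
    bus.emit(&7);
}
```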
From the PR’s A/B run (sudo chrt -r 50 taskset -c 6, pre-PR saved as Criterion baseline, post-PR rerun against it):
| N | ZST event | small payload | large payload |
|---|---|---|---|
| 1 | 59 → 46 ns (noise) | 65 → 61 ns (noise) | 126 → 114 ns (noise) |
| 10 | 122 → 56 ns −55 % | 141 → 78 ns −50 % | 165 → 144 ns −39 % |
| 100 | 895 → 290 ns −58 % | 819 → 493 ns −61 % | 796 → 427 ns (noise) |
| 1000 | 6.87 → 3.62 µs −56 % | 8.65 → 4.37 µs −29 % | — |
All rows with a −% delta are p < 0.05; (noise) = the confidence interval straddles zero.
Re-running the cross-language suite against the new rt-events confirms the magnitude:
| Scenario | Pre-PR | Post-PR |
|---|---|---|
| emit_zst/1 | 47 ns | 16 ns |
| emit_zst/10 | 152 ns | 23 ns |
| emit_zst/100 | 836 ns | 152 ns |
| emit_zst/1000 | 8,147 ns | 1,289 ns |
| emit_no_subs | 3 ns | 1 ns |
| throughput_zst/1M | 10.6 M ops/s | 34.7 M ops/s |
The miss paths were untouched by the PR (that code never entered the dyn Fn indirection); their ~1 ns / ~13 ns figures are what they always were.
Subscriber grew from 24 B to 32 B per subscription. Two unsafe blocks in the trampoline bodies and one at the dispatch site, each with a SAFETY: comment referencing the proof. Public API unchanged. All 13 unit tests and 3 doctests pass, plus two new tests covering the drop path.
The puzzle
Here is the current hot loop, post-trampoline, from objdump -d on the release binary:
```asm
.LBB_loop:
    mov  0x10(%r15), %rdi   ; load sub.data → arg0
    mov  %rbx, %rsi         ; &event → arg1
    call *(%r15)            ; sub.call (indirect)
    add  $0x20, %r15        ; advance — sizeof(Subscriber) = 32 B
    cmp  %r14, %r15         ; end of vec?
    jne  .LBB_loop
```
Six instructions per subscriber. One indirect call. %r15 walks a contiguous Vec<Subscriber> (prefetcher-friendly). The call target is branch-predictable when subscribers share the same F. This is as tight as a dynamic dispatch loop gets on x86-64.
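The 32 B stride is easy to sanity-check on a 64-bit target: four pointer-sized fields. A standalone layout check (field names mirror the `Subscriber` struct shown earlier, but this is not the crate's own code):

```rust
// Layout check for the stride the loop depends on: one data pointer,
// two fn pointers, one u64 id, i.e. 4 × 8 B = 32 B on x86-64.
#[allow(dead_code)]
struct Subscriber {
    data: *const (),
    call: unsafe fn(*const (), *const ()),
    id: u64,
    drop: unsafe fn(*const ()),
}

fn main() {
    // 0x20 = 32: matches the per-subscriber cursor advance in the loop.
    assert_eq!(std::mem::size_of::<Subscriber>(), 32);
    assert_eq!(std::mem::align_of::<Subscriber>(), 8);
    println!("ok");
}
```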
I have tried several things to make it smaller. None worked:
- Pack `Subscriber` tighter. It’s 32 B today (two fn pointers, one data pointer, one u64 id). Drop the id if you don’t care about cancelable subscriptions and it’s 24 B. Going below that means SoA layout, which trades one cache line of code for two cache misses on dispatch.
- Batch by concrete `F`. Group subscribers with the same closure type into separate buckets and iterate with a direct (non-indirect) call per bucket. But rt-events’ contract is “same event type, heterogeneous handlers” — you can register 10 different closures for `Hit`, each capturing its own state. Batching by `F` breaks that.
- JIT the dispatch table. Emit a specialized trampoline with inlined direct calls at subscribe time via `cranelift`. Works in principle, feels like cheating in a 200-LOC library, adds a JIT dep bigger than the library itself.
- Turn `call *(%r15)` into `call rel32`. The indirect call is the cost. Well-predicted indirect calls are ~1 cycle throughput on modern x86, but that’s still a cycle per subscriber — and you can’t turn it into a direct call without knowing the target at compile time.
I don’t see how to get below six instructions without sacrificing the API (the typed, heterogeneous-handler contract) or reaching for a JIT.
If you can, please open a PR. The crate is rt-events, 200 lines of src/. The bench is rt-event-benches. The baseline to beat is emit_zst/1000 at 1,289 ns on an AMD Ryzen 5 5600X with the performance governor and taskset -c 2 pinning. If you get it meaningfully lower without breaking the typed-handler API, I’ll merge it and owe you a beer.
Submissions via PR; explanations welcome as issues. Nerd-snipe gladly received.