
Orión

notes / code / chaos

I accidentally made the fastest event system in the world

This is the story of rt-events, a ~200-LOC Rust crate I wrote because I wanted a nicely-typed in-process pub/sub. I did not set out to make it fast. I benchmarked it against ten other languages out of idle curiosity and it came out ahead of a monomorphized C++ game-dev library by roughly 4× at sub-fanout dispatch. The hot loop is six instructions. I cannot figure out how to make it go any faster. If you can, the puzzle is at the bottom.

Everything here is synchronous, single-threaded, same-process. No async brokers, no Kafka, no Redis. Fairness protocol: METHODOLOGY.md. Every number in this post cites a row in results/all.csv; if you disagree, the row is the source of truth.

What I actually wanted

In most languages, “publish/subscribe” means strings:

emitter.on("hit", (payload) => { /* was the field .damage? .Damage? .DAMAGE? */ })

The compiler has no idea what shape payload is. If you rename a field in the emitter, the callbacks silently break. If you typo the topic name, the handler never fires and nothing tells you. That is fine for ad-hoc scripts, and it is how every event emitter I had ever used worked.

I wanted the type system to do the work:

bus.on::<Hit>(|e: &Hit| {               // e is definitely &Hit — compiler says so.
    println!("took {} damage", e.damage);
});

bus.emit(Hit { damage: 7 });            // payload shape is Hit, or this is a type error.

That was the whole pitch. Type-checked dispatch. If the handler signature doesn’t match the emit site, the program doesn’t compile. Performance was not a goal — it was assumed to be “fine for an interactive simulation,” whatever that meant.

I wrote the obvious implementation in an afternoon:

HashMap<TypeId, Vec<Box<dyn Fn(&dyn Any)>>>

TypeId::of::<E>() as the map key, handlers as Box<dyn Fn(&dyn Any)> in a Vec, downcast at the handler boundary. Two hundred lines. Zero dependencies. Shipped.
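For concreteness, here is a minimal sketch of that "obvious implementation" shape — type-erased handlers keyed by TypeId, downcast at the call boundary. The names (`Bus`, `on`, `emit`) mirror the post's examples but are illustrative, not the crate's actual source:

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

// Handlers are stored type-erased; each bucket holds every handler
// registered for one concrete event type.
struct Bus {
    subscribers: HashMap<TypeId, Vec<Box<dyn Fn(&dyn Any)>>>,
}

impl Bus {
    fn new() -> Self {
        Bus { subscribers: HashMap::new() }
    }

    fn on<E: 'static>(&mut self, handler: impl Fn(&E) + 'static) {
        self.subscribers
            .entry(TypeId::of::<E>())
            .or_default()
            .push(Box::new(move |any: &dyn Any| {
                // Checked at runtime, but by construction every handler in
                // this bucket was registered for E — this is the redundant
                // check the later PR removes.
                if let Some(e) = any.downcast_ref::<E>() {
                    handler(e);
                }
            }));
    }

    fn emit<E: 'static>(&self, event: E) {
        if let Some(handlers) = self.subscribers.get(&TypeId::of::<E>()) {
            for h in handlers {
                h(&event);
            }
        }
    }
}
```

If the handler's parameter type and the emit site's payload type disagree, the `on::<E>` call fails to compile — the whole pitch in ~30 lines.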

Then I benchmarked it, for fun

I wrote a Criterion bench just to see if it was “fast enough.” The numbers were… surprisingly good. Tens of nanoseconds per single-subscriber dispatch. Single-digit nanoseconds for the no-subscriber miss path. I assumed that was what a competent compiler did for any small library and moved on.

Then the question crept in: is this actually good, or just good relative to my expectations? The only way to answer it was to run the same benches against every in-process event library I could get my hands on.

So I built rt-event-benches: 11 languages, one 14-scenario suite (spec: SPEC.md). Idiomatic libraries only — the thing an engineer in that ecosystem would actually reach for. C++ entt::dispatcher, Go asaskevich/EventBus, Node eventemitter3, CPython blinker, Ruby wisper, plus Java Guava, C# MediatR, Kotlin SharedFlow, Swift Combine, Elixir Registry.

Post-PR rt-events vs. the competitors (N=1 unless noted, same machine, same session, performance governor, taskset -c 2):

| Scenario | rt-events | C++ entt | Go EventBus | Node ee3 | CPython blinker | Ruby Wisper |
|---|---|---|---|---|---|---|
| emit_zst/1 | 16 ns | 32 ns | 838 ns | 3,188 ns | 4,522 ns | 52,392 ns |
| emit_zst/1000 | 1,289 ns | 5,724 ns | | | | |
| emit_no_subs | 1 ns | 12 ns | 201 ns | 960 ns | 489 ns | 21,755 ns |
| throughput_zst/1M (ops/sec) | 34.7 M | 18.2 M | 114 k | 1.8 M | n/a | n/a |

That was not the expected outcome. rt-events beats a monomorphized template-heavy C++ library on every dispatch scenario at N ≥ 10, and it beats every dynamic-language competitor by two-to-four orders of magnitude.

Five other entries (Java/Guava, C#/MediatR, Kotlin/SharedFlow, Swift/Combine, Elixir/Registry) are scaffolded and build clean, but on the bench host either the runtime was missing or the harness timed out; their rows appear in all.csv with unit=n/a. A future run on a warmer host will fill them in.

It took me two bugs to believe the C++ numbers

The first version of my C++ bench came back with emit_zst/1000 at 18 ns, less than the 1-subscriber number. That’s not physics, it’s a bug. Two bugs, actually, stacked:

  1. entt’s sink deduplicates on (candidate, payload). Look at sigh.hpp:409: connect<Candidate>(payload...) first calls disconnect<Candidate>(payload...). Registering the same free function 1000 times yields exactly one subscriber. I was measuring N=1 dispatch the whole time and calling it N=1000. Fix: connect<&handler_p>(payloads[i]) with a distinct payload address per registration gives entt N genuinely distinct delegates.

  2. My throughput normalizer double-counted iterations. google/benchmark reports real_time per state iteration (not total); I was multiplying by iteration count on top of that, inflating reported ops/sec by ~15,000×. Fix: use google/benchmark’s own items_per_second, which it computes correctly from state.SetItemsProcessed.
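The second bug is pure arithmetic, so it is worth spelling out. A hedged sketch with hypothetical numbers (the 15,000 iteration count is illustrative): google/benchmark's reported time is the mean *per state iteration*, so ops/sec is items-per-iteration divided by that mean — nothing else.

```rust
// Correct normalization: `mean_secs_per_iteration` is already averaged over
// the bench loop, so the iteration count must not appear in the formula.
fn ops_per_sec(items_per_iteration: f64, mean_secs_per_iteration: f64) -> f64 {
    items_per_iteration / mean_secs_per_iteration
}

// The buggy harness multiplied by the iteration count on top of the
// per-iteration mean, inflating the result by exactly that factor.
fn buggy_ops_per_sec(items: f64, mean: f64, iterations: f64) -> f64 {
    iterations * items / mean
}
```

With a bench loop that ran ~15,000 iterations, the buggy number is ~15,000× the correct one — which is the magnitude of inflation described above.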

Both bugs were in my harness, not in entt. entt’s numbers scale exactly how you’d expect from a well-written monomorphized library. And rt-events still wins.

The lesson isn’t “I had bugs” — it’s that a cross-language bench is really a consistency check on your own harness. If library A scales linearly and library B’s numbers do something impossible, your B harness is the one lying to you.

Where rt-events wins

Sub-fanout dispatch at every N ≥ 10.

| N | rt-events | entt | ratio |
|---|---|---|---|
| 1 | 16 ns | 32 ns | rt-events 2.0× |
| 10 | 23 ns | 91 ns | rt-events 4.0× |
| 100 | 152 ns | 630 ns | rt-events 4.1× |
| 1000 | 1,289 ns | 5,724 ns | rt-events 4.4× |

entt’s hot loop is a tight call *0x8(%rax) through its delegate table. rt-events’ is call *(%r15) through the trampoline pair (details below). Same shape. The gap is in the per-subscriber overhead: entt’s delegate has a short prologue that conditionally dispatches on whether it holds a free fn, free fn + payload, or member fn — a branch on every call. rt-events’ trampoline is one shape always.

Empty-path dispatch (emit_no_subs). rt-events returns in 1 ns when nothing is subscribed for the event type. entt takes 12 ns. Go 200, Node 960, Ruby 22 µs. This matters for “high-frequency telemetry no one happens to be listening to,” which is more common than people admit.

Tail latency. rt-events’ stddev stays within small single digits of the median across every dispatch scenario. Node’s p99 for emit_zst/1 is 15 ms against a 3.2 µs median — a 4,700× gap, explained entirely by V8’s stop-the-world GC landing on the bench path. A GC-free runtime doesn’t have that failure mode to begin with.

Where rt-events loses

entt wins exactly two scenarios, and they point at the same thing:

| scenario | rt-events | entt | winner |
|---|---|---|---|
| emit_type_miss (100 wrong-type subs, emit 1) | 13 ns | 9 ns | entt 1.4× |
| emit_with_10_types (10 types registered, emit 1) | 24 ns | 9 ns | entt 2.7× |

entt indexes its sighs in a type-parameterized container — dispatcher.trigger<T>() resolves at compile time to a bucket lookup. rt-events hashes TypeId::of::<E>() into a HashMap. At a single registered type the two look identical; once the bus holds 10 types, the hash work shows up. A Rust implementation keyed by a const-eval’d array index would close this gap, at the cost of some API awkwardness around const generics.
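To make the alternative concrete, here is a hedged sketch of the array-indexed design hinted at above: each event type carries a compile-time bucket index, so lookup is a slice index instead of a TypeId hash. The trait and indices are illustrative — rt-events does not ship this, and the handler representation is elided:

```rust
// Each event type declares its bucket at compile time; dispatch resolves
// E::INDEX to a constant during monomorphization, the way entt's
// dispatcher.trigger<T>() resolves to a bucket at compile time.
trait Event {
    const INDEX: usize;
}

struct Tick;
impl Event for Tick { const INDEX: usize = 0; }

#[allow(dead_code)]
struct Hit { damage: u32 }
impl Event for Hit { const INDEX: usize = 1; }

struct ArrayBus {
    // One bucket per known event type. Handlers are left type-erased here
    // for brevity; only the lookup is the point of this sketch.
    buckets: Vec<Vec<Box<dyn Fn(*const ())>>>,
}

impl ArrayBus {
    fn bucket_of<E: Event>(&self) -> &Vec<Box<dyn Fn(*const ())>> {
        // No hashing: a constant index plus one bounds check.
        &self.buckets[E::INDEX]
    }
}
```

The awkwardness is exactly the one named above: someone has to assign the indices, which in real code means const generics or a registration macro leaking into the public API.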

CPython. blinker under CPython 3.14 dispatches a single-sub ZST emit in ~4.5 µs — roughly 300× slower than rt-events. That’s the language, not the library. PyPy would close most of this gap; PyPy wasn’t on the bench host for this run.

Where the comparison is unfair, and I say so

Stringly-typed dispatchers pay hash-string cost I don’t. rt-events’ TypeId::of::<E>() is one instruction — a mov of a compile-time constant. eventemitter3, asaskevich/EventBus, Wisper, and blinker all hash the topic string on every publish. That’s not a bug in those libraries; it’s their contract. A fair like-for-like would be to a Rust crate with stringly-typed events; none of the popular ones fit.

Empty handler bodies can still be partially DCE’d. Both rt-events and entt compile handler bodies in a separate translation unit (Criterion does this implicitly; the C++ build uses a separate handlers.cpp with no LTO), and both use black_box / DoNotOptimize on observable payload fields. That stops cross-TU inlining. The emit_zst numbers are directionally right but shouldn’t be read as absolute per-instruction costs. emit_small (4 B payload) and emit_large (~80 B heap) are the discriminating scenarios.
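As a minimal illustration of that guard (the `Hit` type and handler name are from the post's examples, not the bench source): routing one observable payload field through black_box pins the handler body as "used" without adding measurable work.

```rust
use std::hint::black_box;

struct Hit { damage: u32 }

// An empty-bodied bench handler is a dead-code-elimination candidate once
// inlined; black_box forces the compiler to materialize e.damage while the
// call itself compiles to (nearly) nothing.
fn bench_handler(e: &Hit) {
    black_box(e.damage);
}
```

DoNotOptimize plays the same role on the google/benchmark side.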

Elixir Registry routes every dispatch through a process mailbox. It’s measured in microseconds, not nanoseconds, because it’s doing a different thing — cross-process pub/sub with isolation. Not apples to apples.

Kotlin MutableSharedFlow wraps every emit in coroutine scheduling. For a sync-single-threaded bus comparison, you’re paying for a feature you don’t use.

The PR that bought ~2×

Running my own suite against rt-events surfaced that rt-events had a problem too. emit_zst/1000 was 6.87 µs pre-PR. perf annotate pinned the indirect call through the Box<dyn Fn> vtable at ~48% of the hot-path time.

The original dispatch path:

emit::<E>
  → for each Box<dyn Fn(&dyn Any)> in subscribers[TypeId::of::<E>()]:
      → call through Box vtable              (indirect call #1)
      → inside the closure, any.downcast_ref::<E>()
          → Any::type_id() via vtable        (indirect call #2)
          → compare with TypeId::of::<E>()
          → reinterpret the pointer
      → user callback runs

Two indirect calls per subscriber, plus a runtime type check I already knew the answer to — the outer HashMap is keyed by TypeId, so every handler at subscribers[TypeId::of::<E>()] is known-for-E by construction. perf record -F 4999 --call-graph fp on emit_zst/1000 broke the time down as:

| % self | symbol | what |
|---|---|---|
| 43.7 % | EventBus::emit::<Tick> | dispatch loop |
| 41.1 % | on::{closure} wrapper | the Box<dyn Fn> body, running downcast_ref::<E>() |
| 14.4 % | <Tick as Any>::type_id | vtable call inside downcast_ref |

PR #1 replaces the boxed dyn Fn with a (data, fn_ptr) trampoline pair:

struct Subscriber {
    data: *const (),                            // Box::<F>::into_raw
    call: unsafe fn(*const (), *const ()),      // call_trampoline::<E, F>
    id:   SubscriptionId,
    drop: unsafe fn(*const ()),                 // drop_trampoline::<F>
}

unsafe fn call_trampoline<E: 'static, F: Fn(&E)>(data: *const (), event: *const ()) {
    let f = unsafe { &*(data as *const F) };
    let e = unsafe { &*(event as *const E) };
    f(e);
}

Each trampoline is monomorphized at subscribe time for the concrete closure type F and event type E. At dispatch, we already know the type (vec index), so the downcast_ref is provably redundant — gone. Safety is discharged by a single invariant on the subscriber vec, proven in docs/internal/trampoline.md.
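A minimal standalone sketch of that shape — how subscribe erases the closure into the (data, fn_ptr) pair and how dispatch drives it. The free functions here are illustrative, not the crate's source; SubscriptionId, the drop trampoline, and deallocation are elided (this version leaks the boxed closures):

```rust
struct Subscriber {
    data: *const (),                        // Box::<F>::into_raw, type-erased
    call: unsafe fn(*const (), *const ()),  // call_trampoline::<E, F>
}

// Monomorphized once per (event type, closure type) pair at subscribe time.
unsafe fn call_trampoline<E: 'static, F: Fn(&E)>(data: *const (), event: *const ()) {
    let f = unsafe { &*(data as *const F) };
    let e = unsafe { &*(event as *const E) };
    f(e);
}

fn subscribe<E: 'static, F: Fn(&E) + 'static>(subs: &mut Vec<Subscriber>, f: F) {
    subs.push(Subscriber {
        data: Box::into_raw(Box::new(f)) as *const (),
        call: call_trampoline::<E, F>,
    });
}

fn emit<E: 'static>(subs: &[Subscriber], event: &E) {
    for s in subs {
        // SAFETY (sketch): the invariant is that every Subscriber in this
        // vec was built by subscribe::<E, _>, so s.data points to some
        // F: Fn(&E) and s.call is call_trampoline::<E, F> for that same F.
        // No downcast, one indirect call per subscriber.
        unsafe { (s.call)(s.data, event as *const E as *const ()) };
    }
}
```

The dispatch loop body is now one load, one argument move, and one indirect call — the shape the objdump listing below the fold shows.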

From the PR’s A/B run (sudo chrt -r 50 taskset -c 6, pre-PR saved as Criterion baseline, post-PR rerun against it):

| N | ZST event | small payload | large payload |
|---|---|---|---|
| 1 | 59 → 46 ns (noise) | 65 → 61 ns (noise) | 126 → 114 ns (noise) |
| 10 | 122 → 56 ns −55 % | 141 → 78 ns −50 % | 165 → 144 ns −39 % |
| 100 | 895 → 290 ns −58 % | 819 → 493 ns −61 % | 796 → 427 ns (noise) |
| 1000 | 6.87 → 3.62 µs −56 % | 8.65 → 4.37 µs −29 % | |

All rows with a stated percentage are p < 0.05; (noise) = the CI straddles zero.

Re-running the cross-language suite against the new rt-events confirms the magnitude:

| Scenario | Pre-PR | Post-PR |
|---|---|---|
| emit_zst/1 | 47 ns | 16 ns |
| emit_zst/10 | 152 ns | 23 ns |
| emit_zst/100 | 836 ns | 152 ns |
| emit_zst/1000 | 8,147 ns | 1,289 ns |
| emit_no_subs | 3 ns | 1 ns |
| throughput_zst/1M | 10.6 M ops/s | 34.7 M ops/s |

The miss paths were untouched by the PR (that code never entered the dyn Fn indirection); their ~1 ns / ~13 ns figures are what they always were.

Subscriber grew from 24 B to 32 B per subscription. Two unsafe blocks in the trampoline bodies and one at the dispatch site, each with a SAFETY: comment referencing the proof. Public API unchanged. All 13 unit tests and 3 doctests pass, plus two new tests covering the drop path.

The puzzle

Here is the current hot loop, post-trampoline, from objdump -d on the release binary:

.LBB_loop:
    mov    0x10(%r15), %rdi    ; load sub.data        → arg0
    mov    %rbx, %rsi           ; &event              → arg1
    call   *(%r15)              ; sub.call (indirect)
    add    $0x20, %r15           ; advance — sizeof(Subscriber) = 32 B
    cmp    %r14, %r15            ; end of vec?
    jne    .LBB_loop

Six instructions per subscriber. One indirect call. %r15 walks a contiguous Vec<Subscriber> (prefetcher-friendly). The call target is branch-predictable when subscribers share the same F. This is as tight as a dynamic dispatch loop gets on x86-64.

I have tried several things to make it smaller. None worked:

  • Pack Subscriber tighter. It’s 32 B today (two fn pointers, one data pointer, one u64 id). Drop the id if you don’t care about cancelable subscriptions and it’s 24 B. Going below that means SoA layout, which trades one cache line of code for two cache misses on dispatch.
  • Batch by concrete F. Group subscribers with the same closure type into separate buckets and iterate with a direct (non-indirect) call per bucket. But rt-events’ contract is “same event type, heterogeneous handlers” — you can register 10 different closures for Hit, each capturing its own state. Batching by F breaks that.
  • JIT the dispatch table. Emit a specialized trampoline with inlined direct calls at subscribe time via cranelift. Works in principle, feels like cheating in a 200-LOC library, adds a JIT dep bigger than the library itself.
  • Turn call *(%r15) into call rel32. The indirect call is the cost. Well-predicted indirect calls are ~1 cycle throughput on modern x86, but that’s still a cycle per subscriber — and you can’t turn it into a direct call without knowing the target at compile time.

I don’t see how to get below six instructions without sacrificing the API (the typed, heterogeneous-handler contract) or reaching for a JIT.

If you can, please open a PR. The crate is rt-events, 200 lines of src/. The bench is rt-event-benches. The baseline to beat is emit_zst/1000 at 1,289 ns on an AMD Ryzen 5 5600X with the performance governor and taskset -c 2 pinning. If you get it meaningfully lower without breaking the typed-handler API, I’ll merge it and owe you a beer.

Submissions via PR; explanations welcome as issues. Nerd-snipe gladly received.