
Orión

notes / code / chaos

I accidentally made the fastest event system in the world

This is the story of rt-events, a ~200-LOC Rust crate I wrote because I wanted a nicely-typed in-process pub/sub. I did not set out to make it fast. I benchmarked it against ten other languages out of idle curiosity and it came out ahead of a monomorphized C++ game-dev library by roughly 4× at sub-fanout dispatch. The hot loop is six instructions. I cannot figure out how to make it go any faster. If you can, the puzzle is at the bottom.

Everything here is synchronous, single-threaded, same-process. No async brokers, no Kafka, no Redis. Fairness protocol: METHODOLOGY.md. Every number in this post cites a row in results/all.csv; if you disagree, the row is the source of truth.

What I actually wanted

In most languages, “publish/subscribe” means strings:

emitter.on("hit", (payload) => { /* was the field .damage? .Damage? .DAMAGE? */ })

The compiler has no idea what shape payload is. If you rename a field in the emitter, the callbacks silently break. If you typo the topic name, the handler never fires and nothing tells you. That is fine for ad-hoc scripts, and it is how every event emitter I had ever used worked.

I wanted the type system to do the work:

bus.on::<Hit>(|e: &Hit| {               // e is definitely &Hit — compiler says so.
    println!("took {} damage", e.damage);
});

bus.emit(Hit { damage: 7 });            // payload shape is Hit, or this is a type error.

That was the whole pitch. Type-checked dispatch. If the handler signature doesn’t match the emit site, the program doesn’t compile. Performance was not a goal — it was assumed to be “fine for an interactive simulation,” whatever that meant.

I wrote the obvious implementation in an afternoon:

HashMap<TypeId, Vec<Box<dyn Fn(&dyn Any)>>>

TypeId::of::<E>() as the map key, handlers as Box<dyn Fn(&dyn Any)> in a Vec, downcast at the handler boundary. Two hundred lines. Zero dependencies. Shipped.
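For concreteness, here is a minimal sketch of that "obvious implementation" shape — type-erased handlers keyed by TypeId, downcast at the call boundary. The names (`Bus`, `on`, `emit`) mirror the post's examples but are illustrative, not the crate's actual source:

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

// Handlers are stored type-erased; each bucket holds every handler
// registered for one concrete event type.
struct Bus {
    subscribers: HashMap<TypeId, Vec<Box<dyn Fn(&dyn Any)>>>,
}

impl Bus {
    fn new() -> Self {
        Bus { subscribers: HashMap::new() }
    }

    fn on<E: 'static>(&mut self, handler: impl Fn(&E) + 'static) {
        self.subscribers
            .entry(TypeId::of::<E>())
            .or_default()
            .push(Box::new(move |any: &dyn Any| {
                // Checked at runtime, but by construction every handler in
                // this bucket was registered for E — this is the redundant
                // check the later PR removes.
                if let Some(e) = any.downcast_ref::<E>() {
                    handler(e);
                }
            }));
    }

    fn emit<E: 'static>(&self, event: E) {
        if let Some(handlers) = self.subscribers.get(&TypeId::of::<E>()) {
            for h in handlers {
                h(&event);
            }
        }
    }
}
```

If the handler's parameter type and the emit site's payload type disagree, the `on::<E>` call fails to compile — the whole pitch in ~30 lines.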

Then I benchmarked it, for fun

I wrote a Criterion bench just to see if it was “fast enough.” The numbers were… surprisingly good. Tens of nanoseconds per single-subscriber dispatch. Single-digit nanoseconds for the no-subscriber miss path. I assumed that was what a competent compiler did for any small library and moved on.

Then the question crept in: is this actually good, or just good relative to my expectations? The only way to answer it was to run the same benches against every in-process event library I could get my hands on.

So I built rt-event-benches: 11 languages, one 14-scenario suite (spec: SPEC.md). Idiomatic libraries only — the thing an engineer in that ecosystem would actually reach for. C++ entt::dispatcher, Go asaskevich/EventBus, Node eventemitter3, CPython blinker, Ruby wisper, plus Java Guava, C# MediatR, Kotlin SharedFlow, Swift Combine, Elixir Registry.

Post-PR rt-events vs. the competitors (N=1 unless noted, same machine, same session, performance governor, taskset -c 2):

| Scenario | rt-events | C++ entt | Go EventBus | Node ee3 | CPython blinker | Ruby Wisper |
|---|---|---|---|---|---|---|
| emit_zst/1 | 16 ns | 32 ns | 838 ns | 3,188 ns | 4,522 ns | 52,392 ns |
| emit_zst/1000 | 1,289 ns | 5,724 ns | | | | |
| emit_no_subs | 1 ns | 12 ns | 201 ns | 960 ns | 489 ns | 21,755 ns |
| throughput_zst/1M (ops/sec) | 34.7 M | 18.2 M | 114 k | 1.8 M | n/a | n/a |

That was not the expected outcome. rt-events beats a monomorphized template-heavy C++ library on every dispatch scenario at N ≥ 10, and it beats every dynamic-language competitor by two-to-four orders of magnitude.

Five other entries (Java/Guava, C#/MediatR, Kotlin/SharedFlow, Swift/Combine, Elixir/Registry) are scaffolded and build clean, but on the bench host either the runtime was missing or the harness timed out; their rows appear in all.csv with unit=n/a. A future run on a warmer host will fill them in.

It took me two bugs to believe the C++ numbers

The first version of my C++ bench came back with emit_zst/1000 at 18 ns, less than the 1-subscriber number. That’s not physics, it’s a bug. Two bugs, actually, stacked:

  1. entt’s sink deduplicates on (candidate, payload). Look at sigh.hpp:409: connect<Candidate>(payload...) first calls disconnect<Candidate>(payload...). Registering the same free function 1000 times yields exactly one subscriber. I was measuring N=1 dispatch the whole time and calling it N=1000. Fix: connect<&handler_p>(payloads[i]) with a distinct payload address per registration gives entt N genuinely distinct delegates.

  2. My throughput normalizer double-counted iterations. google/benchmark reports real_time per state iteration (not total); I was multiplying by iteration count on top of that, inflating reported ops/sec by ~15,000×. Fix: use google/benchmark’s own items_per_second, which it computes correctly from state.SetItemsProcessed.
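The second bug is pure arithmetic, so it is worth spelling out. A hedged sketch with hypothetical numbers (the 15,000 iteration count is illustrative): google/benchmark's reported time is the mean *per state iteration*, so ops/sec is items-per-iteration divided by that mean — nothing else.

```rust
// Correct normalization: `mean_secs_per_iteration` is already averaged over
// the bench loop, so the iteration count must not appear in the formula.
fn ops_per_sec(items_per_iteration: f64, mean_secs_per_iteration: f64) -> f64 {
    items_per_iteration / mean_secs_per_iteration
}

// The buggy harness multiplied by the iteration count on top of the
// per-iteration mean, inflating the result by exactly that factor.
fn buggy_ops_per_sec(items: f64, mean: f64, iterations: f64) -> f64 {
    iterations * items / mean
}
```

With a bench loop that ran ~15,000 iterations, the buggy number is ~15,000× the correct one — which is the magnitude of inflation described above.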

Both bugs were in my harness, not in entt. entt’s numbers scale exactly how you’d expect from a well-written monomorphized library. And rt-events still wins.

The lesson isn’t “I had bugs” — it’s that a cross-language bench is really a consistency check on your own harness. If library A scales linearly and library B’s numbers do something impossible, your B harness is the one lying to you.

Where rt-events wins

Sub-fanout dispatch at every N ≥ 10.

| N | rt-events | entt | ratio |
|---|---|---|---|
| 1 | 16 ns | 32 ns | rt-events 2.0× |
| 10 | 23 ns | 91 ns | rt-events 4.0× |
| 100 | 152 ns | 630 ns | rt-events 4.1× |
| 1000 | 1,289 ns | 5,724 ns | rt-events 4.4× |

entt’s hot loop is a tight call *0x8(%rax) through its delegate table. rt-events’ is call *(%r15) through the trampoline pair (details below). Same shape. The gap is in the per-subscriber overhead: entt’s delegate has a short prologue that conditionally dispatches on whether it holds a free fn, free fn + payload, or member fn — a branch on every call. rt-events’ trampoline is one shape always.

Empty-path dispatch (emit_no_subs). rt-events returns in 1 ns when nothing is subscribed for the event type. entt takes 12 ns. Go 200, Node 960, Ruby 22 µs. This matters for “high-frequency telemetry no one happens to be listening to,” which is more common than people admit.

Tail latency. rt-events’ stddev stays within small single digits of the median across every dispatch scenario. Node’s p99 for emit_zst/1 is 15 ms against a 3.2 µs median — a 4,700× gap, explained entirely by V8’s stop-the-world GC landing on the bench path. A GC-free runtime doesn’t have that failure mode to begin with.

Where rt-events loses

entt wins exactly two scenarios, and they point at the same thing:

| scenario | rt-events | entt | winner |
|---|---|---|---|
| emit_type_miss (100 wrong-type subs, emit 1) | 13 ns | 9 ns | entt 1.4× |
| emit_with_10_types (10 types registered, emit 1) | 24 ns | 9 ns | entt 2.7× |

entt indexes its sighs in a type-parameterized container — dispatcher.trigger<T>() resolves at compile time to a bucket lookup. rt-events hashes TypeId::of::<E>() into a HashMap. At a single registered type the two look identical; once the bus holds 10 types, the hash work shows up. A Rust implementation keyed by a const-eval’d array index would close this gap, at the cost of some API awkwardness around const generics.
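To make the alternative concrete, here is a hedged sketch of the array-indexed design hinted at above: each event type carries a compile-time bucket index, so lookup is a slice index instead of a TypeId hash. The trait and indices are illustrative — rt-events does not ship this, and the handler representation is elided:

```rust
// Each event type declares its bucket at compile time; dispatch resolves
// E::INDEX to a constant during monomorphization, the way entt's
// dispatcher.trigger<T>() resolves to a bucket at compile time.
trait Event {
    const INDEX: usize;
}

struct Tick;
impl Event for Tick { const INDEX: usize = 0; }

#[allow(dead_code)]
struct Hit { damage: u32 }
impl Event for Hit { const INDEX: usize = 1; }

struct ArrayBus {
    // One bucket per known event type. Handlers are left type-erased here
    // for brevity; only the lookup is the point of this sketch.
    buckets: Vec<Vec<Box<dyn Fn(*const ())>>>,
}

impl ArrayBus {
    fn bucket_of<E: Event>(&self) -> &Vec<Box<dyn Fn(*const ())>> {
        // No hashing: a constant index plus one bounds check.
        &self.buckets[E::INDEX]
    }
}
```

The awkwardness is exactly the one named above: someone has to assign the indices, which in real code means const generics or a registration macro leaking into the public API.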

CPython. blinker under CPython 3.14 dispatches a single-sub ZST emit in ~4.5 µs — roughly 300× slower than rt-events. That’s the language, not the library. PyPy would close most of this gap; PyPy wasn’t on the bench host for this run.

Where the comparison is unfair, and I say so

Stringly-typed dispatchers pay hash-string cost I don’t. rt-events’ TypeId::of::<E>() is one instruction — a mov of a compile-time constant. eventemitter3, asaskevich/EventBus, Wisper, and blinker all hash the topic string on every publish. That’s not a bug in those libraries; it’s their contract. A fair like-for-like would be to a Rust crate with stringly-typed events; none of the popular ones fit.

Empty handler bodies can still be partially DCE’d. Both rt-events and entt compile handler bodies in a separate translation unit (Criterion does this implicitly; the C++ build uses a separate handlers.cpp with no LTO), and both use black_box / DoNotOptimize on observable payload fields. That stops cross-TU inlining. The emit_zst numbers are directionally right but shouldn’t be read as absolute per-instruction costs. emit_small (4 B payload) and emit_large (~80 B heap) are the discriminating scenarios.
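As a minimal illustration of that guard (the `Hit` type and handler name are from the post's examples, not the bench source): routing one observable payload field through black_box pins the handler body as "used" without adding measurable work.

```rust
use std::hint::black_box;

struct Hit { damage: u32 }

// An empty-bodied bench handler is a dead-code-elimination candidate once
// inlined; black_box forces the compiler to materialize e.damage while the
// call itself compiles to (nearly) nothing.
fn bench_handler(e: &Hit) {
    black_box(e.damage);
}
```

DoNotOptimize plays the same role on the google/benchmark side.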

Elixir Registry routes every dispatch through a process mailbox. It’s measured in microseconds, not nanoseconds, because it’s doing a different thing — cross-process pub/sub with isolation. Not apples to apples.

Kotlin MutableSharedFlow wraps every emit in coroutine scheduling. For a sync-single-threaded bus comparison, you’re paying for a feature you don’t use.

The PR that bought ~2×

Running my own suite against rt-events surfaced that rt-events had a problem too. emit_zst/1000 was 6.87 µs pre-PR. perf annotate pinned the indirect call through the Box<dyn Fn> vtable at ~48% of the hot-path time.

The original dispatch path:

emit::<E>
  → for each Box<dyn Fn(&dyn Any)> in subscribers[TypeId::of::<E>()]:
      → call through Box vtable              (indirect call #1)
      → inside the closure, any.downcast_ref::<E>()
          → Any::type_id() via vtable        (indirect call #2)
          → compare with TypeId::of::<E>()
          → reinterpret the pointer
      → user callback runs

Two indirect calls per subscriber, plus a runtime type check I already knew the answer to — the outer HashMap is keyed by TypeId, so every handler at subscribers[TypeId::of::<E>()] is known-for-E by construction. perf record -F 4999 --call-graph fp on emit_zst/1000 broke the time down as:

| % self | symbol | what |
|---|---|---|
| 43.7 % | EventBus::emit::<Tick> | dispatch loop |
| 41.1 % | on::{closure} wrapper | the Box<dyn Fn> body, running downcast_ref::<E>() |
| 14.4 % | <Tick as Any>::type_id | vtable call inside downcast_ref |

PR #1 replaces the boxed dyn Fn with a (data, fn_ptr) trampoline pair:

struct Subscriber {
    data: *const (),                            // Box::<F>::into_raw
    call: unsafe fn(*const (), *const ()),      // call_trampoline::<E, F>
    id:   SubscriptionId,
    drop: unsafe fn(*const ()),                 // drop_trampoline::<F>
}

unsafe fn call_trampoline<E: 'static, F: Fn(&E)>(data: *const (), event: *const ()) {
    let f = unsafe { &*(data as *const F) };
    let e = unsafe { &*(event as *const E) };
    f(e);
}

Each trampoline is monomorphized at subscribe time for the concrete closure type F and event type E. At dispatch, we already know the type (vec index), so the downcast_ref is provably redundant — gone. Safety is discharged by a single invariant on the subscriber vec, proven in docs/internal/trampoline.md.
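A minimal standalone sketch of that shape — how subscribe erases the closure into the (data, fn_ptr) pair and how dispatch drives it. The free functions here are illustrative, not the crate's source; SubscriptionId, the drop trampoline, and deallocation are elided (this version leaks the boxed closures):

```rust
struct Subscriber {
    data: *const (),                        // Box::<F>::into_raw, type-erased
    call: unsafe fn(*const (), *const ()),  // call_trampoline::<E, F>
}

// Monomorphized once per (event type, closure type) pair at subscribe time.
unsafe fn call_trampoline<E: 'static, F: Fn(&E)>(data: *const (), event: *const ()) {
    let f = unsafe { &*(data as *const F) };
    let e = unsafe { &*(event as *const E) };
    f(e);
}

fn subscribe<E: 'static, F: Fn(&E) + 'static>(subs: &mut Vec<Subscriber>, f: F) {
    subs.push(Subscriber {
        data: Box::into_raw(Box::new(f)) as *const (),
        call: call_trampoline::<E, F>,
    });
}

fn emit<E: 'static>(subs: &[Subscriber], event: &E) {
    for s in subs {
        // SAFETY (sketch): the invariant is that every Subscriber in this
        // vec was built by subscribe::<E, _>, so s.data points to some
        // F: Fn(&E) and s.call is call_trampoline::<E, F> for that same F.
        // No downcast, one indirect call per subscriber.
        unsafe { (s.call)(s.data, event as *const E as *const ()) };
    }
}
```

The dispatch loop body is now one load, one argument move, and one indirect call — the shape the objdump listing below the fold shows.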

From the PR’s A/B run (sudo chrt -r 50 taskset -c 6, pre-PR saved as Criterion baseline, post-PR rerun against it):

| N | ZST event | small payload | large payload |
|---|---|---|---|
| 1 | 59 → 46 ns (noise) | 65 → 61 ns (noise) | 126 → 114 ns (noise) |
| 10 | 122 → 56 ns −55 % | 141 → 78 ns −50 % | 165 → 144 ns −39 % |
| 100 | 895 → 290 ns −58 % | 819 → 493 ns −61 % | 796 → 427 ns (noise) |
| 1000 | 6.87 → 3.62 µs −56 % | 8.65 → 4.37 µs −29 % | |

All rows with a stated percentage are p < 0.05; (noise) = the CI straddles zero.

Re-running the cross-language suite against the new rt-events confirms the magnitude:

| Scenario | Pre-PR | Post-PR |
|---|---|---|
| emit_zst/1 | 47 ns | 16 ns |
| emit_zst/10 | 152 ns | 23 ns |
| emit_zst/100 | 836 ns | 152 ns |
| emit_zst/1000 | 8,147 ns | 1,289 ns |
| emit_no_subs | 3 ns | 1 ns |
| throughput_zst/1M | 10.6 M ops/s | 34.7 M ops/s |

The miss paths were untouched by the PR (that code never entered the dyn Fn indirection); their ~1 ns / ~13 ns figures are what they always were.

Subscriber grew from 24 B to 32 B per subscription. Two unsafe blocks in the trampoline bodies and one at the dispatch site, each with a SAFETY: comment referencing the proof. Public API unchanged. All 13 unit tests and 3 doctests pass, plus two new tests covering the drop path.

The puzzle

Here is the current hot loop, post-trampoline, from objdump -d on the release binary:

.LBB_loop:
    mov    0x10(%r15), %rdi    ; load sub.data        → arg0
    mov    %rbx, %rsi           ; &event              → arg1
    call   *(%r15)              ; sub.call (indirect)
    add    $0x20, %r15           ; advance — sizeof(Subscriber) = 32 B
    cmp    %r14, %r15            ; end of vec?
    jne    .LBB_loop

Six instructions per subscriber. One indirect call. %r15 walks a contiguous Vec<Subscriber> (prefetcher-friendly). The call target is branch-predictable when subscribers share the same F. This is as tight as a dynamic dispatch loop gets on x86-64.

I have tried several things to make it smaller. None worked:

  • Pack Subscriber tighter. It’s 32 B today (two fn pointers, one data pointer, one u64 id). Drop the id if you don’t care about cancelable subscriptions and it’s 24 B. Going below that means SoA layout, which trades one cache line of code for two cache misses on dispatch.
  • Batch by concrete F. Group subscribers with the same closure type into separate buckets and iterate with a direct (non-indirect) call per bucket. But rt-events’ contract is “same event type, heterogeneous handlers” — you can register 10 different closures for Hit, each capturing its own state. Batching by F breaks that.
  • JIT the dispatch table. Emit a specialized trampoline with inlined direct calls at subscribe time via cranelift. Works in principle, feels like cheating in a 200-LOC library, adds a JIT dep bigger than the library itself.
  • Turn call *(%r15) into call rel32. The indirect call is the cost. Well-predicted indirect calls are ~1 cycle throughput on modern x86, but that’s still a cycle per subscriber — and you can’t turn it into a direct call without knowing the target at compile time.

I don’t see how to get below six instructions without sacrificing the API (the typed, heterogeneous-handler contract) or reaching for a JIT.

If you can, please open a PR. The crate is rt-events, 200 lines of src/. The bench is rt-event-benches. The baseline to beat is emit_zst/1000 at 1,289 ns on an AMD Ryzen 5 5600X with the performance governor and taskset -c 2 pinning. If you get it meaningfully lower without breaking the typed-handler API, I’ll merge it and owe you a beer.

Submissions via PR; explanations welcome as issues. Nerd-snipe gladly received.