Your benchmark says it's fast. My benchmark asks where it breaks.
A “fast” event bus in a happy-path microbench tells you nothing about how the library falls apart when you push it. I wrote a companion stress suite for rt-event-benches because I wanted to know exactly where each library gives up, and what shape the failure takes.
Five languages, five libraries, five ways to fail. No library survives every axis. None of them fail the same way.
All the data and code is in the repo. The charts below come from results/stress.csv (149 rows across five libraries) and scripts/plot-stress.py. The spec is STRESS.md.
The setup
Same cast as the first post: Rust rt-events, C++ entt::dispatcher, Go asaskevich/EventBus, Node eventemitter3, Python blinker. Five stress axes, each a geometric ramp (×10 per step). The harness runs each step with taskset, timeout 90s, and /usr/bin/time -v. A crash stops the ramp; the last surviving value is the breaking point.
The five axes:
- `subscriber_count`: N handlers for one event, dispatch once. Ramp: 10 → 100M.
- `payload_size_bytes`: one event with an N-byte payload. Ramp: 8 B → 16 GiB.
- `sustained_rate`: N emits in a tight loop to 10 subs, sample p99/median. Ramp: 10k → 1B.
- `recursive_emit_depth`: handler re-emits itself D times. Ramp: 1 → 1M.
- `unsubscribe_during_dispatch`: three pass/fail probes (self, ahead, behind).
Full contract: STRESS.md.
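The ramp-and-classify loop is simple enough to sketch. Here's a minimal Python version of the contract (the real harness is bash; the function names and the signal/exit classification strings here are illustrative, not copied from the repo):

```python
import subprocess

def run_step(cmd, timeout_s=90):
    """Run one ramp step and classify its outcome the way the harness does:
    'hang' if it exceeds the timeout, a signal tag if it crashed, else 'ok'."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s,
                              stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return "hang"
    if proc.returncode < 0:
        return f"signal_{-proc.returncode}"   # e.g. signal_11 for a segfault
    return "ok" if proc.returncode == 0 else f"exit_{proc.returncode}"

def ramp(make_cmd, start, stop, factor=10, timeout_s=90):
    """Geometric ramp: run steps until one fails.
    Returns (last surviving value, breaking status)."""
    last_ok, value = None, start
    while value <= stop:
        status = run_step(make_cmd(value), timeout_s)
        if status != "ok":
            return last_ok, status
        last_ok = value
        value *= factor
    return last_ok, "ok"   # survived the whole ramp: real limit unknown
```

Something like `ramp(lambda n: ["./stress", "subscriber_count", str(n)], 10, 10**8)` would reproduce the subscriber axis; the `taskset` pinning and `/usr/bin/time -v` bookkeeping are left out.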
The death matrix
| library | subscriber_count | payload_size_bytes | sustained_rate | recursive_emit_depth | unsubscribe_during_dispatch |
|---|---|---|---|---|---|
| cpp/entt | hang @ 1M | oom @ 16G | hang @ 1B | segfault @ 100k | ok |
| go/EventBus | hang @ 100M | oom @ 16G | hang @ 10M | wrong_result (deadlock @ depth=2) | ok |
| node/EE3 | hang @ 100M | “ok” @ 16G | hang @ 100M | panic @ 10k | ok |
| python/blinker | hang @ 10M | oom @ 16G | hang @ 10M | ok @ 1M (no crash at max tested) | ok |
| rust/rt-events | ok @ 100M (no crash at max tested) | oom @ 16G | hang @ 1B | stack_overflow @ 1M | ok (statically disallowed) |
Read “ok @ N” as “completed the largest value we tested without crashing, so we don’t know where its real breaking point is.” Read “hang” as “didn’t finish in 90 seconds.” Everything else is a real crash with a specific signal or exception.
The grid is brutal: even at 149 data points, the five libraries crash in five different shapes. That’s the story.
Subscriber count: flat beats clever
This axis registers N handlers for one event and then emits once. It’s the simplest thing a dispatcher has to do, and it exposes the library’s internal data structure more directly than any other axis.
- rt-events completed 100M subscribers in 3.1 seconds and didn’t crash. The hot path is `for fn in vec { (fn.trampoline)(fn.data, &event) }` — a flat `Vec<(data_ptr, fn_ptr)>` walk. No hashing, no reflection, no GC barriers. 100M pointer chases fit in ~1.6 GiB of contiguous RAM.
- Node EE3 hung at 100M. Last completed value: 10M in 2.3 seconds. EE3 stores listeners in a JS array; at 10M entries you’re paying for V8’s backing-store growth and its GC visiting all of it.
- Go EventBus took 33 seconds at 10M and hung at 100M. Each subscriber holds a `reflect.Value` plus a name string in a map-of-slice-of-struct. Reflection is not free even when you aren’t dispatching.
- C++ entt was the most surprising: 100k registered in 1.3 ms, then a hang at 1M. `entt::sigh` stores calls in a `std::vector<delegate>`, but doing 1M `sink.connect` calls runs into reallocation cost that scaled worse than I expected.
- Python blinker hung at 10M. Blinker tracks receivers in a `WeakValueDictionary`; at that scale, the dict’s rehash + weakref book-keeping dominated.
For me the big lesson is: how your library stores subscribers is more important than how fast it dispatches them. A flat Vec or (fn_ptr, data_ptr) pair is ugly, but it wins at scale every time.
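The storage difference is easy to show in miniature. A hypothetical Python sketch of the flat-storage shape (class and method names are mine, not rt-events' API — this is the Python analogue of the `Vec<(data_ptr, fn_ptr)>` design, minus the pointers):

```python
class FlatBus:
    """Subscribers live in one flat list of (callback, data) pairs.
    Registration is an append; dispatch is a linear walk. No hashing,
    no per-subscriber lookup, no weakref bookkeeping."""
    def __init__(self):
        self._subs = []                    # flat, append-only

    def on(self, callback, data=None):
        self._subs.append((callback, data))

    def emit(self, event):
        for callback, data in self._subs:  # the entire hot path
            callback(data, event)

bus = FlatBus()
hits = []
for i in range(1000):
    bus.on(lambda data, ev, i=i: hits.append(i))
bus.emit("tick")
```

The ugly part of this design is that lookup-by-name and targeted removal get slower, which is exactly the trade the death matrix rewards: the axis that kills libraries is bulk storage, not clever retrieval.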
Payload size: Linux lied to me about Node
This axis emits one event whose payload is N bytes. The handler reads first+last byte to materialize the allocation. Ramp: 8 B → 16 GiB.
Everyone OOMs at 16 GiB — except Node, which reports ok in 0.5 ms.
node,node-ee3,payload_size_bytes,17179869184,ok,0.536668
That’s wrong. Or rather: that’s Linux, not Node. Buffer.alloc(16 GiB) creates a V8 ArrayBuffer of 16 GiB; the backing allocation is an anonymous mmap of 16 GiB of virtual memory. Linux doesn’t actually commit those pages until you touch them. Reading first+last byte only faults in 2 pages (8 KiB total), and the kernel serves every read of an untouched page from a single shared zero page.
I left the number in the CSV because it’s real (the library didn’t crash), but it’s useless as a capacity measurement. A correct payload_size_bytes stress test would have the handler write to each byte. That changes the story: it becomes a memory-bandwidth benchmark, and Rust’s zero-copy Vec move starts to show its real advantage.
Lesson: “it didn’t OOM” is not the same as “it handled 16 GiB of data.” Always commit the memory you claim to test.
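You can watch the zero-page trick happen from userspace. A Linux-only Python sketch (the 256 MiB size and the thresholds are mine) using an anonymous mmap and peak RSS:

```python
import mmap
import resource

def peak_rss_kib():
    # ru_maxrss is KiB on Linux (it's bytes on macOS; this sketch assumes Linux)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

SIZE = 256 * 1024 * 1024            # 256 MiB anonymous mapping
buf = mmap.mmap(-1, SIZE)           # virtual reservation only, nothing committed

_ = buf[0], buf[SIZE - 1]           # read first+last byte: ~2 page faults,
after_read = peak_rss_kib()         # served from the shared zero page

for off in range(0, SIZE, 4096):    # write one byte per page:
    buf[off] = 1                    # now every page must actually be committed
after_write = peak_rss_kib()

committed_kib = after_write - after_read   # roughly the full 256 MiB
```

The read pass barely moves RSS; the write pass commits the whole mapping. Scale `SIZE` up to 16 GiB and you have the Node “ok” result and its correction in one script.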
Sustained rate: the GC tax is real
This axis emits N zero-sized events to 10 subscribers in a tight loop, sampling per-emit latency at a stride that yields ~1000 samples. The breaking point is p99/median > 100 (latency decoupling) or a crash.
Here’s the p99/median ratio at each library’s largest completed value:
| library | last ok | total time | p99/median |
|---|---|---|---|
| rt-events | 100M | 9.6 s | 2.4× |
| entt | 100M | 45 s | 9.7× |
| EE3 | 10M | 10.6 s | (not recorded — single-sample axis) |
| EventBus | 1M | 9.3 s | (reflect.Call allocates every frame; GC pauses dominate) |
| blinker | 1M | 60 s | (GIL + refcount + per-send dict walk) |
rt-events’ p99/median of 2.4× at 100M events means the worst-case emit was 2.4× the median. That’s jitter at the level of the CPU itself (a branch mispredict, a cache miss): no GC, no allocator fragmentation, nothing queuing up.
entt hits 9.7× — some of that is noise; some is L3-cache line displacement as the loop runs long enough to evict things.
Go EventBus maxes out at 1M events in 9.3 seconds, two orders of magnitude slower than rt-events. Every Publish allocates a []reflect.Value for the handler arguments, and the GC eventually has to collect all of them; the p99/median ratio was already >10× at 100k events.
Python blinker does 1M events in 60 seconds. Every send() walks a WeakValueDictionary, dereferences weakrefs, acquires the GIL, and increments/decrements refcounts. Sustained-rate is where blinker’s design choices have nowhere to hide.
The shape of the chart matters more than any single number: the slope of latency-vs-rate is almost entirely GC discipline. Rust’s allocator is never called inside the loop. Go’s is called on every reflect.Call. Python’s is called on every operation.
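The sampling discipline matters too: you can't store a billion timestamps, so the harness samples at a stride. A sketch of that measurement (the stride arithmetic is mine, sized to the ~1000-sample target described above):

```python
import time

def sustained_rate_probe(emit, n_emits, target_samples=1000):
    """Emit n_emits times, timing every stride-th emit so we end up with
    roughly target_samples latency samples; return (median, p99, ratio)."""
    stride = max(1, n_emits // target_samples)
    samples = []
    for i in range(n_emits):
        if i % stride == 0:
            t0 = time.perf_counter_ns()
            emit()
            samples.append(time.perf_counter_ns() - t0)
        else:
            emit()                      # untimed emits keep the loop tight
    samples.sort()
    median = samples[len(samples) // 2]
    p99 = samples[int(len(samples) * 0.99)]
    return median, p99, p99 / median

# A no-op subscriber gives the measurement floor for this process.
median_ns, p99_ns, ratio = sustained_rate_probe(lambda: None, 100_000)
```

Timing every emit would double the loop's work and distort the thing being measured; the stride keeps the timer out of the hot path for 99.9% of iterations.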
Recursive emit depth: all five libraries fail differently
This is the axis that taught me the most.
The handler re-emits the same event with depth - 1 and stops at 0. Every library has its own answer. Ranked by survival:
- Python blinker: did not crash at 1M. Python 3.14’s stack handling survived one million nested sends in 9.3 seconds, with `sys.setrecursionlimit(10**7)`. The C stack was fine.
- rt-events: stack overflow at 1M (last ok at 100k in 4 ms).
- entt: segfault at 100k (last ok at 10k in 10 ms). Each entt frame is heavier than an rt-events frame — inlined templates put more on the stack per call.
- Node EE3: RangeError at 10k (last ok at 1k in 1 ms). V8’s default stack is small.
- Go EventBus: cannot recurse at all. `Publish` holds a non-reentrant `sync.Mutex` across the entire dispatch; any in-handler call to `Publish` deadlocks. The stress binary detects this via a 3-second watchdog and reports `wrong_result`.
Go’s result is not a bug I discovered — it’s documented behavior. But “the library will deadlock if a handler calls back in” is the kind of thing that doesn’t show up on a perf benchmark and will absolutely ruin an afternoon in production. Great fucking job.
Python beating everyone else on depth is also a library-level win: blinker keeps its per-dispatch frame tiny, and CPython 3.14’s stack design keeps deep recursion from blowing out.
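The depth probe itself is tiny. A Python sketch of the axis (the handler and probe names are mine, stand-ins for the real stress binary, not blinker's API):

```python
import sys

def handler(depth):
    """A subscriber that re-emits its own event with depth-1:
    the recursive_emit_depth axis in miniature."""
    if depth > 0:
        handler(depth - 1)

def probe_depth(emit, depths):
    """Walk the ramp; return (last surviving depth, failure mode)."""
    last_ok = 0
    for d in depths:
        try:
            emit(d)
        except RecursionError:          # Python's version of a stack overflow
            return last_ok, "recursion_error"
        last_ok = d
    return last_ok, "ok"

# A small recursion limit makes the breaking point arrive quickly;
# the real blinker run raised it to 10**7 and survived depth 1M.
sys.setrecursionlimit(5000)
result = probe_depth(handler, [10, 100, 1000, 10_000])
```

The nice property of CPython here is that the failure is a catchable exception, so the probe can report a clean breaking point; in Rust and C++ the same experiment ends in SIGSEGV, and the harness has to classify the corpse from outside.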
rt-events stack-overflows at 1M, which lines up with the default Linux main-thread stack (8 MiB). Each level of recursion is emit → trampoline → closure body → emit again, costing roughly 8 bytes of stack per frame. That’s as good as I can get without switching to a trampoline queue, which would be a semantic change (no more Rust-native recursive dispatch).
Lesson: “re-entrancy” is a spec decision your library documentation rarely makes explicit. Stress-test it before you find out at 3 a.m.
Unsubscribe during dispatch: five different correct answers
The last axis runs three probes against each library: a handler that unsubscribes itself, one that unsubscribes a later sibling (ahead of the cursor), one that unsubscribes an earlier sibling (already fired).
Nobody crashes. Everyone gives a different answer:
- rt-events (Rust): statically disallowed. `emit(&self)` takes a shared reference; `off(&mut self)` takes an exclusive one. A closure registered with `on` cannot capture `&mut bus`. The compiler rejects every version of this code. That’s the library’s answer to iterator invalidation: the class of bug is absent by construction.
- entt (C++): `sink.disconnect()` during `trigger()` uses swap-and-pop on the underlying `std::vector` inside a reverse iteration. Result: one handler may get called twice, another zero times, depending on which sibling is disconnected. Defined, but surprising.
- EE3 (Node): copy-on-write. `removeListener` allocates a new filtered array and re-assigns the map slot; the in-flight `emit` keeps iterating the original (pre-remove) array. Net effect: everyone fires in the current emit; removal takes effect on the next emit. Elegant.
- EventBus (Go): deadlock. `Unsubscribe` acquires the same non-reentrant mutex that `Publish` already holds. Same 3-second watchdog as recursion.
- blinker (Python): silently succeeds. Blinker iterates a snapshot of the receivers dict; disconnecting a receiver during dispatch just updates the dict for the next send.
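EE3's answer is the easiest one to port. A Python sketch of copy-on-write removal (class and method names are mine, modeling the behavior, not eventemitter3's API):

```python
class CowBus:
    """Copy-on-write listener list: emit iterates the array it captured at
    dispatch time; on/off build a NEW array, so any mutation only becomes
    visible on the next emit."""
    def __init__(self):
        self._listeners = []

    def on(self, fn):
        self._listeners = self._listeners + [fn]   # fresh array, never mutated

    def off(self, fn):
        self._listeners = [f for f in self._listeners if f is not fn]

    def emit(self, event):
        for fn in self._listeners:   # binds the current array object once
            fn(event)

bus = CowBus()
fired = []

def first(ev):
    fired.append("first")
    bus.off(second)          # unsubscribe a sibling that hasn't fired yet

def second(ev):
    fired.append("second")

bus.on(first)
bus.on(second)
bus.emit("probe-1")          # second still fires: the in-flight array is intact
bus.emit("probe-2")          # removal visible now: only first fires
```

The trick is that `emit` iterates the list object it started with, while `off` rebinds the attribute to a filtered copy; the in-flight iteration never sees the mutation. The cost is an allocation per subscribe/unsubscribe, which is why the pattern pairs well with emit-heavy, mutate-rarely workloads.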
Five libraries, five answers to the same question — and none of the answers are wrong. They’re design choices. “What does unsub-during-dispatch do on my library?” is not a question you want to discover empirically in production.
Breaking points overview
The bar chart shows the largest completed ramp value per library per axis. Hatched bars crashed; plain bars didn’t crash at the max we tested (we don’t know how much further they’d have gone).
If I had to pick one row that surprised me: rt-events surviving the full subscriber_count ramp (100M) without crashing. It took 3.1 seconds — not fast, but finite. The other four libraries hit a wall earlier because of how they store subscribers, not how they dispatch them.
Why you should stress-test
Perf benchmarks tell you how fast your hot path is. They don’t tell you:
- What breaks first (subscriber storage? payload allocator? reflection?)
- How it breaks (panic? segfault? silent wrong-result? deadlock?)
- At what scale (10k? 10M? 10B?)
- Whether “ok” means “handled it” or “didn’t touch it” (see the Node 16 GiB lazy-alloc case)
- What the library’s policy is on re-entrancy and iterator invalidation
Happy-path benchmarks lie by omission. They don’t distinguish between a library that does the right thing at the edge and one that deadlocks silently. Both can get 30 ns/op on the microbench.
Stress-test axes worth stealing for any pub/sub library you use:
- Max subscriber count per type
- Max payload size (with actual writes, not just allocation)
- Sustained emit rate (p99/median ratio over time)
- Recursive / re-entrant dispatch depth
- Mutation during dispatch (unsubscribe, subscribe, emit-different-event)
If you can’t tell me where your library breaks, you don’t know what you shipped.
Run it yourself
```shell
git clone https://github.com/oriongonza/rt-event-benches
cd rt-event-benches
scripts/run-stress.sh                  # all languages (takes ~30 min)
scripts/run-stress.sh rust-rt-events   # one language
TIMEOUT=30 AXIS=recursive_emit_depth scripts/run-stress.sh cpp-entt
python3 scripts/aggregate-stress.py
python3 scripts/plot-stress.py
```
Every row of results/stress.csv carries language, library, axis, value, status, latency_ms, peak_memory_mb, death_mode, runtime_version, commit, and notes. If you disagree with a classification, the row is the source of truth.
The harness (`scripts/run-stress.sh`) is ~200 lines of bash; the per-language stress binaries are each ~150 lines. Add a new library by writing a `stress.sh` that responds to `<axis> <value>` with a JSON line on stdout and exits. The five axes port cleanly to any in-process dispatcher you care to put under the same microscope.
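Adding a library looks like this in practice. A hypothetical Python adapter satisfying the `<axis> <value>` → JSON-line contract (the field names mirror the CSV columns listed above; the dispatch under test is a stub, and only one axis is wired up):

```python
import json
import time

def stress(axis: str, value: int) -> str:
    """Respond to an (axis, value) request with one JSON line, the shape the
    harness parses. The body here is a stub standing in for a real library."""
    if axis != "subscriber_count":
        return json.dumps({"axis": axis, "value": value, "status": "skip"})
    # Register N no-op subscribers, dispatch once, time the dispatch.
    subs = [(lambda ev: None) for _ in range(value)]
    t0 = time.perf_counter()
    for fn in subs:
        fn("event")
    return json.dumps({
        "axis": axis,
        "value": value,
        "status": "ok",
        "latency_ms": (time.perf_counter() - t0) * 1000.0,
    })

line = stress("subscriber_count", 10_000)
```

A real adapter would also report peak memory and catch crashes, but a status, a value, and a latency are enough for the aggregator to place the row in the death matrix.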