Orión

notes / code / chaos

Systems
Orión
author

The Death Shape

Your benchmark says it's fast. My benchmark asks where it breaks.

A “fast” event bus in a happy-path microbench tells you nothing about how the library falls apart when you push it. I wrote a companion stress suite for rt-event-benches because I wanted to know exactly where each library gives up, and what shape the failure takes.

Five languages, five libraries, five ways to fail. No library survives every axis, and no two fail the same way.

All the data and code are in the repo. The charts below come from results/stress.csv (149 rows across five libraries) and scripts/plot-stress.py. The spec is STRESS.md.

The setup

Same cast as the first post: Rust rt-events, C++ entt::dispatcher, Go asaskevich/EventBus, Node eventemitter3, Python blinker. Five stress axes, each a geometric ramp (×10 per step). The harness runs each step with taskset, timeout 90s, and /usr/bin/time -v. A crash stops the ramp; the last surviving value is the breaking point.

The five axes:

  1. subscriber_count: N handlers for one event, dispatch once. Ramp: 10 → 100M.
  2. payload_size_bytes: one event with an N-byte payload. Ramp: 8 B → 16 GiB.
  3. sustained_rate: N emits in a tight loop to 10 subs, sample p99/median. Ramp: 10k → 1B.
  4. recursive_emit_depth: handler re-emits itself D times. Ramp: 1 → 1M.
  5. unsubscribe_during_dispatch: three pass/fail probes (self, ahead, behind).

Full contract: STRESS.md.
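The ramp logic described above can be sketched in a few lines. This is a minimal illustration, not the repo's harness: the real one is bash and wraps each step in taskset, timeout, and /usr/bin/time -v, while this sketch only uses a subprocess timeout. The function names and the "ok" / "hang" / "crash" labels are assumptions made for the example.

```python
import subprocess


def geometric_ramp(start, stop, factor=10):
    """Yield start, start*factor, ... up to and including stop."""
    value = start
    while value <= stop:
        yield value
        value *= factor


def run_step(cmd, timeout_s=90.0):
    """Run one ramp step as a subprocess and classify the outcome."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return "hang"  # did not finish inside the time budget
    return "ok" if proc.returncode == 0 else "crash"


def find_breaking_point(values, step):
    """Walk the ramp until the first non-ok step.

    Returns (last_ok_value, failing_value, failing_status).
    failing_value is None when every step completed.
    """
    last_ok = None
    for value in values:
        status = step(value)
        if status != "ok":
            return last_ok, value, status
        last_ok = value
    return last_ok, None, "ok"
```

A step function for a real binary would look like `lambda n: run_step(["./stress", "subscriber_count", str(n)])`, where `./stress` is a placeholder path. Passing the step as a callable keeps the ramp logic testable without launching anything.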

Instrumented essay

What changed in this version of the essay

The original post linked out to CSVs and remote chart assets. This version keeps a curated stress snapshot in the blog repo, indexes it into SQLite during the build, and lets the article expose the death matrix and sustained-rate slice as explorable tables instead of inert markdown.

The death matrix

Stress matrix

This is the same death matrix, but now the evidence chips are explorable and the rows can be sorted by whichever axis you care about most.

Source: rt-event-benches / results/stress.csv (5 rows)

entt is the strongest direct competitor in the happy path, which makes its failure profile interesting rather than embarrassing.

Library         subscriber_count  payload_size_bytes  sustained_rate  recursive_emit_depth         unsubscribe_during_dispatch
C++ entt        hang @ 1M         oom @ 16G           hang @ 1B       segfault @ 100k              ok
Go EventBus     hang @ 100M       oom @ 16G           hang @ 10M      wrong_result / deadlock @ 2  ok
Node ee3        hang @ 100M       "ok" @ 16G          hang @ 100M     panic @ 10k                  ok
Python blinker  hang @ 10M        oom @ 16G           hang @ 10M      ok @ 1M                      ok
Rust rt-events  ok @ 100M         oom @ 16G           hang @ 1B       stack_overflow @ 1M          statically disallowed

Read “ok @ N” as “completed the largest value we tested without crashing, so we don’t know where its real breaking point is.” Read “hang” as “didn’t finish in 90 seconds.” Everything else is a real crash with a specific signal or exception.

The grid is brutal: even at 149 data points, the five libraries crash in five different shapes. That’s the story.

Subscriber count: flat beats clever

Figure: Subscriber count ramp (stress chart for subscriber count across five event libraries; source: rt-event-benches / plot-stress.py). The chart is stored locally now, but still sourced from the same benchmark harness. The point is the storage story, not the exact line geometry.

This axis registers N handlers for one event and then emits once. It’s the simplest thing a dispatcher has to do, and it exposes the library’s internal data structure more directly than any other axis.

  • rt-events completed 100M subscribers in 3.1 seconds and didn’t crash. The hot path is for h in &vec { (h.trampoline)(h.data, &event) }: a walk over a flat Vec<(data_ptr, fn_ptr)>. No hashing, no reflection, no GC barriers. On a 64-bit target, 100M such pairs at 16 bytes each fit in ~1.6 GB of contiguous RAM.
  • Node EE3 hung at 100M. Last completed value: 10M in 2.3 seconds. EE3 stores listeners in a JS array; at 10M entries you’re paying for V8’s backing-store growth and its GC visiting all of it.
  • Go EventBus took 33 seconds at 10M and hung at 100M. Each subscriber holds a reflect.Value + a name string in a map-of-slice-of-struct. Reflection is not free even when you aren’t dispatching.
  • C++ entt was the most surprising: 100k registered in 1.3 ms, then a hang at 1M. entt::sigh stores calls in a std::vector<delegate>, but 1M sink.connect calls ran into a reallocation cost that scaled worse than I expected.
  • Python blinker hung at 10M. Blinker tracks receivers in a WeakValueDictionary; at that scale, the dict’s rehash + weakref book-keeping dominated.

For me the big lesson is: how your library stores subscribers is more important than how fast it dispatches them. A flat Vec of (fn_ptr, data_ptr) pairs is ugly, but it wins at scale every time.

Payload size: Linux lied to me about Node

Figure: Payload size ramp (stress chart for payload size across five event libraries; source: rt-event-benches / stress payload plot). The misleading Node row is the whole reason I now want the evidence chips in the article itself instead of only in a remote CSV.

This axis emits one event whose payload is N bytes. The handler reads first+last byte to materialize the allocation. Ramp: 8 B → 16 GiB.

Everyone OOMs at 16 GiB — except Node, which reports ok in 0.5 ms.

node,node-ee3,payload_size_bytes,17179869184,ok,0.536668

That’s wrong. Or rather: that’s Linux, not Node. Buffer.alloc(16 GiB) creates a V8 ArrayBuffer of 16 GiB; the backing allocation is an anonymous mmap of 16 GiB of virtual memory. Linux doesn’t commit those pages until you touch them. Reading the first and last byte faults in only 2 pages (8 KiB with 4 KiB pages), and because both are read faults on untouched memory, the kernel can back them with its single shared zero page. Almost none of the 16 GiB ever becomes resident.

The row behind that claim is the node-ee3 payload_size_bytes row quoted above. The article version here keeps the weirdness attached to the claim instead of pushing it into a repo tab.

I left the number in the CSV because it’s real (the library didn’t crash), but it’s useless as a capacity measurement. A correct payload_size_bytes stress test would have the handler write to each byte. That changes the story: it becomes a memory-bandwidth benchmark, and Rust’s zero-copy Vec move starts to show its real advantage.

Lesson: “it didn’t OOM” is not the same as “it handled 16 GiB of data.” Always commit the memory you claim to test.
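The difference between the two handlers is easy to show with an anonymous mapping. This is a minimal sketch under stated assumptions: it uses Python's mmap module rather than Node's Buffer, the helper names are made up for the example, and it writes one byte per page instead of every byte, which is enough to force the kernel to back each page.

```python
import mmap


def probe_ends(buf):
    """Read only the first and last byte, like the original handler.

    This touches at most two pages, however large the mapping is.
    """
    return buf[0], buf[len(buf) - 1]


def commit_every_page(buf, page_size=mmap.PAGESIZE):
    """Write one byte per page so every page has to be backed by real memory.

    Returns the number of pages touched.
    """
    touched = 0
    for offset in range(0, len(buf), page_size):
        buf[offset] = 1
        touched += 1
    return touched
```

Usage: create a mapping with `buf = mmap.mmap(-1, size)`, call `probe_ends(buf)`, then `commit_every_page(buf)`, and compare resident memory after each call (for example via /usr/bin/time -v). Only the second call should scale with `size`.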

Sustained rate: the GC tax is real

Figure: Sustained-rate ramp (stress chart for sustained-rate survival across five event libraries; source: rt-event-benches / stress sustained-rate plot). The slope matters more than the single biggest number; the plot makes the GC story visible before you even read the table.

This axis emits N zero-sized events to 10 subscribers in a tight loop, sampling per-emit latency at a stride that yields ~1000 samples. The breaking point is p99/median > 100 (latency decoupling) or a crash.
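The sampling and the breaking-point rule above can be sketched as follows. This is an illustration, not the stress binaries' code: the function names are hypothetical, the p99 uses the nearest-rank definition (the article does not say which definition the harness uses), and the timer is Python's perf_counter_ns.

```python
import statistics
import time


def sample_stride(n_emits, target_samples=1000):
    """Stride that yields roughly target_samples timed emits."""
    return max(1, n_emits // target_samples)


def collect_samples(emit, n_emits):
    """Call emit() n_emits times, timing only every stride-th call."""
    stride = sample_stride(n_emits)
    samples = []
    for i in range(n_emits):
        if i % stride == 0:
            t0 = time.perf_counter_ns()
            emit()
            samples.append(time.perf_counter_ns() - t0)
        else:
            emit()
    return samples


def p99_over_median(samples):
    """Nearest-rank p99 divided by the median."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer math
    return ordered[rank - 1] / statistics.median(ordered)


def is_decoupled(samples, threshold=100.0):
    """Breaking-point rule from the text: p99/median above the threshold."""
    return p99_over_median(samples) > threshold
```

With 1000 samples the nearest-rank p99 is the 990th smallest value, so a run needs more than 10 slow samples before the ratio moves at all. That is one reason a stride-based sample can miss short stalls.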

Here’s the p99/median ratio at each library’s largest completed value:

Sustained-rate slice

Sort by endurance, elapsed time, or p99/median ratio. The blank ratio cells are intentional: the snapshot only carried that signal for the libraries the article discusses explicitly.

Source: rt-event-benches / results/stress.csv (5 rows)

A p99/median of 2.4x at 100M events is the article's shorthand for “nothing pathological happened.”

Library         Last ok events  Elapsed (s)  p99/median  Breaking point
rt-events       100,000,000     9.6          2.4×        hang @ 1B
entt            100,000,000     45.0         9.7×        hang @ 1B
Node ee3        10,000,000      10.6                     hang @ 100M
Go EventBus     1,000,000       9.3                      hang @ 10M
Python blinker  1,000,000       60.0                     hang @ 10M

rt-events’ p99/median of 2.4× at 100M events means the worst-case emit was 2.4× the median. That’s what you’d get from a CPU branch misprediction at worst — no GC, no allocator fragmentation, nothing queuing up.

entt hits 9.7× — some of that is noise; some is L3-cache line displacement as the loop runs long enough to evict things.

Go EventBus maxes out at 1M events in 9.3 seconds, two orders of magnitude slower than rt-events. Every Publish allocates a []reflect.Value for the handler arguments. The Go GC catches up eventually and runs; p99/median ratio was already >10× at 100k events.

Python blinker does 1M events in 60 seconds. Every send() walks a WeakValueDictionary, dereferences weakrefs, acquires the GIL, and increments/decrements refcounts. Sustained-rate is where blinker’s design choices have nowhere to hide.

The shape of the chart matters more than any single number: the slope of latency-vs-rate is almost entirely GC discipline. Rust’s allocator is not called inside the loop. Go’s is, through reflect.Call. In Python, practically every operation allocates.

Recursive emit depth: all five libraries fail differently

Figure: Recursive emit depth (stress chart for recursive emit depth across five event libraries; source: rt-event-benches / recursive depth plot). This is the axis where every runtime feels like it is confessing something about itself.

This is the axis that taught me the most.

The handler re-emits the same event with depth - 1 and stops at 0. Every library has its own answer. Ranked by survival:

  1. Python blinker: did not crash at 1M. With sys.setrecursionlimit(10**7), Python 3.14 survived one million nested sends in 9.3 seconds, and the C stack was fine.
  2. rt-events: stack overflow at 1M (last ok at 100k in 4 ms).
  3. entt: segfault at 100k (last ok at 10k in 10 ms). Each entt frame is heavier than an rt-events frame — inlined templates put more on the stack per call.
  4. Node EE3: RangeError at 10k (last ok at 1k in 1 ms). V8’s default stack is small.
  5. Go EventBus: cannot recurse at all. Publish holds a non-reentrant sync.Mutex across the entire dispatch; any in-handler call to Publish deadlocks. The stress binary detects this via a 3-second watchdog and reports wrong_result.

Go’s result is not a bug I discovered — it’s documented behavior. But “the library will deadlock if a handler calls back in” is the kind of thing that doesn’t show up on a perf benchmark and will absolutely ruin an afternoon in production. Great fucking job.
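The failure mode is easy to reproduce in miniature. The sketch below is a toy bus, not EventBus's code: it holds one non-reentrant lock across the whole dispatch, and it stands in for the stress binary's watchdog with a lock-acquire timeout. Class and function names are assumptions made for the example.

```python
import threading


class LockedBus:
    """Toy bus that holds a non-reentrant lock for the entire dispatch."""

    def __init__(self, watchdog_s=3.0):
        self._lock = threading.Lock()  # not reentrant, unlike threading.RLock
        self._handlers = []
        self._watchdog_s = watchdog_s

    def subscribe(self, handler):
        self._handlers.append(handler)

    def publish(self, depth):
        # A real non-reentrant mutex would block forever here; the timeout
        # plays the role of the watchdog and turns the hang into a status.
        if not self._lock.acquire(timeout=self._watchdog_s):
            return "deadlock"
        try:
            for handler in list(self._handlers):
                status = handler(self, depth)
                if status != "ok":
                    return status
            return "ok"
        finally:
            self._lock.release()


def reemit(bus, depth):
    """Handler that calls back into the bus until depth reaches 0."""
    if depth > 0:
        return bus.publish(depth - 1)
    return "ok"
```

With `reemit` subscribed, `publish(0)` returns "ok" and `publish(1)` returns "deadlock" once the timeout expires: the very first nested call is enough. Swapping in threading.RLock would make the recursion work, which is exactly the kind of policy decision a library has to make on purpose.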

Python beating everyone else on depth is also a library-level win: blinker keeps its per-dispatch frame tiny, and CPython 3.14’s stack design keeps deep recursion from blowing out.

rt-events stack-overflows at 1M against the default main-thread stack of a Linux process (8 MiB). Each level of recursion is emit → trampoline → closure body → emit → …; surviving 100k levels but not 1M within 8 MiB puts the cost at roughly 8 to 80 bytes of stack per level. That’s as good as I can get without switching to a trampoline queue, which would be a semantic change (no more Rust-native recursive dispatch).

Lesson: “re-entrancy” is a spec decision your library documentation rarely makes explicit. Stress-test it before you find out at 3 a.m.
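To make the recursive-versus-queued trade-off concrete, here is a minimal sketch of both policies. Neither class is any of the five libraries' code, and all names are hypothetical. The queued bus turns a nested emit into an enqueue, so the stack stays flat but the handler's emit call returns before the nested event has been delivered.

```python
from collections import deque


class RecursiveBus:
    """Nested emits run immediately, one stack level per emit."""

    def __init__(self):
        self.handlers = []

    def emit(self, depth):
        for handler in self.handlers:
            handler(self, depth)


class QueuedBus:
    """Nested emits are queued and delivered by the outermost emit."""

    def __init__(self):
        self.handlers = []
        self._queue = deque()
        self._draining = False

    def emit(self, depth):
        self._queue.append(depth)
        if self._draining:
            return  # nested call: the outer loop will deliver this event
        self._draining = True
        try:
            while self._queue:
                event = self._queue.popleft()
                for handler in self.handlers:
                    handler(self, event)
        finally:
            self._draining = False


def make_countdown(log):
    """Handler that records the depth and re-emits depth - 1 until 0."""
    def handler(bus, depth):
        log.append(depth)
        if depth > 0:
            bus.emit(depth - 1)
    return handler
```

Under CPython's default recursion limit, the recursive bus raises RecursionError long before depth 50,000, while the queued bus walks the same countdown with constant stack depth. The cost is the semantic change mentioned above: code after `bus.emit(...)` in a handler now runs before the nested event's handlers, not after.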

Unsubscribe during dispatch: five different correct answers

The last axis runs three probes against each library: a handler that unsubscribes itself, one that unsubscribes a later sibling (ahead of the cursor), one that unsubscribes an earlier sibling (already fired).

Nobody crashes. Everyone gives a different answer:

  • rt-events (Rust): statically disallowed. emit(&self) takes a shared reference; off(&mut self) takes an exclusive one. A closure registered with on cannot capture &mut bus. The compiler rejects every version of this code. That’s the library’s answer to iterator invalidation: the class of bug is absent by construction.
  • entt (C++): sink.disconnect() during trigger() uses swap-and-pop on the underlying std::vector inside a reverse iteration. Result: one handler may get called twice, another zero times, depending on which sibling is disconnected. Defined, but surprising.
  • EE3 (Node): copy-on-write. removeListener allocates a new filtered array and re-assigns the map slot; the in-flight emit keeps iterating the original (pre-remove) array. Net effect: everyone fires in the current emit; removal takes effect on the next emit. Elegant.
  • EventBus (Go): deadlock. Unsubscribe acquires the same non-reentrant mutex that Publish already holds. Same 3-second watchdog as recursion.
  • blinker (Python): silently succeeds. Blinker iterates a snapshot of the receivers dict; disconnecting a receiver during dispatch just updates the dict for the next send.

Five libraries, five answers to the same question — and none of the answers are wrong. They’re design choices. “What does unsub-during-dispatch do in my library?” is not a question you want to answer empirically in production.
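Two of those policies can be contrasted in a few lines. This is a toy model of the "self" probe, not any library's implementation: one bus iterates the live handler list, the other iterates a snapshot taken at the start of the emit. All names are made up for the example.

```python
class Bus:
    """Toy dispatcher with a switch between live and snapshot iteration."""

    def __init__(self, snapshot):
        self.snapshot = snapshot
        self.handlers = []

    def on(self, handler):
        self.handlers.append(handler)

    def off(self, handler):
        self.handlers.remove(handler)

    def emit(self):
        """Dispatch once and return the names of the handlers that fired."""
        targets = list(self.handlers) if self.snapshot else self.handlers
        fired = []
        for handler in targets:
            fired.append(handler.__name__)
            handler(self)
        return fired


def first(bus):
    bus.off(first)  # unsubscribes itself during dispatch


def second(bus):
    pass


def third(bus):
    pass
```

With all three handlers registered in order, the live bus fires first and third on the first emit: removing first shifts second into the slot the iterator has already passed, so second is skipped. The snapshot bus fires all three. On the next emit both fire second and third. Neither is a crash, and both are defensible, but only one of them is what most callers expect.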

Breaking points overview

Figure: Breaking points overview (stress chart summarizing breaking points across five event libraries; source: rt-event-benches / breaking points plot). Plain bars mean the library survived the max tested value. Hatched bars mean it failed before that ceiling.

The bar chart shows the largest completed ramp value per library per axis. Hatched bars crashed; plain bars didn’t crash at the max we tested (we don’t know how much further they’d have gone).

If I had to pick one row that surprised me: rt-events surviving the full subscriber_count ramp (100M) without crashing. It took 3.1 seconds — not fast, but finite. The other four libraries hit a wall earlier because of how they store subscribers, not how they dispatch them.

Why you should stress-test

Perf benchmarks tell you how fast your hot path is. They don’t tell you:

  • What breaks first (subscriber storage? payload allocator? reflection?)
  • How it breaks (panic? segfault? silent wrong-result? deadlock?)
  • At what scale (10k? 10M? 10B?)
  • Whether “ok” means “handled it” or “didn’t touch it” (see the Node 16 GiB lazy-alloc case)
  • What the library’s policy is on re-entrancy and iterator invalidation

Happy-path benchmarks lie by omission. They don’t distinguish between a library that does the right thing at the edge and one that deadlocks silently. Both can get 30 ns/op on the microbench.

Stress-test axes worth stealing for any pub/sub library you use:

  • Max subscriber count per type
  • Max payload size (with actual writes, not just allocation)
  • Sustained emit rate (p99/median ratio over time)
  • Recursive / re-entrant dispatch depth
  • Mutation during dispatch (unsubscribe, subscribe, emit-different-event)

If you can’t tell me where your library breaks, you don’t know what you shipped.

The death shape

The five axes above are five rays from the origin through input space. Each one tells you where the library breaks when you push a single input to its limit. None of them tells you what happens when two inputs load each other: 1000 req_A/s might survive; 10 req_A/s + 10 req_B/s might not.

Every piece of software has an n-dimensional space of inputs. The death shape is a surface in that space. The five axes gave us five points on it. The real surface is continuous, and most of it is uncharted.

Real systems fail along a surface.

Run it yourself

git clone https://github.com/oriongonza/rt-event-benches
cd rt-event-benches
scripts/run-stress.sh              # all languages (takes ~30 min)
scripts/run-stress.sh rust-rt-events  # one language
TIMEOUT=30 AXIS=recursive_emit_depth scripts/run-stress.sh cpp-entt
python3 scripts/aggregate-stress.py
python3 scripts/plot-stress.py

Every row of results/stress.csv carries language, library, axis, value, status, latency_ms, peak_memory_mb, death_mode, runtime_version, commit, and notes. If you disagree with a classification, the row is the source of truth.
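Because every row carries those columns, a breaking-point summary is a short fold over the file. The sketch below is not scripts/aggregate-stress.py; it is a hedged illustration that only relies on the first six column names listed above. The first sample row is the one quoted earlier in the article; the other two are illustrative rows shaped after the Node recursion result, and their exact cells are assumptions.

```python
import csv
import io

SAMPLE = """\
language,library,axis,value,status,latency_ms
node,node-ee3,payload_size_bytes,17179869184,ok,0.536668
node,node-ee3,recursive_emit_depth,1000,ok,1
node,node-ee3,recursive_emit_depth,10000,panic,
"""


def breaking_points(csv_text):
    """Map (library, axis) to the largest ok value and the smallest failure."""
    summary = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["library"], row["axis"])
        entry = summary.setdefault(key, {"last_ok": None, "death": None})
        value = int(row["value"])
        if row["status"] == "ok":
            if entry["last_ok"] is None or value > entry["last_ok"]:
                entry["last_ok"] = value
        elif entry["death"] is None or value < entry["death"][0]:
            entry["death"] = (value, row["status"])
    return summary
```

A `death` of None is the "ok @ N" case from the matrix: the ramp ran out before the library did, so `last_ok` is a lower bound rather than a breaking point.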

The harness (scripts/run-stress.sh) is ~200 lines of bash; the per-language stress binaries are each ~150 lines. Add a new library by writing a stress.sh that responds to <axis> <value> with a JSON line on stdout and exits. The five axes port cleanly to any in-process dispatcher you care to put under the same microscope.
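For a sense of how small such an adapter can be, here is a sketch of the program a stress.sh might delegate to. The JSON field names are borrowed from the CSV columns and are an assumption; the actual contract is in STRESS.md. The "unsupported" status and the plain-list dispatcher are placeholders for whatever library is under test.

```python
import json
import time


def run_axis(axis, value):
    """Run one stress step and describe the outcome as a dict."""
    start = time.perf_counter()
    status = "ok"
    try:
        if axis == "subscriber_count":
            # Placeholder dispatcher: register `value` handlers, emit once.
            handlers = [(lambda event: None) for _ in range(value)]
            for handler in handlers:
                handler("tick")
        else:
            status = "unsupported"  # hypothetical label for unported axes
    except MemoryError:
        status = "oom"
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"axis": axis, "value": value, "status": status, "latency_ms": latency_ms}


def main(argv):
    """Entry point: argv is [program, axis, value]; prints one JSON line."""
    axis, value = argv[1], int(argv[2])
    print(json.dumps(run_axis(axis, value)))
    return 0
```

As a script this would end with `sys.exit(main(sys.argv))`. Hard crashes (segfaults, stack overflows) never reach the JSON line, so the harness still has to classify a missing line and the exit signal on its own.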
