Performance Engineering on Hard Mode with Andrew Hunter: Summary

This post provides a comprehensive, structured summary of Episode 18 of Jane Street’s Signals and Threads podcast: “Performance Engineering on Hard Mode”.

In this episode, host Ron Minsky sits down with Andrew Hunter—a software engineer on Jane Street’s Market Data team and former Tech Lead for Google’s tcmalloc. They discuss the structural differences between hyperscale and low-latency systems optimization, the mechanics of modern profiling tools, memory model trade-offs in OCaml vs. C++, and how hardware constraints and iteration loops shape high-performance engineering.


1. The World and Motivation of Performance Engineering

Understanding the drive behind performance optimization and the mindset required to master low-level systems.

  • The Addictive Nature of Performance Optimization (The Hook)
    • Electric High: The primary motivator is the direct, physical rush of achievement felt when diagnosing a bottleneck and making a complex system run tangibly faster.
    • Passion for the “Boring”: High-performance engineering requires an intense curiosity about low-level hardware details (such as CPU pipeline architectures, interconnect buses, and compilers) that others might find tedious or dry.
  • Deep Investigation Without Scope Boundaries
    • Learning Style: Real learning happens when you refuse to accept “out of scope” as an answer. For example, during college OS classes, Andrew built his foundation by reading the actual Linux virtual memory source code rather than just relying on textbooks.
    • Digging to the Core: A great systems engineer never ignores low-level mysteries. If a process behaves strangely, you must follow the trail all the way down into the kernel and CPU registers.
  • Academic Research vs. Industrial Reality
    • Academia: Often focuses on one-off research projects to prove a concept, write a paper, and then discard the code once the grade or publication is secured.
    • Industry: The real work begins after code rollout. Deploying code to production exposes it to real data and unpredictable edge cases. Maintaining a system over time and responding to production issues builds a feedback loop that refines engineering intuition. It is a vital transition from the academic mindset of “turning it in and getting an A” to taking ongoing responsibility for production systems.

2. Hyperscale (“Easy Mode”) vs. Low-Latency (“Hard Mode”) Optimization

A comparison of the optimization environments at massive web hyperscalers like Google vs. latency-critical trading firms like Jane Street.

  • Hyperscale Optimization: “Easy Mode”
    • Google’s tcmalloc Example: Andrew spent seven years at Google working on multithreaded architectures, serving as the Tech Lead for tcmalloc (Thread-Caching Malloc)—Google’s world-class, highly scalable memory allocator.
    • High Leverage: When operating millions of machines, saving even 0.5% or 1% of CPU cycles yields massive financial savings (tens of millions of dollars in hardware and electricity).
    • The Data Center Tax: A significant portion (10–20%) of hyperscaler compute cycles is spent on common infrastructure rather than business logic (e.g., memory allocation, serialization, logging, and compression).
    • Target-Rich Environment: Because common infrastructure is shared across all services, profile hotspots are highly visible and straightforward to identify. The optimizations themselves still require deep expertise, but knowing what to optimize is obvious, which is what makes it “Easy Mode.”
  • Low-Latency Trading Optimization: “Hard Mode”
    • Tail Latency Matters: Improving total CPU throughput by 1% is virtually worthless. Instead, optimization targets the worst-case latency (e.g., 99th percentile tail latency) during peak market activity (burst periods).
    • Intentional Idling: High-performance trading systems use user-space polling I/O to avoid kernel context switches. The CPU spins in a tight loop waiting for packets, so a traditional profiler shows 95-99% of the time going to what is effectively idle waiting rather than useful work.
    • Real-world Latency Bug Example: Andrew described a case where the send_order logic was extremely cheap to evaluate and execute (accounting for less than 5% of CPU cycles in the profiler). However, orders were being dispatched 200 microseconds after the market evaluation was complete. The bottleneck was a system misconfiguration: the system used an older API that queued network traffic onto a low-priority thread pool, assuming the data wasn’t latency-critical. Traditional profiling failed to catch this, but it was easily resolved by switching to an eager API after tracing the timeline.
    • Bypassing the “Never Optimize Without Measuring” Dogma: Andrew challenges the common academic rule of “never optimize code before measuring it via profiling.” When building systems where latency is the defining constraint, developers must build self-discipline to make efficient choices by default at the design stage, as long as the optimizations don’t make the code unsafe or unmaintainable.
    • The Origin of Mechanical Sympathy: The term is attributed to F1 racing champion Jackie Stewart, who argued that drivers must intuitively understand how the car’s engine, gears, and suspension work in order to push it to its limits. Similarly, software engineers need mechanical sympathy: an awareness of how high-level code constructs interact with physical hardware such as CPU caches, buses, and pipelines.
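The send_order anecdote above can be made concrete with a small Python sketch. It assumes a hand-written timeline: the span names, the 40 µs evaluation time, and the 2 µs send cost are illustrative, and only the 200 µs gap comes from the story. It contrasts what a CPU profile would report with what a wall-clock timeline shows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    start_us: int   # wall-clock start, in microseconds
    end_us: int     # wall-clock end, in microseconds
    on_cpu: bool    # False means the work is sitting in a queue


def cpu_share(spans, name):
    """Fraction of on-CPU time attributed to `name` (what a CPU profile shows)."""
    on_cpu = [s for s in spans if s.on_cpu]
    total = sum(s.end_us - s.start_us for s in on_cpu)
    mine = sum(s.end_us - s.start_us for s in on_cpu if s.name == name)
    return mine / total


def wait_before(spans, name):
    """Wall-clock gap between the previous on-CPU span ending and `name` starting."""
    ordered = sorted((s for s in spans if s.on_cpu), key=lambda s: s.start_us)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.name == name:
            return cur.start_us - prev.end_us
    raise ValueError(f"no on-CPU span before {name!r}")


# Hypothetical timeline; only the 200 us gap comes from the anecdote.
timeline = [
    Span("evaluate_market", 0, 40, on_cpu=True),
    Span("low_priority_queue", 40, 240, on_cpu=False),
    Span("send_order", 240, 242, on_cpu=True),
]

print(f"send_order CPU share: {cpu_share(timeline, 'send_order'):.1%}")
print(f"delay before send_order: {wait_before(timeline, 'send_order')} us")
```

In this toy timeline send_order stays under 5% of CPU time, yet the order leaves 200 µs late. Only the timeline view makes the queueing delay visible.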

3. Measurement & Profiling Tools

The mechanics, use cases, and limitations of profiling and tracing technologies.

  • Sampling Profilers (e.g., Linux perf)
    • How It Works: Periodic interrupts (e.g., every 100 microseconds) halt the program to record the instruction pointer and stack trace.
    • Performance Monitoring Counters (PMCs): Allow sampling based on hardware events (like L2 cache misses or branch mispredictions) rather than raw time, pinpointing why a particular section of code is slow.
    • Limitations: Because sampling is statistical, it cannot detect structural delays where code runs quickly but sits waiting in a queue (like the low-priority queue bug mentioned above).
  • Retrospective Tracing (e.g., magic-trace)
    • How It Works: Uses Intel Processor Trace (PT), a hardware-level feature that records every branch the CPU executes into a ring buffer with minimal overhead. When a trigger fires (for example, a latency event that exceeds a threshold), the buffer is captured and used to reconstruct a nanosecond-accurate timeline of the function calls leading up to the trigger.
    • Advantages: Unlike statistical sampling, it shows the exact sequence of instructions. It reveals whether a function taking up 40% of runtime is caused by a single slow call or thousands of tiny calls nested within a tight loop.
    • Overhead: Despite hardware acceleration, Intel PT incurs a 5–15% performance penalty. As a result, it is not run continuously in production but is instead enabled dynamically or in specialized testbeds.
  • The Reality of Tail Latencies and Queuing Theory
    • Major GC Halts: Capturing worst-case latencies with magic-trace often exposes Stop-the-World pauses caused by the Garbage Collector’s Major GC. In such cases, minimizing heap allocations is the obvious path forward.
    • Tails as Repeated Medians: Intriguingly, most tail latencies do not arise from bizarre bugs or weird kernel faults. Instead, they are simply “normal median processing times repeatedly piled up,” matching classic Queuing Theory.
    • Message Bursts and Batch Processing: While a single packet may take only 1 microsecond to process, a sudden burst of 10,000 incoming packets (e.g., a massive market feed update) causes significant queuing delays for packets at the back of the queue. While micro-optimizing individual processing functions helps, introducing batch processing at the architecture level is often the most effective way to eliminate queue build-up and manage tail latencies.
  • Memory Lifecycle Tracing (e.g., memtrace)
    • Provides a timeline showing when a sampled allocation was created and when it was freed. This helps track down memory churn and optimize garbage collection (GC) interaction.
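The burst arithmetic in this section can be sketched in Python. The split of the 1 µs per-packet cost into a 600 ns fixed part and a 400 ns per-packet part, and the batch size of 100, are assumptions chosen for illustration. The point is only that amortizing fixed costs shortens the wait for every packet further back in the queue.

```python
def fifo_burst_latencies_ns(n_packets, per_packet_ns):
    """Completion time of each packet when the whole burst arrives at once (FIFO)."""
    return [per_packet_ns * (i + 1) for i in range(n_packets)]


def batched_burst_latencies_ns(n_packets, batch_size, fixed_ns, variable_ns):
    """Same burst, but the fixed cost is paid once per batch instead of per packet.

    Every packet in a batch is treated as done when its batch finishes.
    """
    latencies = []
    clock_ns = 0
    for start in range(0, n_packets, batch_size):
        size = min(batch_size, n_packets - start)
        clock_ns += fixed_ns + size * variable_ns
        latencies.extend([clock_ns] * size)
    return latencies


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# 10,000 packets at 1 us each, as in the text; the cost split is hypothetical.
unbatched = fifo_burst_latencies_ns(10_000, 1_000)
batched = batched_burst_latencies_ns(10_000, 100, fixed_ns=600, variable_ns=400)

for label, lat in (("unbatched", unbatched), ("batched", batched)):
    print(label, percentile(lat, 50), percentile(lat, 99), max(lat))
```

Without batching, the median packet in the burst already waits 5 ms and the last one waits 10 ms, even though no single step is slow. Under the assumed cost split, batching brings the last packet down to roughly 4.06 ms.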

4. Performance Visualization: Flame Graphs vs. pprof

Analyzing the strengths and weaknesses of different profiling visualizations.

  • Flame Graphs
    • Strengths: Highly intuitive. The horizontal width represents the share of CPU time, making it easy to see where cycles are spent at a glance.
    • Weaknesses (Missing Join Points): If multiple independent functions (callers) call the same low-level helper (callee, such as malloc), those calls appear as tiny, separate segments at the top of the graph. If 15 different paths call malloc, each taking 2%, you might ignore them. In reality, malloc is consuming 30% of total runtime, but the Flame Graph obscures this.
  • Google’s pprof (DAG Visualization)
    • How It Works: Renders call stacks as a Directed Acyclic Graph (DAG), representing each function as a single node and calls as edges.
    • Strengths: Exposes “join points” by merging all calls to a specific function (like malloc) into a single node with thick incoming arrows, highlighting shared infrastructure bottlenecks.
    • Weaknesses: Has a steeper learning curve than Flame Graphs and can appear visually complex to new users.
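The join-point problem can be shown with a few lines of Python over a made-up profile. The stack names and the 100-sample total are hypothetical. The sketch compares the widest malloc frame a flame graph would draw with the single merged node a graph view would show.

```python
from collections import Counter


def leaf_totals(samples):
    """Merge samples by innermost function, one node per function."""
    totals = Counter()
    for stack, count in samples.items():
        totals[stack[-1]] += count
    return totals


def edge_weights(samples):
    """Caller -> callee edge weights, as a pprof-style graph would draw them."""
    edges = Counter()
    for stack, count in samples.items():
        for caller, callee in zip(stack, stack[1:]):
            edges[(caller, callee)] += count
    return edges


# Hypothetical profile of 100 samples: 15 call paths each spend 2 samples in malloc.
samples = {("main", f"handler_{i}", "malloc"): 2 for i in range(15)}
samples[("main", "parse_feed")] = 70

total = sum(samples.values())
largest_malloc_path = max(c for s, c in samples.items() if s[-1] == "malloc")

print(f"widest malloc frame in a flame graph: {largest_malloc_path / total:.0%}")
print(f"malloc as a single graph node:        {leaf_totals(samples)['malloc'] / total:.0%}")
```

Each malloc frame is only 2% wide on its own, but the merged node accounts for 30% of all samples, which is the effect described above.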

5. OCaml vs. C++: Language Paradigms in High Performance

How Jane Street manages performance trade-offs in a garbage-collected language like OCaml compared to C++.

  • The Compiler Ecosystem Scale
    • C++: Benefits from decades of optimization work on Clang/LLVM and GCC by thousands of hyperscaler engineers, yielding advanced loop unrolling and register allocation.
    • OCaml: Operating at a smaller scale, OCaml’s compiler has fewer aggressive micro-optimizations, requiring developers to work around limitations in code generation.
  • Memory Layout & GC Overhead
    • Boxy Representation: OCaml structures are heavily boxed, leading to multiple pointer indirection layers. This increases cache footprint compared to C++, where data can be laid out flatly in contiguous memory.
    • GC Pointer Validation and Nulling Out: To prevent crashes during garbage collection, OCaml cannot easily leave uninitialized memory slots. If a stale array slot contains arbitrary garbage bits and the GC attempts to treat those bits as a valid pointer address, it will cause heap corruption or process crashes. Array elements must be explicitly nulled out, introducing minor runtime overhead.
  • The Real-world Cost of Optimization & Language Ergonomics
    • Awkwardness as a Barrier to Optimization: Achieving low-level memory efficiency in OCaml is not impossible. For instance, you could pre-allocate a single 64 GB integer array and manipulate raw offsets into it directly, mimicking manual C-style memory management. However, the result is unreadable, unsafe, and extremely unpleasant to work with.
    • The Ergonomics-Performance Correlation: Ron Minsky notes that “it’s not that any of the optimizations are impossible, it’s that they’re a little bit more awkward than you would like… and when you make things harder, they happen less.” Ultimately, a programming language’s ergonomics directly affect the final performance of real systems. If the path to writing optimized code is highly friction-prone, developers will default to simpler, less efficient code, causing a real performance tax.
  • Foreign Function Interface (FFI) Overhead: Java JNI vs. OCaml C Stubs
    • Java JNI (300-400ns): Transitioning from JVM to native C code is highly expensive due to garbage collector safe-point synchronization, virtual memory protection guards, and object marshalling.
    • OCaml FFI (3-4ns): OCaml compiles directly to native binary code, allowing it to call C functions via simple, lightweight stubs without runtime conversion or safety-guard pauses. This extremely low FFI cost allows Jane Street to integrate performance-critical C libraries seamlessly.
  • Jane Street’s Workarounds
    • DSLs for Contiguous Layouts: For high-speed packet parsing (e.g. NASDAQ feed parser), Jane Street uses Domain-Specific Languages (DSLs). These generate type-safe interfaces backed by raw, flattened byte buffers, bypassing OCaml’s default heap allocation.
    • Zero-Alloc OCaml Dialect: A custom compiler variant developed internally that enforces zero heap allocation in critical paths. Developers bypass the GC by using manual memory management (malloc) for hot-path buffers.
    • Ongoing Compiler Work (Layout Control): Jane Street is actively investing in OCaml compiler development to introduce layout control directly into the OCaml type system (such as the Flambda backend and layouts extension). This will allow developers to declare contiguous memory layouts natively within standard OCaml types.
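The idea of typed accessors over a flat byte buffer can be illustrated with a rough Python analogue. This is not Jane Street's DSL or its generated OCaml code, and Python cannot control allocation the way those tools do. The 14-byte message layout and all names below are invented for the example. It only shows fields being read at fixed offsets from one contiguous buffer, instead of building a separate heap object per message.

```python
import struct

# Hypothetical 14-byte big-endian message: u16 kind, u32 order_id, u32 price, u32 quantity.
MESSAGE = struct.Struct(">HIII")
U32 = struct.Struct(">I")


class MessageView:
    """Typed accessors over one message inside a shared buffer; no field copies are stored."""

    __slots__ = ("_buf", "_offset")

    def __init__(self, buf, offset):
        self._buf = buf
        self._offset = offset

    @property
    def order_id(self):
        return U32.unpack_from(self._buf, self._offset + 2)[0]

    @property
    def price(self):
        return U32.unpack_from(self._buf, self._offset + 6)[0]

    @property
    def quantity(self):
        return U32.unpack_from(self._buf, self._offset + 10)[0]


def iter_messages(buf):
    """Yield a view for each fixed-width message in the buffer."""
    for offset in range(0, len(buf), MESSAGE.size):
        yield MessageView(buf, offset)


# Two messages laid out back to back in one contiguous buffer.
feed = bytearray()
feed += MESSAGE.pack(1, 1001, 25_000, 300)
feed += MESSAGE.pack(1, 1002, 25_010, 150)

total_quantity = sum(m.quantity for m in iter_messages(feed))
print(total_quantity)
```

The design choice being illustrated is that the buffer is the data structure: the view holds only a reference and an offset, so fields that are never read are never decoded.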

6. Discipline and Prioritization in Performance Engineering

The ability to look at suboptimal code and intentionally choose not to optimize it, focusing engineering efforts solely on what moves the business metric.

  • The Discipline of Not Fixing Everything
    • Overcoming Personal Offense: A natural performance engineer feels personal discomfort or offense when encountering slow, allocating, or poorly structured code within a codebase. However, Andrew emphasizes that one of the hardest disciplines to learn in this field is to look at sub-optimal code and say, “No, I am not fixing that.”
  • Strict Prioritization Based on Real-World Impact
    • Excluding Irrelevant Optimizations: Even if a piece of code is technically slow or allocating unnecessarily, if it sits outside the hot paths and has zero impact on tail latency or overall business throughput, it must be ignored.
    • Focusing on the Core Bottlenecks: Engineering hours and human attention are scarce resources. Micro-optimizing parts of a system that do not matter is a distraction. The primary discipline of a performance engineer is to focus exclusively on the one or two critical paths where performance wins yield genuine, measurable business value.

7. Hardware, Architecture, and Iteration Loops

The division of labor between hardware and software, and the importance of rapid iteration.

  • Hardware Acceleration (FPGA) vs. CPUs
    • Physical Limits: Crossing the PCIe bus twice (NIC -> RAM -> CPU -> NIC) introduces roughly 800 nanoseconds of latency. FPGAs bypass this entirely, returning packets in under 100 nanoseconds.
    • Development Friction: FPGAs are difficult to program, require writing Verilog/VHDL, and compile times can exceed 24 hours.
    • Hybrid Architecture: Low-latency designs allocate simple, speed-critical paths to FPGAs, while delegating complex, rapidly changing trading logic to high-performance software.
  • Human-Responsive Systems and Creative Juices
    • User Interfaces and Cognitive Barriers: Performance engineering isn’t just for bots. If a text editor freezes for 5 seconds while a developer is typing, it causes immense frustration and disrupts flow state.
    • Feedback Cycles and Creativity: For researchers testing hypotheses, receiving simulation results in 10 minutes rather than a day changes how they think. Fast iteration loops keep “creative juices” flowing, allowing researchers to run dozens of experiments sequentially without losing focus.
  • John Boyd’s OODA Loop & Iteration Speed
    • OODA Loop (Observe, Orient, Decide, Act): The faster an organization completes this cycle, the more effectively it can adapt to and control its environment.
    • The Whiskey Analogy (Bourbon Aging): Traditional bourbon requires 5–10 years to mature, limiting feedback loops. Using technology to accelerate aging allows tasting and iterating in a month. Even if the quality of a single iteration is slightly lower, completing 12 iterations in the time others take to complete one accelerates overall innovation.

8. The Performance Mindset

At the conclusion of the podcast, Ron and Andrew share insights into the unique mindset and cognitive traits that define a successful performance engineer.

  • Shared Intuition Across Fields
    • Universal Habits of Mind: Optimizing web applications in a browser (on top of a highly complex and idiosyncratic virtual machine) calls for much the same intuitions and habits of mind as optimizing a low-latency trading backend.
    • Platform Independence: While the underlying instruction sets and runtimes differ completely, the core methodology—how an engineer approaches, models, and solves a performance bottleneck—remains fundamentally the same across domains.
  • An Obsession with “Boring” Details
    • Delight in the Gory Guts: The defining characteristic of a great performance engineer is a deep, genuine interest in the “gory details of the guts” of a system—details that others might find incredibly tedious or boring.
    • Reframing Tedium: Low-level micro-architectural behaviors or compiler quirks that are difficult to explain to friends and family without putting them to sleep are viewed by performance engineers not as chore work, but as engaging puzzles.
  • An Untrainable Disposition
    • Innate Curiosity: Andrew questions whether this performance mindset can be actively trained or taught.
    • Intrinsic Motivation: The urge to look at low-level runtime behaviors and think, “This is fascinating; you wouldn’t even have to pay me to look at this” (even if they wouldn’t do it for free in reality) points to an intrinsic, obsessive curiosity. This innate drive to pull back the curtain and understand how things work at the lowest levels is the true engine of performance engineering.
