Performance Hints

6 minute read

TL;DR

Performance optimization is not a chore to be postponed for “later.” It is a process of calculating computer resource costs and making design-level choices to minimize waste. The key is algorithmic efficiency, compact memory layouts, and reducing unnecessary memory allocations.


Overview

1. The Importance of Thinking about Performance

  • Performance-Driven Design: The mindset of “let’s profile and fix it later” often results in a “flat profile” where bottlenecks are scattered everywhere, making it impossible to fix. You should choose efficient alternatives from the start.

2. Estimation

  • Cultivating Performance Intuition: Decide how much complexity to tolerate based on whether the code you are writing is for testing, a library, or a hot path.
  • Back-of-the-envelope Calculations: Calculate the costs from L1 cache (0.5 ns) to disk seek (5 ms) to determine which design is advantageous before implementing it.

3. Measurement

  • The Value of Measurement: Profiling unfamiliar code helps you understand its structure and is the only way to precisely verify the benefits before and after improvements.
  • Profiling Tools and Tips: Utilize pprof or perf, and perform microbenchmarks under conditions similar to the production environment to validate the results.
  • What to Do When Profiles Are Flat: If there is no clear bottleneck, gather 20 small improvements, try changing loop structures, reduce allocation counts, or analyze hardware counters.

4. API Considerations

  • Bulk APIs: Provide interfaces that process multiple items at once to reduce function call overhead and lock acquisition costs.
  • View Types: Use string_view or Span to reference memory regions without copying data.
  • Pre-allocated/Pre-calculated Arguments: Pass information already known by the caller to the routine to prevent redundant calculations or allocations internally.
  • Thread-Compatible vs. Thread-Safe: Instead of blindly applying internal locks, default to “thread-compatible” types where the caller is responsible for synchronization, preventing performance loss.

5. Algorithmic Improvements

  • Fundamental Optimization: Change O(N²) logic to O(N log N) or replace it with faster data structures based on recent academic papers (e.g., the Pearce-Kelly algorithm).

6. Better Memory Representation

  • Compact Data Structures: Pack data tightly in memory to improve cache efficiency.
  • Memory Layout: Adjust field order to reduce padding and group frequently used fields together to prevent cache misses.
  • Indices Instead of Pointers: Use small integer indices instead of 64-bit pointers to reduce memory consumption and indirection costs.
  • Batched Storage: Group and allocate elements in chunks rather than allocating them individually to reduce management overhead.
  • Inlined Storage: Utilize internal space directly when the number of elements is small (e.g., InlinedVector) to avoid heap allocation.
  • Avoid Unnecessarily Nested Maps: Instead of having a map inside a map, use a single map with a composite key to reduce the number of lookups.
  • Arenas: Allocate objects with similar lifetimes together in a large memory block to minimize deallocation costs.
  • Arrays Instead of Maps: If the key range is small, use simple arrays instead of maps to achieve O(1) performance.
  • Bit Vectors Instead of Sets: Use bit vectors for integer sets to drastically save memory and accelerate operations.

7. Reduce Allocations

  • Avoid Unnecessary Allocations: Reuse static objects or actively use the stack instead of the heap rather than creating empty objects.
  • Resize or Reserve Containers: Call reserve() with the expected size in advance to prevent data relocation and memory reallocation overhead.
  • Avoid Copying: Utilize move semantics, and sort pointer or index arrays rather than moving large objects themselves.
  • Reuse Temporary Objects: Declare objects outside of loops instead of declaring them inside, so that internal buffers can be reused continuously.

8. Avoid Unnecessary Work

  • Fast Paths for Common Cases: Check simple, common cases first to skip heavy logic.
  • Pre-calculate Expensive Information: Pre-calculate and store the results of expensive operations that are used repeatedly.
  • Move Expensive Calculations Out of Loops: Calculate values that do not change inside a loop before the loop starts.
  • Defer Expensive Calculations: Postpone calculations as much as possible until the result is truly needed.
  • Specialize Code: Write custom code tailored to specific situations instead of using generic libraries to eliminate the overhead of general-purpose features.
  • Leverage Caching: Store the results of previously performed work and return them immediately upon subsequent calls.

9. Make the Compiler’s Job Easier

  • Provide Optimization Hints: Remove calls inside hot functions, prevent aliasing by copying to local variables, and manually unroll loops to help the compiler generate better machine code.

10. Reduce Stats Collection Costs

  • Optimize Statistics: Aggressively remove unused statistics and use sampling to record only a fraction of requests, lowering the overhead.

11. Avoid Logging on Hot Code Paths

  • Remove and Defer Logging: In frequently executed paths, check if logging is enabled beforehand or remove logging altogether to preserve execution speed.

12. Code Size Considerations

  • Trim Frequently Inlined Code: Exclude large code blocks, such as error handling in widely used functions, from inlining to prevent binary bloat.
  • Inline Judiciously: Indiscriminate inlining increases binary size and degrades instruction cache efficiency, so control it appropriately.
  • Reduce Template Instantiation: Extract common logic out of templates to reduce duplicated machine code.
  • Reduce Container Operations: Use bulk operations to process tasks like map initialization all at once to reduce the generated code size.

13. Parallelization and Synchronization

  • Exploit Parallelism: Divide tasks to utilize multi-cores, but manage coordination costs using batch-unit processing.
  • Amortize Lock Acquisition: Process multiple operations under a single lock to reduce the number of lock acquisitions.
  • Keep Critical Sections Short: Minimize work performed while holding a lock, and perform expensive operations outside critical sections.
  • Reduce Contention through Sharding: Split a single lock into multiple shards to reduce collision probability among threads.
  • SIMD Instructions: Actively utilize CPU capabilities that process multiple data items with a single instruction.
  • Reduce False Sharing: Place thread-specific data on different cache lines to prevent unnecessary cache synchronization.
  • Reduce Context Switch Frequency: Process tiny tasks directly on the current thread instead of passing them to a thread pool.
  • Buffered Channels for Pipelining: Provide appropriate buffers in channels to reduce waiting times between sender and receiver threads.
  • Lock-free Access: Use atomic operations instead of mutexes to implement wait-free data structures.

14. Protocol Buffer Advice

  • Minimize Serialization Costs: Avoid unnecessary nesting, prioritize field numbers 1–15, choose appropriate numeric types, use VIEW/CORD, and utilize arenas to reduce protobuf overhead.

15. C++-Specific Advice

  • Leverage Optimized Libraries: Apply language- and library-specific techniques, such as using flat_hash_map, InlinedVector, and optimizing Status/StatusOr.

Takeaways

  • Hardware Does Not Lie: Understanding the cache hierarchy and memory access costs is the starting point of performance tuning.
  • Design Is Performance: Designing efficient APIs and data layouts is far more powerful than optimizing code later.
  • Small Improvements Accumulate: Do not ignore a 1% improvement. These savings stack up to determine the stability of large-scale systems.
  • Prove Hypotheses by Measuring: Do not rely on intuition alone; use profilers to verify where resources are actually leaking.

Reference

Jeffrey Dean & Sanjay Ghemawat, Performance Hints, 2025

Further Reading

Leave a comment