Performance Hints
TL;DR
Performance optimization is not a chore to be postponed for “later.” It is a process of calculating computer resource costs and making design-level choices to minimize waste. The key is algorithmic efficiency, compact memory layouts, and reducing unnecessary memory allocations.
Overview
1. The Importance of Thinking about Performance
- Performance-Driven Design: The mindset of “let’s profile and fix it later” often results in a “flat profile” where bottlenecks are scattered everywhere, making it impossible to fix. You should choose efficient alternatives from the start.
2. Estimation
- Cultivating Performance Intuition: Decide how much complexity to tolerate based on whether the code you are writing is for testing, a library, or a hot path.
- Back-of-the-envelope Calculations: Calculate the costs from L1 cache (0.5 ns) to disk seek (5 ms) to determine which design is advantageous before implementing it.
3. Measurement
- The Value of Measurement: Profiling unfamiliar code helps you understand its structure and is the only way to precisely verify the benefits before and after improvements.
- Profiling Tools and Tips: Utilize
pproforperf, and perform microbenchmarks under conditions similar to the production environment to validate the results. - What to Do When Profiles Are Flat: If there is no clear bottleneck, gather 20 small improvements, try changing loop structures, reduce allocation counts, or analyze hardware counters.
4. API Considerations
- Bulk APIs: Provide interfaces that process multiple items at once to reduce function call overhead and lock acquisition costs.
- View Types: Use
string_vieworSpanto reference memory regions without copying data. - Pre-allocated/Pre-calculated Arguments: Pass information already known by the caller to the routine to prevent redundant calculations or allocations internally.
- Thread-Compatible vs. Thread-Safe: Instead of blindly applying internal locks, default to “thread-compatible” types where the caller is responsible for synchronization, preventing performance loss.
5. Algorithmic Improvements
- Fundamental Optimization: Change O(N²) logic to O(N log N) or replace it with faster data structures based on recent academic papers (e.g., the Pearce-Kelly algorithm).
6. Better Memory Representation
- Compact Data Structures: Pack data tightly in memory to improve cache efficiency.
- Memory Layout: Adjust field order to reduce padding and group frequently used fields together to prevent cache misses.
- Indices Instead of Pointers: Use small integer indices instead of 64-bit pointers to reduce memory consumption and indirection costs.
- Batched Storage: Group and allocate elements in chunks rather than allocating them individually to reduce management overhead.
- Inlined Storage: Utilize internal space directly when the number of elements is small (e.g.,
InlinedVector) to avoid heap allocation. - Avoid Unnecessarily Nested Maps: Instead of having a map inside a map, use a single map with a composite key to reduce the number of lookups.
- Arenas: Allocate objects with similar lifetimes together in a large memory block to minimize deallocation costs.
- Arrays Instead of Maps: If the key range is small, use simple arrays instead of maps to achieve O(1) performance.
- Bit Vectors Instead of Sets: Use bit vectors for integer sets to drastically save memory and accelerate operations.
7. Reduce Allocations
- Avoid Unnecessary Allocations: Reuse static objects or actively use the stack instead of the heap rather than creating empty objects.
- Resize or Reserve Containers: Call
reserve()with the expected size in advance to prevent data relocation and memory reallocation overhead. - Avoid Copying: Utilize move semantics, and sort pointer or index arrays rather than moving large objects themselves.
- Reuse Temporary Objects: Declare objects outside of loops instead of declaring them inside, so that internal buffers can be reused continuously.
8. Avoid Unnecessary Work
- Fast Paths for Common Cases: Check simple, common cases first to skip heavy logic.
- Pre-calculate Expensive Information: Pre-calculate and store the results of expensive operations that are used repeatedly.
- Move Expensive Calculations Out of Loops: Calculate values that do not change inside a loop before the loop starts.
- Defer Expensive Calculations: Postpone calculations as much as possible until the result is truly needed.
- Specialize Code: Write custom code tailored to specific situations instead of using generic libraries to eliminate the overhead of general-purpose features.
- Leverage Caching: Store the results of previously performed work and return them immediately upon subsequent calls.
9. Make the Compiler’s Job Easier
- Provide Optimization Hints: Remove calls inside hot functions, prevent aliasing by copying to local variables, and manually unroll loops to help the compiler generate better machine code.
10. Reduce Stats Collection Costs
- Optimize Statistics: Aggressively remove unused statistics and use sampling to record only a fraction of requests, lowering the overhead.
11. Avoid Logging on Hot Code Paths
- Remove and Defer Logging: In frequently executed paths, check if logging is enabled beforehand or remove logging altogether to preserve execution speed.
12. Code Size Considerations
- Trim Frequently Inlined Code: Exclude large code blocks, such as error handling in widely used functions, from inlining to prevent binary bloat.
- Inline Judiciously: Indiscriminate inlining increases binary size and degrades instruction cache efficiency, so control it appropriately.
- Reduce Template Instantiation: Extract common logic out of templates to reduce duplicated machine code.
- Reduce Container Operations: Use bulk operations to process tasks like map initialization all at once to reduce the generated code size.
13. Parallelization and Synchronization
- Exploit Parallelism: Divide tasks to utilize multi-cores, but manage coordination costs using batch-unit processing.
- Amortize Lock Acquisition: Process multiple operations under a single lock to reduce the number of lock acquisitions.
- Keep Critical Sections Short: Minimize work performed while holding a lock, and perform expensive operations outside critical sections.
- Reduce Contention through Sharding: Split a single lock into multiple shards to reduce collision probability among threads.
- SIMD Instructions: Actively utilize CPU capabilities that process multiple data items with a single instruction.
- Reduce False Sharing: Place thread-specific data on different cache lines to prevent unnecessary cache synchronization.
- Reduce Context Switch Frequency: Process tiny tasks directly on the current thread instead of passing them to a thread pool.
- Buffered Channels for Pipelining: Provide appropriate buffers in channels to reduce waiting times between sender and receiver threads.
- Lock-free Access: Use atomic operations instead of mutexes to implement wait-free data structures.
14. Protocol Buffer Advice
- Minimize Serialization Costs: Avoid unnecessary nesting, prioritize field numbers 1–15, choose appropriate numeric types, use
VIEW/CORD, and utilize arenas to reduce protobuf overhead.
15. C++-Specific Advice
- Leverage Optimized Libraries: Apply language- and library-specific techniques, such as using
flat_hash_map,InlinedVector, and optimizingStatus/StatusOr.
Takeaways
- Hardware Does Not Lie: Understanding the cache hierarchy and memory access costs is the starting point of performance tuning.
- Design Is Performance: Designing efficient APIs and data layouts is far more powerful than optimizing code later.
- Small Improvements Accumulate: Do not ignore a 1% improvement. These savings stack up to determine the stability of large-scale systems.
- Prove Hypotheses by Measuring: Do not rely on intuition alone; use profilers to verify where resources are actually leaking.
Reference
Jeffrey Dean & Sanjay Ghemawat, Performance Hints, 2025
Further Reading
- Optimizing software in C++ by Agner Fog. Describes many useful low-level techniques for improving performance.
- Understanding Software Dynamics by Richard L. Sites. Covers expert methods and advanced tools for diagnosing and fixing performance problems.
- Performance tips of the week - a collection of useful tips.
- Performance Matters - a collection of articles about performance.
- Daniel Lemire’s blog - high performance implementations of interesting algorithms.
- Building Software Systems at Google and Lessons Learned - a video that describes system performance issues encountered at Google over a decade.
- Programming Pearls and More Programming Pearls: Confessions of a Coder by Jon Bentley. Essays on starting with algorithms and ending up with simple and efficient implementations.
- Hacker’s Delight by Henry S. Warren. Bit-level and arithmetic algorithms for solving some common problems.
- Computer Architecture: A Quantitative Approach by John L. Hennessy and David A. Patterson - Covers many aspects of computer architecture, including one that performance-minded software developers should be aware of like caches, branch predictors, TLBs, etc.
Leave a comment