Performance Hints

6 minute read

TL;DR

Performance optimization is not a chore to be postponed for “later.” It is a process of calculating computer resource costs and making design-level choices to minimize waste. The key is algorithmic efficiency, compact memory layouts, and reducing unnecessary memory allocations.

Overview

1. The Importance of Thinking about Performance

Performance-Driven Design: The mindset of “let’s profile and fix it later” often results in a “flat profile” where bottlenecks are scattered everywhere, making it impossible to fix. You should choose efficient alternatives from the start.

2. Estimation

Cultivating Performance Intuition: Decide how much complexity to tolerate based on whether the code you are writing is for testing, a library, or a hot path.
Back-of-the-envelope Calculations: Calculate the costs from L1 cache (0.5 ns) to disk seek (5 ms) to determine which design is advantageous before implementing it.

3. Measurement

The Value of Measurement: Profiling unfamiliar code helps you understand its structure and is the only way to precisely verify the benefits before and after improvements.
Profiling Tools and Tips: Utilize pprof or perf, and perform microbenchmarks under conditions similar to the production environment to validate the results.
What to Do When Profiles Are Flat: If there is no clear bottleneck, gather 20 small improvements, try changing loop structures, reduce allocation counts, or analyze hardware counters.

4. API Considerations

Bulk APIs: Provide interfaces that process multiple items at once to reduce function call overhead and lock acquisition costs.
View Types: Use string_view or Span to reference memory regions without copying data.
Pre-allocated/Pre-calculated Arguments: Pass information already known by the caller to the routine to prevent redundant calculations or allocations internally.
Thread-Compatible vs. Thread-Safe: Instead of blindly applying internal locks, default to “thread-compatible” types where the caller is responsible for synchronization, preventing performance loss.

5. Algorithmic Improvements

Fundamental Optimization: Change O(N²) logic to O(N log N) or replace it with faster data structures based on recent academic papers (e.g., the Pearce-Kelly algorithm).

6. Better Memory Representation

Compact Data Structures: Pack data tightly in memory to improve cache efficiency.
Memory Layout: Adjust field order to reduce padding and group frequently used fields together to prevent cache misses.
Indices Instead of Pointers: Use small integer indices instead of 64-bit pointers to reduce memory consumption and indirection costs.
Batched Storage: Group and allocate elements in chunks rather than allocating them individually to reduce management overhead.
Inlined Storage: Utilize internal space directly when the number of elements is small (e.g., InlinedVector) to avoid heap allocation.
Avoid Unnecessarily Nested Maps: Instead of having a map inside a map, use a single map with a composite key to reduce the number of lookups.
Arenas: Allocate objects with similar lifetimes together in a large memory block to minimize deallocation costs.
Arrays Instead of Maps: If the key range is small, use simple arrays instead of maps to achieve O(1) performance.
Bit Vectors Instead of Sets: Use bit vectors for integer sets to drastically save memory and accelerate operations.

7. Reduce Allocations

Avoid Unnecessary Allocations: Reuse static objects or actively use the stack instead of the heap rather than creating empty objects.
Resize or Reserve Containers: Call reserve() with the expected size in advance to prevent data relocation and memory reallocation overhead.
Avoid Copying: Utilize move semantics, and sort pointer or index arrays rather than moving large objects themselves.
Reuse Temporary Objects: Declare objects outside of loops instead of declaring them inside, so that internal buffers can be reused continuously.

8. Avoid Unnecessary Work

Fast Paths for Common Cases: Check simple, common cases first to skip heavy logic.
Pre-calculate Expensive Information: Pre-calculate and store the results of expensive operations that are used repeatedly.
Move Expensive Calculations Out of Loops: Calculate values that do not change inside a loop before the loop starts.
Defer Expensive Calculations: Postpone calculations as much as possible until the result is truly needed.
Specialize Code: Write custom code tailored to specific situations instead of using generic libraries to eliminate the overhead of general-purpose features.
Leverage Caching: Store the results of previously performed work and return them immediately upon subsequent calls.

9. Make the Compiler’s Job Easier

Provide Optimization Hints: Remove calls inside hot functions, prevent aliasing by copying to local variables, and manually unroll loops to help the compiler generate better machine code.

10. Reduce Stats Collection Costs

Optimize Statistics: Aggressively remove unused statistics and use sampling to record only a fraction of requests, lowering the overhead.

11. Avoid Logging on Hot Code Paths

Remove and Defer Logging: In frequently executed paths, check if logging is enabled beforehand or remove logging altogether to preserve execution speed.

12. Code Size Considerations

Trim Frequently Inlined Code: Exclude large code blocks, such as error handling in widely used functions, from inlining to prevent binary bloat.
Inline Judiciously: Indiscriminate inlining increases binary size and degrades instruction cache efficiency, so control it appropriately.
Reduce Template Instantiation: Extract common logic out of templates to reduce duplicated machine code.
Reduce Container Operations: Use bulk operations to process tasks like map initialization all at once to reduce the generated code size.

13. Parallelization and Synchronization

Exploit Parallelism: Divide tasks to utilize multi-cores, but manage coordination costs using batch-unit processing.
Amortize Lock Acquisition: Process multiple operations under a single lock to reduce the number of lock acquisitions.
Keep Critical Sections Short: Minimize work performed while holding a lock, and perform expensive operations outside critical sections.
Reduce Contention through Sharding: Split a single lock into multiple shards to reduce collision probability among threads.
SIMD Instructions: Actively utilize CPU capabilities that process multiple data items with a single instruction.
Reduce False Sharing: Place thread-specific data on different cache lines to prevent unnecessary cache synchronization.
Reduce Context Switch Frequency: Process tiny tasks directly on the current thread instead of passing them to a thread pool.
Buffered Channels for Pipelining: Provide appropriate buffers in channels to reduce waiting times between sender and receiver threads.
Lock-free Access: Use atomic operations instead of mutexes to implement wait-free data structures.

14. Protocol Buffer Advice

Minimize Serialization Costs: Avoid unnecessary nesting, prioritize field numbers 1–15, choose appropriate numeric types, use VIEW/CORD, and utilize arenas to reduce protobuf overhead.

15. C++-Specific Advice

Leverage Optimized Libraries: Apply language- and library-specific techniques, such as using flat_hash_map, InlinedVector, and optimizing Status/StatusOr.

Takeaways

Hardware Does Not Lie: Understanding the cache hierarchy and memory access costs is the starting point of performance tuning.
Design Is Performance: Designing efficient APIs and data layouts is far more powerful than optimizing code later.
Small Improvements Accumulate: Do not ignore a 1% improvement. These savings stack up to determine the stability of large-scale systems.
Prove Hypotheses by Measuring: Do not rely on intuition alone; use profilers to verify where resources are actually leaking.

Reference

Jeffrey Dean & Sanjay Ghemawat, Performance Hints, 2025

songkwwwwwww