CPU Profiling: What, How, and When

March 10, 2025 · 881 words · 5 min · Performance Analysis

What: What is CPU Profiling

CPU profiling is a technique for analyzing how a program spends its CPU time. By collecting detailed data during execution (such as function call frequency, time consumption, and call stacks), it helps developers identify performance bottlenecks and optimize code efficiency. It is typically used for performance analysis and root cause diagnosis.

How: How Profiling Data is Collected

Tools such as perf collect stack information from running processes. They work by statistical sampling: at each sampling point they capture the stack that is currently executing on the CPU, and the accumulated samples are then analyzed.

graph TD
    A[Sampling Trigger] -->|Interrupt| B[Sampling]
    B -->|perf_event/ebpf| C[Process Stack Addresses]
    C -->|Address Translation| D[ELF, OFFSET]
    D -->|Symbol Resolution| E[Call Stack]
    E -->|Formatting| F[pprof/perf script]
    F --> |Visualization| G[Flame Graph/Call Graph]

Trigger Mechanisms

Sampling is generally triggered either by timer interrupts or by event counters.

Timer Interrupts

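A timer fires at a fixed frequency and each tick takes one sample. Profilers built on setitimer receive the tick as a SIGPROF signal, while perf uses a kernel clock event such as cpu-clock. Shorter intervals increase precision but also overhead. A commonly chosen rate is 99 Hz (≈10.1 ms intervals, e.g. perf record -F 99), which avoids sampling in lockstep with other periodic activity. Without -F, recent versions of perf record use a higher default rate (4000 Hz).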
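The relationship between sampling frequency and interval is simple arithmetic. The sketch below converts between the two and estimates how many samples a workload yields. The function names are illustrative and not part of any tool, and it assumes that the 250000 figure in the perf sample output later in this article is a cpu-clock period in nanoseconds.

```python
def interval_ms(freq_hz: float) -> float:
    """Time between two samples, in milliseconds, at a given frequency."""
    return 1000.0 / freq_hz


def freq_from_period_ns(period_ns: int) -> float:
    """Sampling frequency implied by a fixed clock period in nanoseconds."""
    return 1e9 / period_ns


def expected_samples(freq_hz: float, on_cpu_seconds: float) -> float:
    """Rough number of samples collected while a thread is on-CPU."""
    return freq_hz * on_cpu_seconds


if __name__ == "__main__":
    print(f"99 Hz -> {interval_ms(99):.1f} ms between samples")
    print(f"period of 250000 ns -> {freq_from_period_ns(250_000):.0f} Hz")
    print(f"30 s on-CPU at 99 Hz -> {expected_samples(99, 30):.0f} samples")
```

A function that runs for less than one interval may receive no samples at all, which is why low frequencies trade precision for lower overhead.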

Event-Counter Sampling

A sample is taken when a hardware performance counter (e.g., PERF_COUNT_HW_CPU_CYCLES) reaches a configured threshold. Counting a different event, such as PERF_COUNT_HW_CACHE_MISSES, makes this useful for analyzing hardware-related behavior like cache misses.
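The idea of sampling on counter overflow can be illustrated with a toy model. This is only a simulation under simplifying assumptions, not how the kernel or the PMU is implemented: the trace, the function names in it, and the period are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def sample_on_overflow(trace: Iterable[Tuple[str, int]], period: int) -> Counter:
    """Take one sample each time the event counter reaches `period`.

    `trace` is a sequence of (function_name, events_generated) pairs in
    execution order. Each sample is attributed to the function that was
    running when the counter crossed the threshold.
    """
    samples: Counter = Counter()
    counter = 0  # events accumulated since the last sample
    for func, events in trace:
        counter += events
        while counter >= period:
            samples[func] += 1
            counter -= period
    return samples


if __name__ == "__main__":
    # Hypothetical trace: cache misses generated by each function in turn.
    trace = [("parse", 300), ("lookup", 1500), ("parse", 200)]
    print(sample_on_overflow(trace, period=1000))
```

Functions that generate more of the counted event collect proportionally more samples, regardless of how much wall-clock time they take.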

Sampling Methods

Stack sampling typically relies on kernel-provided interfaces such as eBPF or perf_event.

eBPF Approach

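An eBPF program attached to a sampling event can call helpers such as bpf_get_stackid to capture both user-space and kernel-space call stacks inside the kernel, with no separate unwinding pass in user space. The result is the complete list of stack instruction pointers (IPs). For user-space stacks this generally depends on the binary having been built with frame pointers.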

perf_event Approach

With the perf_event_open interface (used by the perf record command), each sample captures the instruction pointer (RIP on x86-64). By default, only the address of the currently executing instruction is recorded, not the full call stack. This means only the function that was running when the sample fired can be resolved.

Example perf record output (as printed by perf script):

node 3236535 34397396.208842:     250000 cpu-clock:pppH:           110c800 v8::internal::Heap_CombinedGenerationalAndSharedBarrierSlow+0x0 (/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node)
node 3236535 34397396.354632:     250000 cpu-clock:pppH:      7f7d63e87ef4 Builtins_LoadIC+0x574 (/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node)

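To obtain the full call stack, the stack has to be unwound. perf record -g records call chains, by default by following frame pointers. When frame pointers are not available, perf record --call-graph dwarf saves a copy of the user stack with each sample and unwinds it afterwards using DWARF unwind information, with a library such as libunwind.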

Example perf record -g output (as printed by perf script):

node 3236535 34397238.259753:     250000 cpu-clock:pppH: 
            7f7d44339100 [unknown] (/tmp/perf-3236535.map)
                 18ea0dc Builtins_JSEntryTrampoline+0x5c (/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node)
                 18e9e03 Builtins_JSEntry+0x83 (...)
...
                  c7d43f node::Start+0x58f (...)
            7f7d6ba14d90 __libc_start_call_main+0x80 (/usr/lib/x86_64-linux-gnu/libc.so.6)
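Frame-pointer unwinding, one common way to turn a single instruction pointer into a call stack like the one above, can be sketched as a walk over saved frame pointers. The example below is a simplified model that assumes the x86-64 layout where each frame stores the caller's RBP at [rbp] and the return address at [rbp + 8]. The memory dictionary and the stack addresses are made up for illustration.

```python
from typing import Dict, List


def unwind_frame_pointers(memory: Dict[int, int], rip: int, rbp: int,
                          max_depth: int = 64) -> List[int]:
    """Return [leaf_ip, caller_ip, ...] by following the saved-RBP chain.

    Assumed layout of each frame:
      memory[rbp]     -> the caller's saved RBP
      memory[rbp + 8] -> the return address into the caller
    """
    stack = [rip]
    while rbp != 0 and len(stack) < max_depth:
        saved_rbp = memory.get(rbp)
        return_address = memory.get(rbp + 8)
        if saved_rbp is None or return_address is None:
            break  # unreadable frame: stop rather than guess
        stack.append(return_address)
        if saved_rbp != 0 and saved_rbp <= rbp:
            break  # the stack grows down, so a caller's frame must sit higher
        rbp = saved_rbp
    return stack


if __name__ == "__main__":
    # Hypothetical stack memory holding three frames.
    memory = {
        0x7FFC00001000: 0x7FFC00001040, 0x7FFC00001008: 0x18EA0DC,
        0x7FFC00001040: 0x7FFC00001080, 0x7FFC00001048: 0x18E9E03,
        0x7FFC00001080: 0x0,            0x7FFC00001088: 0xC7D43F,
    }
    ips = unwind_frame_pointers(memory, rip=0x7F7D44339100, rbp=0x7FFC00001000)
    print([hex(ip) for ip in ips])
```

If a binary is compiled without frame pointers, RBP holds arbitrary data and this walk produces truncated or wrong stacks, which is the case DWARF-based unwinding addresses.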

Address Translation

The sampled address information corresponds to the process’s virtual addresses, such as:

7f7d44339100  
18ea0dc  
18e9e03  
106692b  
10679c4  
f2a090d  
c1c738  
...

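To resolve these addresses into ELF + OFFSET pairs for symbol translation, we use the memory mapping information in /proc/[pid]/maps. The key fields of each maps entry are the virtual address range, the permissions, the file offset of the mapping, the device, the inode, and the path of the mapped file.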

Example /proc/[pid]/maps entries:

00400000-00b81000 r--p 00000000 fc:03 550055  /root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node  
7f7d6bf3c000-7f7d6bf3d000 ---p 0021a000 fc:03 67  /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30  
7f7d6bf61000-7f7d6bf63000 r--p 00000000 fc:03 2928  /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2  

Translation Process

  1. Match the virtual address to the appropriate memory segment in /proc/[pid]/maps.
  2. Calculate the offset within the ELF file using: offset = virtual_address - segment_start + file_offset, where segment_start and file_offset are taken from the matched maps entry.
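The two steps above can be sketched as follows. This is a minimal illustration rather than a full parser: the Mapping type and function names are introduced here, and the sampled addresses in the usage example are hypothetical values chosen to fall inside the maps entries shown earlier.

```python
from typing import List, NamedTuple, Optional, Tuple


class Mapping(NamedTuple):
    start: int        # first virtual address of the segment
    end: int          # one past the last virtual address
    perms: str        # e.g. "r-xp"
    file_offset: int  # offset of the segment within the mapped file
    path: str         # mapped file, empty for anonymous mappings


def parse_maps_line(line: str) -> Mapping:
    """Parse one maps line: range, perms, offset, device, inode, [path]."""
    fields = line.split(None, 5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], int(fields[2], 16), path)


def translate(mappings: List[Mapping], address: int) -> Optional[Tuple[str, int]]:
    """Return (path, offset_in_file) for an address, or None if unmapped."""
    for m in mappings:
        if m.start <= address < m.end:
            return m.path, address - m.start + m.file_offset
    return None


if __name__ == "__main__":
    lines = [
        "00400000-00b81000 r--p 00000000 fc:03 550055  /path/to/node",
        "7f7d6bf3c000-7f7d6bf3d000 ---p 0021a000 fc:03 67  "
        "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30",
    ]
    mappings = [parse_maps_line(line) for line in lines]
    for addr in (0x500000, 0x7F7D6BF3C010, 0x18EA0DC):
        print(hex(addr), "->", translate(mappings, addr))
```

The last address resolves to None because the executable segment that contains it is not among the example entries. A real implementation would read every line of the maps file.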

Symbol Resolution

After translating virtual addresses into ELF + OFFSET pairs, the next step is resolving these offsets into human-readable function symbols. This involves leveraging symbol tables or debugging information embedded in the ELF files.

Methods for Symbol Resolution

  1. Using Symbol Tables: Tools like nm can extract symbol information from the .dynsym (dynamic symbol table) or .symtab (static symbol table) sections of an ELF file.

Example:

# Extract malloc-related symbols from a Node.js binary
nm -D /path/to/node | grep malloc
# Output:
00000000055f9d18 D ares_malloc
0000000001f1a2a0 T ares_malloc_data
...
                 U malloc@GLIBC_2.2.5

  2. Using DWARF Debugging Information: DWARF debug data provides richer details, including source file locations and variable scopes. Tools like readelf or addr2line can parse this information.

Example:

# Extract function names and source locations from DWARF info
readelf --debug-dump=info /path/to/node | grep "DW_AT_name" -A3
# Output:
<1><1980>: DW_AT_name: uv__make_close_pending
    DW_AT_decl_file: 19
    DW_AT_decl_line: 247

  3. Demangling C++ Symbols: C++ symbols are often mangled (encoded) for uniqueness. Tools like c++filt restore human-readable names.

Example:

# Demangle a mangled symbol
echo "_ZN4node14ThreadPoolWork12ScheduleWorkEv" | c++filt
# Output:
node::ThreadPoolWork::ScheduleWork()
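Once symbols have been extracted, resolving an address amounts to finding the nearest symbol that starts at or below it. The sketch below uses a binary search over sorted start addresses. The SymbolTable class is illustrative, and the start addresses are derived from the perf output earlier in this article (sampled address minus the printed offset). Treating sampled addresses as directly comparable to symbol values is a simplification that only holds when no load-address adjustment is needed.

```python
import bisect
from typing import List, Optional, Tuple


class SymbolTable:
    """Sorted (start_address, name) pairs with nearest-symbol lookup."""

    def __init__(self, symbols: List[Tuple[int, str]]):
        self._symbols = sorted(symbols)
        self._starts = [start for start, _ in self._symbols]

    def resolve(self, address: int) -> Optional[str]:
        """Return 'name+0xoffset' for the closest symbol at or below address."""
        index = bisect.bisect_right(self._starts, address) - 1
        if index < 0:
            return None  # address lies below the first known symbol
        start, name = self._symbols[index]
        return f"{name}+{address - start:#x}"


if __name__ == "__main__":
    table = SymbolTable([
        (0xC7CEB0, "node::Start"),
        (0x18E9D80, "Builtins_JSEntry"),
        (0x18EA080, "Builtins_JSEntryTrampoline"),
    ])
    for addr in (0x18EA0DC, 0x18E9E03, 0xC7D43F):
        print(hex(addr), "->", table.resolve(addr))
```

This version ignores symbol sizes, so an address past the end of the last function in a gap would still be attributed to it. Checking the size recorded in the symbol table avoids that.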

Stack Output Formatting

Resolved stack traces are formatted for analysis tools like pprof or perf script. Additional metadata (e.g., container ID, service type) may be included for aggregation.
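One widely used intermediate format is the "folded" stack, where each unique stack becomes a single line of semicolon-separated frames (root first) followed by its sample count. The sketch below shows that aggregation. It is an illustration of the idea rather than the output code of pprof or perf script, and the sample stacks are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def fold_stacks(stacks: Iterable[List[str]]) -> List[str]:
    """Collapse leaf-first stacks into 'root;...;leaf count' lines."""
    counts: Counter = Counter()
    for frames in stacks:
        # Samples list the leaf first, folded lines list the root first.
        counts[";".join(reversed(frames))] += 1
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


if __name__ == "__main__":
    samples = [
        ["Builtins_JSEntry", "node::Start", "main"],
        ["Builtins_JSEntry", "node::Start", "main"],
        ["malloc", "node::Start", "main"],
    ]
    for line in fold_stacks(samples):
        print(line)
```

Metadata such as a container ID could be carried as an extra root frame or kept alongside each line, depending on how the aggregation backend groups profiles.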

Data Visualization

All of the data above is eventually rendered as a flame graph or a call-chain graph.
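A flame graph is essentially a tree in which each node's width is the number of samples that passed through that frame. As a rough sketch of the data structure behind the rendering, the function below merges lines of the form "root;child;leaf count" into such a tree. The input format and the node layout are assumptions made for this example.

```python
from typing import Dict, List


def build_flame_tree(folded_lines: List[str]) -> Dict:
    """Merge 'a;b;c count' lines into a nested tree with per-node totals."""
    root = {"name": "all", "value": 0, "children": {}}
    for line in folded_lines:
        # The count is the last space-separated field; frames may contain spaces.
        stack, _, count_text = line.rpartition(" ")
        count = int(count_text)
        root["value"] += count
        node = root
        for frame in stack.split(";"):
            node = node["children"].setdefault(
                frame, {"name": frame, "value": 0, "children": {}})
            node["value"] += count
    return root


def print_tree(node: Dict, depth: int = 0) -> None:
    """Print the tree with indentation; value corresponds to box width."""
    print("  " * depth + f"{node['name']} ({node['value']})")
    for child in node["children"].values():
        print_tree(child, depth + 1)


if __name__ == "__main__":
    print_tree(build_flame_tree(["main;parse 2", "main;render;draw 1"]))
```

A renderer then draws each node as a rectangle whose width is proportional to value, stacking children on top of their parent.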

When: When to Use CPU Profiling Tools

CPU profiling is most effective when analyzing CPU-bound performance issues. Below are common scenarios and their workflows:

graph TD
  A[Observe anomaly: Unavailability/Performance Jitter] --> B[Identify target process & timeframe]
  B --> C[Check core metrics: CPU, memory, disk, QPS]
  C --> D{Is CPU the bottleneck?}
  D -->|Yes| E[Profile CPU stacks]
  D -->|No| F[Use alternative tools e.g., memory profiler, I/O tracer]
  E --> G[Analyze flame graphs/call chains]
  G --> H[Root cause identified]

| Scenario Category | Typical Symptoms | Tool Choices | Data Collection Strategy |
| --- | --- | --- | --- |
| Sudden CPU Spikes | Sawtooth-shaped CPU peaks in monitoring charts. | Continuous Profiling Systems | Capture 5-minute context before/after spikes + regular sampling. |
| Version Performance Regression | QPS/TPS drops post-deployment. | Differential FlameGraph | A/B version comparison sampling under identical loads. |
| High CpuSys | Elevated OS kernel CPU usage causing host instability. | FlameGraph/Call-Chain Graph | Regular sampling with kernel stack analysis. |
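For the version regression scenario, the comparison behind a differential flame graph boils down to comparing each function's share of samples between two profiles. The sketch below illustrates that calculation under the assumption that both profiles were collected under comparable load. The function names and sample counts are hypothetical.

```python
from typing import Dict


def diff_profiles(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Change in each function's share of samples, in percentage points.

    Shares are used instead of raw counts so that profiles with different
    total sample counts can still be compared.
    """
    total_before = sum(before.values()) or 1
    total_after = sum(after.values()) or 1
    delta: Dict[str, float] = {}
    for func in set(before) | set(after):
        share_before = before.get(func, 0) / total_before
        share_after = after.get(func, 0) / total_after
        delta[func] = round(100 * (share_after - share_before), 2)
    return delta


if __name__ == "__main__":
    old_version = {"parse": 50, "render": 50}
    new_version = {"parse": 150, "render": 50}
    changes = diff_profiles(old_version, new_version)
    for func, change in sorted(changes.items(), key=lambda kv: -kv[1]):
        print(f"{func}: {change:+.2f} points")
```

Functions with the largest positive change are the first candidates to inspect after a deployment.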

When CPU Profiling Is NOT Suitable

For non-CPU-bound issues, profiling data may have limited value. Alternative tools are recommended:

graph TD
  A[CPU Profiling Limitations] --> B[Memory Bottlenecks]
  A --> C[I/O-Bound Workloads]
  A --> D[Lock Contention]
  A --> E[Short-lived Processes]

  B -->|Signs| B1(High page faults, GC pauses)
  B -->|Tools| B2{{Heap profiler: e.g., pprof, vmstat}}

  C -->|Signs| C1(High iowait, low CPU utilization)
  C -->|Tools| C2{{iostat, blktrace}}

  D -->|Signs| D1(High context switches, sys%)
  D -->|Tools| D2{{perf lock, lockstat}}

  E -->|Signs| E1(Process lifetime < sampling interval)
  E -->|Tools| E2{{execsnoop, dynamic tracing: e.g., bpftrace}}
