CPU Profiling: What, How, and When
March 10, 2025 · 881 words · 5 min · Performance Analysis
What: What is CPU Profiling
CPU profiling is a technique for analyzing a program's CPU performance. By collecting detailed data during execution (function call frequency, time spent, call stacks, and so on), it helps developers locate performance bottlenecks and optimize code. It is typically used for performance analysis and root-cause diagnosis.
How: How Profiling Data is Collected
Common tools like perf are used to collect process stack information. They rely on statistical sampling: periodically capturing the stacks executing on the CPU and aggregating the samples for performance analysis.
graph TD
A[Sampling Trigger] -->|Interrupt| B[Sampling]
B -->|perf_event/ebpf| C[Process Stack Addresses]
C -->|Address Translation| D[ELF, OFFSET]
D -->|Symbol Resolution| E[Call Stack]
E -->|Formatting| F[pprof/perf script]
F --> |Visualization| G[Flame Graph/Call Graph]
Trigger Mechanisms
Sampling is generally triggered either by timer interrupts or by event-counter thresholds.
Timer Interrupts
Sampling fires on a fixed-frequency clock interrupt (delivered as SIGPROF by signal-based profilers). Shorter intervals increase precision but also overhead. With Linux perf, 99 Hz (≈10.1 ms between samples) is a common choice, often used to avoid sampling in lockstep with 100 Hz periodic timers.
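As a minimal sketch, timer-based sampling with perf might look like this (the PID and the 30-second duration are placeholders):
# Sample on-CPU stacks at 99 Hz, with call graphs, for 30 seconds
perf record -F 99 -g -p <pid> -- sleep 30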
Event-Counter Sampling
Sampling is triggered when a hardware performance counter (e.g., PERF_COUNT_HW_CPU_CYCLES) reaches a threshold. This is useful for analyzing hardware-level events such as cache misses.
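For example (a sketch; the event and sample period are illustrative):
# Take one sample every 10,000 cache misses instead of on a timer
perf record -e cache-misses -c 10000 -g -p <pid> -- sleep 30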
Sampling Methods
Stack sampling typically uses kernel-provided interfaces such as eBPF or perf_event.
eBPF Approach
Using eBPF helpers such as bpf_get_stackid, an eBPF program can capture both user-space and kernel-space call stacks directly at sample time: the kernel walks the stack itself, so no separate user-space unwinding pass is needed, and the complete list of stack instruction pointers is retrieved.
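A minimal sketch with bpftrace, which drives these eBPF stack helpers under the hood (the PID filter is a placeholder):
# Sample kernel and user stacks of one process at 99 Hz and count them
bpftrace -e 'profile:hz:99 /pid == 1234/ { @[kstack, ustack] = count(); }'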
perf_event Approach
The perf_event_open interface (used by the perf record command) captures the instruction pointer (RIP) at each sample. By default it records only the address of the currently executing function, not the full call stack, so only the function in which the sample landed can be resolved.
Example perf record output:
node 3236535 34397396.208842: 250000 cpu-clock:pppH: 110c800 v8::internal::Heap_CombinedGenerationalAndSharedBarrierSlow+0x0 (/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node)
node 3236535 34397396.354632: 250000 cpu-clock:pppH: 7f7d63e87ef4 Builtins_LoadIC+0x574 (/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node)
To obtain a full call stack, the sampled stack must be unwound, e.g., via frame pointers or tools like libunwind. For example, perf record -g produces full stack traces by unwinding the stack frames captured at sample time.
Example perf record -g output:
node 3236535 34397238.259753: 250000 cpu-clock:pppH:
7f7d44339100 [unknown] (/tmp/perf-3236535.map)
18ea0dc Builtins_JSEntryTrampoline+0x5c (/root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node)
18e9e03 Builtins_JSEntry+0x83 (...)
...
c7d43f node::Start+0x58f (...)
7f7d6ba14d90 __libc_start_call_main+0x80 (/usr/lib/x86_64-linux-gnu/libc.so.6)
Address Translation
The sampled addresses are virtual addresses in the process's address space, such as:
7f7d44339100
18ea0dc
18e9e03
106692b
10679c4
f2a090d
c1c738
...
To resolve these addresses into ELF + OFFSET pairs for symbol translation, we use the memory-mapping information from /proc/[pid]/maps. Each entry records a segment's address range, permissions, file offset, device, inode, and backing file path. Example /proc/[pid]/maps entries:
00400000-00b81000 r--p 00000000 fc:03 550055 /root/.vscode-server/cli/servers/Stable-e54c774e0add60467559eb0d1e229c6452cf8447/server/node
7f7d6bf3c000-7f7d6bf3d000 ---p 0021a000 fc:03 67 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30
7f7d6bf61000-7f7d6bf63000 r--p 00000000 fc:03 2928 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
Translation Process
- Match the virtual address to the memory segment in /proc/[pid]/maps that contains it.
- Calculate the offset within the ELF file using: offset = virtual_address - segment_start + file_offset
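A minimal shell sketch of this lookup (the script name is hypothetical; pass the address without the 0x prefix):
#!/usr/bin/env bash
# vaddr2off.sh: map a virtual address to ELF + offset via /proc/[pid]/maps
# Usage (hypothetical): ./vaddr2off.sh <pid> <hex address>
pid=$1; addr=$((16#$2))
while read -r range perms off dev inode path; do
  start=$((16#${range%-*})); end=$((16#${range#*-}))
  if (( addr >= start && addr < end )); then
    # offset = virtual_address - segment_start + file_offset
    printf '%s+0x%x\n' "$path" $((addr - start + 16#$off))
    break
  fi
done < "/proc/$pid/maps"
Profilers do this same lookup internally; perf, for instance, uses the mmap records it captures alongside the samples.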
Symbol Resolution
After translating virtual addresses into ELF + OFFSET pairs, the next step is resolving these offsets into human-readable function symbols. This involves leveraging symbol tables or debugging information embedded in the ELF files.
Methods for Symbol Resolution
- Using symbol tables: tools like nm can extract symbol information from an ELF file's .dynsym (dynamic symbol table) or .symtab (static symbol table) sections.
Example:
# Extract malloc-related symbols from a Node.js binary
nm -D /path/to/node | grep malloc
# Output:
00000000055f9d18 D ares_malloc
0000000001f1a2a0 T ares_malloc_data
...
U malloc@GLIBC_2.2.5
- Using DWARF debugging information: DWARF debug data provides richer detail, including source file locations and variable scopes. Tools like readelf or addr2line can parse this information.
Example:
# Extract function names and source locations from DWARF info
readelf --debug-dump=info /path/to/node | grep "DW_AT_name" -A3
# Output:
<1><1980>: DW_AT_name: uv__make_close_pending
DW_AT_decl_file: 19
DW_AT_decl_line: 247
- Demangling C++ symbols: C++ symbols are often mangled (encoded) for uniqueness. Tools like c++filt restore human-readable names.
Example:
# Demangle a mangled symbol
echo "_ZN4node14ThreadPoolWork12ScheduleWorkEv" | c++filt
# Output:
node::ThreadPoolWork::ScheduleWork()
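Tying these steps together: given an ELF path and offset from the translation step, addr2line can resolve it in one call (the binary path and offset below are placeholders):
# Resolve an offset to a demangled function name; with DWARF debug
# info present, it also prints the source file and line
addr2line -f -C -e /path/to/node 0x10679c4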
Stack Output Formatting
Resolved stack traces are formatted for analysis tools like pprof or perf script. Additional metadata (e.g., container ID, service type) may be included for aggregation.
Data Visualization
All of the data above is eventually rendered as a flame graph or a call graph.
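For example, one common pipeline renders perf samples into an SVG flame graph using Brendan Gregg's FlameGraph scripts (https://github.com/brendangregg/FlameGraph, assumed to be in the current directory):
# Dump symbolized samples, fold identical stacks, and render the SVG
perf script -i perf.data > out.perf
./stackcollapse-perf.pl out.perf > out.folded
./flamegraph.pl out.folded > flame.svg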
When: When to Use CPU Profiling Tools
CPU profiling is most effective when analyzing CPU-bound performance issues. Below are common scenarios and their workflows:
graph TD
A[Observe anomaly: Unavailability/Performance Jitter] --> B[Identify target process & timeframe]
B --> C[Check core metrics: CPU, memory, disk, QPS]
C --> D{Is CPU the bottleneck?}
D -->|Yes| E[Profile CPU stacks]
D -->|No| F[Use alternative tools e.g., memory profiler, I/O tracer]
E --> G[Analyze flame graphs/call chains]
G --> H[Root cause identified]
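As a sketch, the "Is CPU the bottleneck?" check can be answered quickly with the sysstat tools (the PID is a placeholder):
# Per-CPU utilization, refreshed every second
mpstat -P ALL 1
# Per-process user vs. system CPU breakdown
pidstat -u 1 -p <pid>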
| Scenario Category | Typical Symptoms | Tool Choices | Data Collection Strategy |
|---|---|---|---|
| Sudden CPU spikes | Sawtooth-shaped CPU peaks in monitoring charts. | Continuous profiling systems | Capture a 5-minute context before/after the spike, plus regular sampling. |
| Version performance regression | QPS/TPS drops after a deployment. | Differential flame graph | A/B comparison sampling of both versions under identical load. |
| High sys CPU | Elevated kernel-space CPU usage causing host instability. | Flame graph / call graph | Regular sampling with kernel stack analysis. |
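For the regression row above, a differential flame graph can be produced with difffolded.pl from the FlameGraph repo (a sketch, assuming folded stack files were collected for both versions):
# Shading highlights stacks that grew or shrank between versions
./difffolded.pl v1.folded v2.folded | ./flamegraph.pl > diff.svg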
When CPU Profiling Is NOT Suitable
For non-CPU-bound issues, profiling data may have limited value. Alternative tools are recommended:
graph TD
A[CPU Profiling Limitations] --> B[Memory Bottlenecks]
A --> C[I/O-Bound Workloads]
A --> D[Lock Contention]
A --> E[Short-lived Processes]
B -->|Signs| B1(High page faults, GC pauses)
B -->|Tools| B2{{Heap profiler: e.g., pprof, vmstat}}
C -->|Signs| C1(High iowait, low CPU utilization)
C -->|Tools| C2{{iostat, blktrace}}
D -->|Signs| D1(High context switches, sys%)
D -->|Tools| D2{{perf lock, lockstat}}
E -->|Signs| E1(Process lifetime < sampling interval)
E -->|Tools| E2{{execsnoop, dynamic tracing: e.g., bpftrace}}
References
- Code example: https://github.com/noneback/doctor
- Stack unwinding: https://zhuanlan.zhihu.com/p/460686470
- proc_pid_maps(5): https://man7.org/linux/man-pages/man5/proc_pid_maps.5.html
- Demangling and mangling: https://www.cnblogs.com/BloodAndBone/p/7912179.html