perf进阶-event实践
perf进阶event实践
先导
perf list可以看当前软硬件环境支持的性能事件,由于性能事件过多,当程序运行异常或耗时过长时,往往不清楚查找哪一种事件。下面给出一种选择性能事件的方案:
perf list
List of pre-defined events (to be used in -e):
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
alignment-faults [Software event]
bpf-output [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
duration_time [Tool event]
L1-dcache-load-misses [Hardware cache event]
示例程序
#include<iostream>
#include<cmath>
#include<vector>
#include<cstdlib>
#include<omp.h>
#include <sys/time.h>
using namespace std;
int calcu() {
int sum = 0;
for (int i = 0; i < 1000000; i++) {
sum += sin((double)rand()/ RAND_MAX);
}
return sum;
}
int main() {
unsigned seed = 2;
vector<int> res(800, 0);
srand(seed);
struct timeval t1,t2;
double timeuse;
gettimeofday(&t1,NULL);
#pragma omp parallel for schedule(dynamic)
for (int jobid = 0; jobid < 8; jobid++) {
for (int i = 0; i < 100; i++) {
res[jobid * 100 + i] = calcu();
}
}
gettimeofday(&t2,NULL);
timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0;
cout<<"time = "<<timeuse<<endl;
return 0;
}
编译运行
g++ -fopenmp main.cpp -o m -lm
numactl -C 0-7 ./m
top查看异常
(base) [root@localhost ~]# top
top - 09:39:57 up 88 days, 5:33, 3 users, load average: 2.74, 1.13, 0.95
Tasks: 973 total, 2 running, 489 sleeping, 2 stopped, 0 zombie
%Node0 : 6.3 us, 24.2 sy, 0.0 ni, 69.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 18.1 us, 75.3 sy, 0.0 ni, 6.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 16.3 us, 77.0 sy, 0.0 ni, 6.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 19.2 us, 74.0 sy, 0.0 ni, 6.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 22.5 us, 70.8 sy, 0.0 ni, 6.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 20.8 us, 72.3 sy, 0.0 ni, 6.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 19.1 us, 74.6 sy, 0.0 ni, 6.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 20.9 us, 72.7 sy, 0.0 ni, 6.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 18.4 us, 74.9 sy, 0.0 ni, 6.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
top中us表示用户态耗时,sy表示内核空间耗时,可以看到内核态sy耗时占比过高,但内核态可能是io事件,缺页事件,线程切换事件等等,
vmstat查看具体异常
vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 1039850624 8000 27598912 0 0 0 0 159 155 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 155 142 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 206 276 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 142 168 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 99 98 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 117 112 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 197 258 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 111 139 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 114 108 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 106 106 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 167 227 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 146 168 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 149 150 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 128 124 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 198 261 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 175 186 0 0 100 0 0
0 0 0 1039850624 8000 27598912 0 0 0 0 98 98 0 0 100 0 0
#此处运行实例程序
8 0 0 1039847616 8000 27598912 0 0 0 0 263186 524562 1 6 92 0 0
8 0 0 1039847616 8000 27598912 0 0 0 0 263039 524333 1 6 92 0 0
8 0 0 1039847616 8000 27598912 0 0 0 0 263137 524484 2 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263492 525213 1 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263466 525086 2 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263313 524761 2 6 92 0 0
7 0 0 1039847808 8000 27598912 0 0 0 0 263365 524910 1 7 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263534 525192 1 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 264147 526475 1 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263072 524299 1 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263856 525886 1 6 92 0 0
7 0 0 1039847808 8000 27598912 0 0 0 0 263943 526185 2 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263606 525400 1 6 92 0 0
7 0 0 1039847808 8000 27598912 0 0 0 0 263725 525615 1 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263397 524954 1 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263737 525656 1 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263283 524787 1 6 92 0 0
8 0 0 1039847616 8000 27598912 0 0 0 0 264191 526448 1 7 92 0 0
8 0 0 1039847616 8000 27598912 0 0 0 0 263084 524249 1 6 92 0 0
6 0 0 1039847616 8000 27598912 0 0 0 0 264028 526191 1 7 92 0 0
8 0 0 1039847616 8000 27598912 0 0 0 0 263550 525169 1 6 92 0 0
7 0 0 1039847808 8000 27598912 0 0 0 17 263382 524900 1 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 263891 525919 1 6 92 0 0
7 0 0 1039847808 8000 27598912 0 0 0 0 264202 526549 2 6 92 0 0
8 0 0 1039847808 8000 27598912 0 0 0 0 264129 526399 2 6 92 0 0
其中in与cs与运行前相比异常:
-in:每秒被中断的进程次数。
-cs:每秒进行的事件切换次数。
这两个数越大,代表系统与接口设备的通信越繁忙。
cs切换肯定会导致in,先查找cs比较合理。
perf追踪性能事件
查找perf是否支持cs事件
perf list | grep cs
context-switches OR cs [Software event]
scsi:scsi_dispatch_cmd_done [Tracepoint event]
scsi:scsi_dispatch_cmd_error [Tracepoint event]
scsi:scsi_dispatch_cmd_start [Tracepoint event]
scsi:scsi_dispatch_cmd_timeout [Tracepoint event]
scsi:scsi_eh_wakeup [Tracepoint event]
ucsi:ucsi_connector_change [Tracepoint event]
ucsi:ucsi_register_altmode [Tracepoint event]
ucsi:ucsi_register_port [Tracepoint event]
ucsi:ucsi_reset_ppm [Tracepoint event]
ucsi:ucsi_run_command [Tracepoint event]
第一行发现支持。
执行perf 查看cs事件,
numactl -C 0-7 perf record -g -e cs ./m
perf report

通过展开main可以发现在调用rand()时,函数内部会使用do_futex(对线程上锁)导致线程频繁切换至就绪队列等待(futex_wait/futex_wait_queue_me),使得程序耗时在多线程切换中。
得出结论,影响多线程性能的代码为rand()。
一种解决方案
调用c++标准库中的随机数生成函数,default_random_engine dre;
代码如下:
int calcu() {
int sum = 0;
default_random_engine dre;
for (int i = 0; i < 1000000; i++) {
sum += sin((double)dre()/ RAND_MAX);
}
return sum;
}
查看top是否会频繁陷入内核态。
top - 10:11:01 up 88 days, 6:04, 3 users, load average: 0.01, 0.38, 1.20
Tasks: 969 total, 3 running, 488 sleeping, 4 stopped, 0 zombie
%Node0 : 24.6 us, 0.0 sy, 0.0 ni, 75.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 74.1 us, 0.0 sy, 0.0 ni, 25.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 74.1 us, 0.0 sy, 0.0 ni, 25.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 73.8 us, 0.0 sy, 0.0 ni, 26.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 73.8 us, 0.0 sy, 0.0 ni, 26.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 73.8 us, 0.0 sy, 0.0 ni, 26.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 73.8 us, 0.0 sy, 0.0 ni, 26.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 73.8 us, 0.0 sy, 0.0 ni, 26.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 74.1 us, 0.0 sy, 0.0 ni, 25.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
耗时比较:
#rand()随机数
]$ numactl -C 0-7 ./m
time = 171.978
#default_random_engine生成随机数
]$ g++ -fopenmp main.cpp -o m -lm
]$ numactl -C 0-7 ./m
time = 4.83206