perf进阶-event实践

perf进阶event实践

先导

perf list可以看当前软硬件环境支持的性能事件，由于性能事件过多，当程序运行异常或耗时过长时，往往不清楚查找哪一种事件。下面给出一种选择性能事件的方案：

perf list
List of pre-defined events (to be used in -e):

  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  duration_time                                      [Tool event]

  L1-dcache-load-misses                              [Hardware cache event]

示例程序

#include<iostream>
#include<cmath>
#include<vector>
#include<cstdlib>
#include<omp.h>
#include <sys/time.h>
using namespace std;

int calcu() {
    int sum = 0;
    for (int i = 0; i < 1000000; i++) {
        sum += sin((double)rand()/ RAND_MAX); 
    }
    return sum;
}

int main() {
    unsigned seed = 2;
    vector<int> res(800, 0);
    srand(seed);
    struct timeval t1,t2;
    double timeuse;
    gettimeofday(&t1,NULL);
    #pragma omp parallel for schedule(dynamic)
    for (int jobid = 0; jobid < 8; jobid++) {
        for (int i = 0; i < 100; i++) {
            res[jobid * 100 + i] = calcu();
        }
    }
    gettimeofday(&t2,NULL);
    timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0;
    cout<<"time = "<<timeuse<<endl;
    return 0;
}

编译运行

g++ -fopenmp main.cpp -o m -lm
numactl -C 0-7 ./m

top查看异常

(base) [root@localhost ~]# top
top - 09:39:57 up 88 days,  5:33,  3 users,  load average: 2.74, 1.13, 0.95
Tasks: 973 total,   2 running, 489 sleeping,   2 stopped,   0 zombie
%Node0 :  6.3 us, 24.2 sy,  0.0 ni, 69.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu0  : 18.1 us, 75.3 sy,  0.0 ni,  6.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 16.3 us, 77.0 sy,  0.0 ni,  6.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 19.2 us, 74.0 sy,  0.0 ni,  6.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 22.5 us, 70.8 sy,  0.0 ni,  6.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 20.8 us, 72.3 sy,  0.0 ni,  6.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 19.1 us, 74.6 sy,  0.0 ni,  6.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 20.9 us, 72.7 sy,  0.0 ni,  6.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 18.4 us, 74.9 sy,  0.0 ni,  6.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

top中us表示用户态耗时，sy表示内核空间耗时，可以看到内核态sy耗时占比过高，但内核态可能是io事件，缺页事件，线程切换事件等等，

vmstat查看具体异常

vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 1039850624   8000 27598912    0    0     0     0  159  155  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  155  142  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  206  276  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  142  168  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0   99   98  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  117  112  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  197  258  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  111  139  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  114  108  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  106  106  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  167  227  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  146  168  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  149  150  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  128  124  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  198  261  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0  175  186  0  0 100  0  0
 0  0      0 1039850624   8000 27598912    0    0     0     0   98   98  0  0 100  0  0
 #此处运行实例程序
 8  0      0 1039847616   8000 27598912    0    0     0     0 263186 524562  1  6 92  0  0
 8  0      0 1039847616   8000 27598912    0    0     0     0 263039 524333  1  6 92  0  0
 8  0      0 1039847616   8000 27598912    0    0     0     0 263137 524484  2  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263492 525213  1  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263466 525086  2  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263313 524761  2  6 92  0  0
 7  0      0 1039847808   8000 27598912    0    0     0     0 263365 524910  1  7 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263534 525192  1  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 264147 526475  1  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263072 524299  1  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263856 525886  1  6 92  0  0
 7  0      0 1039847808   8000 27598912    0    0     0     0 263943 526185  2  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263606 525400  1  6 92  0  0
 7  0      0 1039847808   8000 27598912    0    0     0     0 263725 525615  1  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263397 524954  1  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263737 525656  1  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263283 524787  1  6 92  0  0
 8  0      0 1039847616   8000 27598912    0    0     0     0 264191 526448  1  7 92  0  0
 8  0      0 1039847616   8000 27598912    0    0     0     0 263084 524249  1  6 92  0  0
 6  0      0 1039847616   8000 27598912    0    0     0     0 264028 526191  1  7 92  0  0
 8  0      0 1039847616   8000 27598912    0    0     0     0 263550 525169  1  6 92  0  0
 7  0      0 1039847808   8000 27598912    0    0     0    17 263382 524900  1  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 263891 525919  1  6 92  0  0
 7  0      0 1039847808   8000 27598912    0    0     0     0 264202 526549  2  6 92  0  0
 8  0      0 1039847808   8000 27598912    0    0     0     0 264129 526399  2  6 92  0  0

其中in与cs与运行前相比异常：
-in：每秒被中断的进程次数。
-cs：每秒进行的事件切换次数。
这两个数越大，代表系统与接口设备的通信越繁忙。
cs切换肯定会导致in，先查找cs比较合理。

perf追踪性能事件

查找perf是否支持cs事件

perf list | grep cs
  context-switches OR cs                             [Software event]
  scsi:scsi_dispatch_cmd_done                        [Tracepoint event]
  scsi:scsi_dispatch_cmd_error                       [Tracepoint event]
  scsi:scsi_dispatch_cmd_start                       [Tracepoint event]
  scsi:scsi_dispatch_cmd_timeout                     [Tracepoint event]
  scsi:scsi_eh_wakeup                                [Tracepoint event]
  ucsi:ucsi_connector_change                         [Tracepoint event]
  ucsi:ucsi_register_altmode                         [Tracepoint event]
  ucsi:ucsi_register_port                            [Tracepoint event]
  ucsi:ucsi_reset_ppm                                [Tracepoint event]
  ucsi:ucsi_run_command                              [Tracepoint event]

第一行发现支持。
执行perf 查看cs事件，

numactl -C 0-7 perf record -g -e cs ./m
perf report

在这里插入图片描述
通过展开main可以发现在调用rand()时，函数内部会使用do_futex（对线程上锁）导致线程频繁切换至就绪队列等待(futex_wait/futex_wait_queue_me)，使得程序耗时在多线程切换中。
得出结论，影响多线程性能的代码为rand()。

一种解决方案

调用c++标准库中的随机数生成函数，default_random_engine dre;
代码如下：

int calcu() {
    int sum = 0;
    default_random_engine dre;
    for (int i = 0; i < 1000000; i++) {
        sum += sin((double)dre()/ RAND_MAX); 
    }
    return sum;
}

查看top是否会频繁陷入内核态。

top - 10:11:01 up 88 days,  6:04,  3 users,  load average: 0.01, 0.38, 1.20
Tasks: 969 total,   3 running, 488 sleeping,   4 stopped,   0 zombie
%Node0 : 24.6 us,  0.0 sy,  0.0 ni, 75.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu0  : 74.1 us,  0.0 sy,  0.0 ni, 25.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 74.1 us,  0.0 sy,  0.0 ni, 25.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 73.8 us,  0.0 sy,  0.0 ni, 26.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 73.8 us,  0.0 sy,  0.0 ni, 26.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 73.8 us,  0.0 sy,  0.0 ni, 26.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 73.8 us,  0.0 sy,  0.0 ni, 26.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 73.8 us,  0.0 sy,  0.0 ni, 26.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 74.1 us,  0.0 sy,  0.0 ni, 25.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

耗时比较：

#rand()随机数
]$ numactl -C 0-7 ./m
time = 171.978
#default_random_engine生成随机数
]$ g++ -fopenmp main.cpp -o m -lm
]$ numactl -C 0-7 ./m
time = 4.83206