Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

From: Song Liu
Date: Thu Mar 18 2021 - 20:22:59 EST




> On Mar 18, 2021, at 5:09 PM, Arnaldo <arnaldo.melo@xxxxxxxxx> wrote:
>
>
>
> On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
>> On Thu, Mar 18, 2021 at 03:52:51AM +0000, Song Liu wrote:
>>>
>>>
>>>> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:
>>>>
>>>> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
>>>>> Hi Song,
>>>>>
>>>>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu <songliubraving@xxxxxx> wrote:
>>>>>>
>>>>>> perf uses performance monitoring counters (PMCs) to monitor system
>>>>>> performance. The PMCs are limited hardware resources. For example,
>>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>>>>
>>>>>> Modern data center systems use these PMCs in many different ways:
>>>>>> system level monitoring, (maybe nested) container level monitoring, per
>>>>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>>>>> there are more active perf_events than available hardware PMCs. To allow
>>>>>> all perf_events to have a chance to run, it is necessary to do expensive
>>>>>> time multiplexing of events.
>>>>>>
>>>>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>>>>> instructions). It is a waste to have multiple tools create multiple
>>>>>> perf_events of "cycles" and occupy multiple PMCs.
>>>>>
>>>>> Right, it'd be really helpful when the PMCs are frequently or mostly shared.
>>>>> But it'd also increase the overhead for uncontended cases as BPF programs
>>>>> need to run on every context switch. Depending on the workload, it may
>>>>> cause a non-negligible performance impact. So users should be aware of it.
>>>>
>>>> Would be interesting to, humm, measure both cases to have a firm number
>>>> of the impact, how many instructions are added when sharing using
>>>> --bpf-counters?
>>>>
>>>> I.e. compare the "expensive time multiplexing of events" with its
>>>> avoidance by using --bpf-counters.
>>>>
>>>> Song, have you performed such measurements?
>>>
>>> I have got some measurements with perf-bench-sched-messaging:
>>>
>>> The system: x86_64 with 23 cores (46 HT)
>>>
>>> The perf-stat command:
>>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <target, etc.>
>>>
>>> The benchmark command and output:
>>> ./perf bench sched messaging -g 40 -l 50000 -t
>>> # Running 'sched/messaging' benchmark:
>>> # 20 sender and receiver threads per group
>>> # 40 groups == 1600 threads run
>>> Total time: 10X.XXX [sec]
>>>
>>>
>>> I use the "Total time" as the measurement, so a smaller number is better.
>>>
>>> For each condition, I ran the command 5 times and took the median of
>>> "Total time".
>>>
>>> Baseline (no perf-stat)                 104.873 [sec]
>>> # global
>>> perf stat -a                            107.887 [sec]
>>> perf stat -a --bpf-counters             106.071 [sec]
>>> # per task
>>> perf stat                               106.314 [sec]
>>> perf stat --bpf-counters                105.965 [sec]
>>> # per cpu
>>> perf stat -C 1,3,5                      107.063 [sec]
>>> perf stat -C 1,3,5 --bpf-counters       106.406 [sec]
>>
>> I can't see why it's actually faster than normal perf ;-)
>> It would be worth finding out.
>
> Isn't this all about contended cases?

Yeah, normal perf is doing time multiplexing here, while --bpf-counters
doesn't need it.
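
For reference, here is a rough sketch (Python, not part of the patch set)
that turns the medians quoted above into overhead relative to the baseline.
The names MEDIANS and overhead_pct are made up for this illustration; the
numbers are copied from the table and nothing else is measured here.

```python
# Median "Total time" values in seconds, copied from the table quoted above.
BASELINE = 104.873  # no perf-stat running

MEDIANS = {
    "perf stat -a": 107.887,
    "perf stat -a --bpf-counters": 106.071,
    "perf stat": 106.314,
    "perf stat --bpf-counters": 105.965,
    "perf stat -C 1,3,5": 107.063,
    "perf stat -C 1,3,5 --bpf-counters": 106.406,
}


def overhead_pct(total_time, baseline=BASELINE):
    """Return the slowdown relative to the baseline, in percent."""
    return (total_time - baseline) / baseline * 100.0


# Overhead of each perf-stat mode over the no-perf-stat baseline.
overheads = {cmd: overhead_pct(t) for cmd, t in MEDIANS.items()}

if __name__ == "__main__":
    for cmd, pct in overheads.items():
        print(f"{cmd:<36} {pct:5.2f}%")
```

In each of the three modes the --bpf-counters run ends up closer to the
baseline than the plain perf-stat run, which is what I would expect if the
difference comes from time multiplexing. With one median per condition
the gaps are small, so take the percentages as indicative only.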

Thanks,
Song