Re: [PATCH v6] perf: Sharing PMU counters across compatible events

From: Song Liu
Date: Tue Nov 05 2019 - 12:11:27 EST



Hi Peter,

> On Oct 31, 2019, at 9:29 AM, Song Liu <songliubraving@xxxxxx> wrote:
>
>> On Oct 31, 2019, at 5:43 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>
>> On Wed, Sep 18, 2019 at 10:23:14PM -0700, Song Liu wrote:
>>> This patch tries to enable PMU sharing. To make perf event scheduling
>>> fast, we use special data structures.
>>>
>>> An array of "struct perf_event_dup" is added to the perf_event_context,
>>> to remember all the duplicated events under this ctx. All the events
>>> under this ctx have a "dup_id" pointing to their perf_event_dup. Compatible
>>> events under the same ctx share the same perf_event_dup. The following
>>> figure shows a simplified version of the data structure.
>>>
>>>       ctx ->  perf_event_dup -> master
>>>                      ^
>>>                      |
>>>          perf_event /|
>>>                      |
>>>          perf_event /
>>>
>>> Connections between perf_event and perf_event_dup are built when events are
>>> added or removed from the ctx. So these are not on the critical path of
>>> schedule or perf_rotate_context().
>>>
>>> On the critical paths (add, del, read), sharing PMU counters doesn't
>>> increase the complexity. Helper functions event_pmu_[add|del|read]() are
>>> introduced to cover these cases. All these functions have O(1) time
>>> complexity.
>>>
>>> We allocate a separate perf_event for perf_event_dup->master. This needs
>>> extra attention, because perf_event_alloc() may sleep. To allocate the
>>> master event properly, a new pointer, tmp_master, is added to perf_event.
>>> tmp_master carries a separate perf_event into list_[add|del]_event().
>>> The master event has valid ->ctx and holds ctx->refcount.
>>
>> That is really nasty and expensive; it basically means every !sampling
>> event carries a double allocate.
>>
>> Why can't we use one of the actual events as master?
>
> I think we can use one of the events as master. We need to be careful when
> the master event is removed, but it should be doable. Let me try.

Actually, there is a bigger issue when we use one event as the master: what
should we do if the master event is not running? Say it is a cgroup event,
and the cgroup is not running on this cpu. An extra master (and all these
array hacks) helps us keep O(1) complexity in such a scenario.
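
To make the O(1) argument concrete, here is a rough user-space sketch. It is
not the patch code: the dup_model/event_model structures and the model_*()
helpers are hypothetical names made up for illustration, and model_tick()
only simulates a hardware counter. The point it tries to show is that with a
dedicated master, each event only keeps a snapshot of the master count, so
add, del and read stay constant time and no event depends on another real
event being scheduled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a shared counter; not the structures from the patch. */
struct dup_model {
	uint64_t master_count;	/* value of the shared (master) counter */
	int active;		/* number of events currently attached */
};

struct event_model {
	struct dup_model *dup;	/* shared counter this event reads from */
	uint64_t base;		/* master_count at the last sync */
	uint64_t count;		/* this event's own accumulated count */
	int attached;		/* non-zero while the event is scheduled in */
};

/* Simulated hardware: the master only counts while some event is attached. */
static void model_tick(struct dup_model *dup, uint64_t n)
{
	if (dup->active > 0)
		dup->master_count += n;
}

/* O(1): snapshot the master count and attach. */
static void model_event_add(struct event_model *e)
{
	e->base = e->dup->master_count;
	e->attached = 1;
	e->dup->active++;
}

/* O(1): fold the delta since the last sync into the event's own count. */
static void model_event_read(struct event_model *e)
{
	uint64_t now;

	if (!e->attached)
		return;
	now = e->dup->master_count;
	e->count += now - e->base;
	e->base = now;
}

/* O(1): final sync, then detach; the master stops when nobody is left. */
static void model_event_del(struct event_model *e)
{
	model_event_read(e);
	e->attached = 0;
	e->dup->active--;
}
```

In this model, removing one event never forces a scan of the others, because
no event's count is derived from another event. If one of the real events
were the master instead, something would have to take over whenever that
event is scheduled out.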

Do you have suggestions on how to solve this problem? Maybe we can keep the
extra master, and try to get rid of the double alloc?

Thanks,
Song