Re: [PATCH v6] perf: Sharing PMU counters across compatible events

From: Song Liu
Date: Wed Nov 06 2019 - 12:40:45 EST

> On Nov 6, 2019, at 1:14 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Nov 05, 2019 at 11:06:06PM +0000, Song Liu wrote:
>>
>>
>>> On Nov 5, 2019, at 12:16 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>
>>> On Tue, Nov 05, 2019 at 05:11:08PM +0000, Song Liu wrote:
>>>
>>>>> I think we can use one of the event as master. We need to be careful when
>>>>> the master event is removed, but it should be doable. Let me try.
>>>>
>>>> Actually, there is a bigger issue when we use one event as the master: what
>>>> shall we do if the master event is not running? Say it is an cgroup event,
>>>> and the cgroup is not running on this cpu. An extra master (and all these
>>>> array hacks) help us get O(1) complexity in such scenario.
>>>>
>>>> Do you have suggestions on how to solve this problem? Maybe we can keep the
>>>> extra master, and try get rid of the double alloc?
>>>
>>> Right, you have to consider scope when sharing. The master should be the
>>> largest scope event and any slaves should be complete subsets.
>>>
>>> Without much thought this seems a fairly straight forward constraint;
>>> that is, given cgroups I'm not immediately seeing how we can violate
>>> that.
>>>
>>> Basically, pick the cgroup event nearest to the root as the master.
>>> We have to have logic to re-elect the master anyway for deletion, so
>>> changing it on add shouldn't be different.
>>>
>>> (obviously the root-cgroup is cpu-wide and always on, and if you have
>>> two events from disjoint subtrees they have no overlap, so it doesn't
>>> make sense to share anyway)
>>
>> Hmm... I didn't think about cgroup structure with this much detail. And
>> this is very interesting idea.
>>
>> OTOH, non-cgroup event could also be inactive. For example, when we have
>> to rotate events, we may schedule slave before master.
>
> Right, although I suppose in that case you can do what you did in your
> patch here. If someone did IOC_DISABLE on the master, we'd have to
> re-elect a master -- obviously (and IOC_ENABLE).

Re-electing the master on IOC_DISABLE is good. But we still need to handle
ctx rotation. Otherwise, we need to keep the master on at all times.

>
>> And if the master is in an event group, it will be more complicated...
>
> Hurmph, do you actually have that use-case? And yes, this one is tricky.
>
> Would it be sufficient if we disallow group events to be master (but
> allow them to be slaves) ?

Maybe we can solve this with an extra "first_active" pointer in perf_event.
first_active points to the first event that is added by event_pmu_add().
Then we need something like:

event_pmu_add(event)
{
	if (event->dup_master->first_active) {
		/* sync with first_active */
	} else {
		/* this event will be the first_active */
		event->dup_master->first_active = event;
		pmu->add(event);
	}
}

However, I just realized the event_pmu_del() path needs some more thought,
because first_active is likely to be the first one to get sched_out().
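
To make the del side concrete, here is a stand-alone toy model (user-space
C, not kernel code). Everything prefixed toy_ is made up for illustration,
and the hw_adds/hw_dels counters only stand in for pmu->add()/pmu->del().
It assumes the counter can simply be handed to another active member when
first_active goes away, which is probably the hard part in the real code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the real kernel structures. */
struct toy_master;

struct toy_event {
	struct toy_master *dup_master;	/* shared master, as in the patch */
	int active;			/* 1 while scheduled in */
};

struct toy_master {
	struct toy_event *first_active;	/* event that owns the counter */
	struct toy_event *members[4];	/* all events sharing this master */
	int nr_members;
	int hw_adds;			/* stands in for pmu->add() calls */
	int hw_dels;			/* stands in for pmu->del() calls */
};

static void toy_event_pmu_add(struct toy_event *event)
{
	struct toy_master *m = event->dup_master;

	event->active = 1;
	if (m->first_active)
		return;			/* counter already running, sync only */

	/* this event will be the first_active */
	m->first_active = event;
	m->hw_adds++;
}

static void toy_event_pmu_del(struct toy_event *event)
{
	struct toy_master *m = event->dup_master;
	int i;

	event->active = 0;
	if (m->first_active != event)
		return;			/* another event owns the counter */

	/* Hand the counter to any other active member, if there is one. */
	for (i = 0; i < m->nr_members; i++) {
		if (m->members[i]->active) {
			m->first_active = m->members[i];
			return;
		}
	}

	/* Nobody left: only now would the counter really be removed. */
	m->first_active = NULL;
	m->hw_dels++;
}
```

Note the scan in toy_event_pmu_del() is O(n) in the number of sharing
events, so this alone does not keep the O(1) property we want.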

Merging another email here:

>> If we do GFP_ATOMIC in perf_event_alloc(), maybe with an extra option, we
>> don't need the tmp_master hack. So we only allocate master when we will
>> use it.
>
> You can't, that's broken on -RT. ctx->lock is a raw_spinlock_t and
> allocator locks are spinlock_t.

How about we add another step in __perf_install_in_context(), like

__perf_install_in_context()
{
	bool alloc_master;

	perf_ctx_lock();
	alloc_master = find_new_sharing(event, ctx);
	perf_ctx_unlock();

	if (alloc_master)
		event->dup_master = perf_event_alloc();

	/* existing logic of __perf_install_in_context() */
}

This way, we allocate the master event only when necessary, and the
allocation happens outside of the locks.
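
One thing this needs is a recheck after the allocation, because the ctx
can change while the lock is dropped. A stand-alone toy model of the
pattern (user-space C, not kernel code; all toy_ names and fields are
made up, the "locked" flag stands in for ctx->lock, and calloc() stands
in for perf_event_alloc()):

```c
#include <assert.h>
#include <stdlib.h>

struct toy_ctx {
	int locked;		/* stands in for ctx->lock being held */
	int want_sharing;	/* what find_new_sharing() would report */
	void *master;		/* stands in for the shared master event */
};

static void toy_ctx_lock(struct toy_ctx *ctx)
{
	assert(!ctx->locked);
	ctx->locked = 1;
}

static void toy_ctx_unlock(struct toy_ctx *ctx)
{
	assert(ctx->locked);
	ctx->locked = 0;
}

static void *toy_alloc_master(struct toy_ctx *ctx)
{
	assert(!ctx->locked);	/* never allocate with the lock held */
	return calloc(1, 64);
}

/* Returns 0 on success, -1 if the allocation failed. */
static int toy_install(struct toy_ctx *ctx)
{
	void *new_master = NULL;
	int need;

	/* Step 1: decide under the lock whether a master is needed. */
	toy_ctx_lock(ctx);
	need = ctx->want_sharing && !ctx->master;
	toy_ctx_unlock(ctx);

	/* Step 2: allocate with no lock held. */
	if (need) {
		new_master = toy_alloc_master(ctx);
		if (!new_master)
			return -1;
	}

	/* Step 3: recheck under the lock before publishing the master. */
	toy_ctx_lock(ctx);
	if (new_master && ctx->want_sharing && !ctx->master) {
		ctx->master = new_master;
		new_master = NULL;
	}
	/* existing install logic would go here */
	toy_ctx_unlock(ctx);

	free(new_master);	/* unused allocation, free(NULL) is a no-op */
	return 0;
}
```

If the recheck fails we throw the allocation away, so the worst case is
one wasted alloc/free pair.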

Thanks,
Song