Re: [PATCH v6] perf: Sharing PMU counters across compatible events

From: Song Liu
Date: Wed Nov 06 2019 - 17:23:43 EST




> On Nov 6, 2019, at 12:44 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Nov 06, 2019 at 05:40:29PM +0000, Song Liu wrote:
>>> On Nov 6, 2019, at 1:14 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
>>>> OTOH, non-cgroup event could also be inactive. For example, when we have
>>>> to rotate events, we may schedule slave before master.
>>>
>>> Right, although I suppose in that case you can do what you did in your
>>> patch here. If someone did IOC_DISABLE on the master, we'd have to
>>> re-elect a master -- obviously (and IOC_ENABLE).
>>
>> Re-electing the master on IOC_DISABLE is good. But we still need to
>> handle ctx rotation; otherwise we would need to keep the master on at
>> all times.
>
> I meant to say that for the rotation case we can do as you did here: if
> we do add() on a slave, add the master if it wasn't add()'ed yet.

Maybe an "add-but-don't-count" state would solve this, even with event groups?
Say "PERF_EVENT_STATE_ACTIVE_NOT_COUNTING". Let me think more about it.

>
>>>> And if the master is in an event group, it will be more complicated...
>>>
>>> Hurmph, do you actually have that use-case? And yes, this one is tricky.
>>>
>>> Would it be sufficient if we disallow group events to be master (but
>>> allow them to be slaves) ?
>>
>> Maybe we can solve this with an extra "first_active" pointer in perf_event.
>> first_active points to the first event that is added by event_pmu_add().
>> Then we need something like:
>>
>> event_pmu_add(event)
>> {
>> 	if (event->dup_master->first_active) {
>> 		/* sync with first_active */
>> 	} else {
>> 		/* this event will be the first_active */
>> 		event->dup_master->first_active = event;
>> 		pmu->add(event);
>> 	}
>> }
>
> I'm confused on what exactly you're trying to solve with the
> first_active thing. The problem with the group event as master is that
> you then _must_ schedule the whole group, which is obviously difficult.

With first_active, we are not required to schedule the master. A slave
could be the first_active, and other slaves could read data from it.

For group event use cases, I think only allowing non-group events to be
the master would be a good start.

>
>>>> If we do GFP_ATOMIC in perf_event_alloc(), maybe with an extra option, we
>>>> don't need the tmp_master hack. So we only allocate master when we will
>>>> use it.
>>>
>>> You can't, that's broken on -RT. ctx->lock is a raw_spinlock_t and
>>> allocator locks are spinlock_t.
>>
>> How about we add another step in __perf_install_in_context(), like
>>
>> __perf_install_in_context()
>> {
>> 	bool alloc_master;
>>
>> 	perf_ctx_lock();
>> 	alloc_master = find_new_sharing(event, ctx);
>> 	perf_ctx_unlock();
>>
>> 	if (alloc_master)
>> 		event->dup_master = perf_event_alloc();
>>
>> 	/* existing logic of __perf_install_in_context() */
>> }
>>
>> In this way, we only allocate the master event when necessary, and the
>> allocation happens outside of the locks.
>
> It's still broken on -RT, because __perf_install_in_context() is in
> hardirq context (IPI) and the allocator locks are spinlock_t.

Hmm... how about perf_install_in_context()? Something like:

diff --git i/kernel/events/core.c w/kernel/events/core.c
index e8bec0823763..f55a7a8b9de4 100644
--- i/kernel/events/core.c
+++ w/kernel/events/core.c
@@ -2860,6 +2860,13 @@ perf_install_in_context(struct perf_event_context *ctx,
 	 */
 	smp_store_release(&event->ctx, ctx);
 
+	raw_spin_lock_irq(&ctx->lock);
+	alloc_master = find_new_sharing(event, ctx);
+	raw_spin_unlock_irq(&ctx->lock);
+
+	if (alloc_master)
+		event->dup_master = perf_event_alloc(xxx);
+

If this works, we won't need PERF_EVENT_STATE_ACTIVE_NOT_COUNTING.

Thanks,
Song