Re: [PATCH v13] perf: Sharing PMU counters across compatible events

From: Song Liu
Date: Thu Jun 18 2020 - 16:12:24 EST




> On Jun 18, 2020, at 2:09 AM, Athira Rajeev <atrajeev@xxxxxxxxxxxxxxxxxx> wrote:
>
>
>
>> On 02-May-2020, at 5:51 AM, Song Liu <songliubraving@xxxxxx> wrote:
>>
>> This patch tries to enable PMU sharing. When multiple perf_events are
>> counting the same metric, they can share the hardware PMU counter. We
>> call these events as "compatible events".
>>
>> The PMU sharing are limited to events within the same perf_event_context
>> (ctx). When a event is installed or enabled, search the ctx for compatible
>> events. This is implemented in perf_event_setup_dup(). One of these
>> compatible events are picked as the master (stored in event->dup_master).
>> Similarly, when the event is removed or disabled, perf_event_remove_dup()
>> is used to clean up sharing.
>>
>> If the master event is a cgroup event or a event in a group, it is
>> possible that some slave events are ACTIVE, but the master event is not.
>> To handle this scenario, we introduced PERF_EVENT_STATE_ENABLED. Also,
>> since PMU drivers write into event->count, master event needs another
>> variable (master_count) for the reading of this event.
>>
>> On the critical paths (add, del read), sharing PMU counters doesn't
>> increase the complexity. Helper functions event_pmu_[add|del|read]() are
>> introduced to cover these cases. All these functions have O(1) time
>> complexity.
>>
>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
>> Cc: Jiri Olsa <jolsa@xxxxxxxxxx>
>> Cc: Alexey Budankov <alexey.budankov@xxxxxxxxxxxxxxx>
>> Cc: Namhyung Kim <namhyung@xxxxxxxxxx>
>> Cc: Tejun Heo <tj@xxxxxxxxxx>
>> Reported-by: kernel test robot <rong.a.chen@xxxxxxxxx>
>> Signed-off-by: Song Liu <songliubraving@xxxxxx>
>>
>> ---
>> Changes in v13:
>> Fix memory ordering in perf_event_exit_dup_master and perf_event_count.
>> Remove test code in perf_event_can_share().
>>
>> Changes in v12:
>> Add perf_event->dup_active_list to fix rdpmc. (kernel test robot)
>> Fix event reset. (kernel test robot)
>> Do not consider sharing when the perf event is disabled.
>>
>> Changes in v11:
>> Fold in major fixes by Peter.
>> In perf_event_remove_dup(), del() old master before add() new master.
>>
>> Changes in v10:
>> Simplify logic that calls perf_event_setup_dup() and
>> perf_event_remove_dup(). (Peter)
>> Other small fixes. (Peter)
>>
>> Changes in v9:
>> Avoid ctx_resched() on remove/disable event (Peter).
>> Compare the whole perf_event_attr in perf_event_compatible().
>> Small fixes/improvements (Peter).
>>
>> Changes in v8:
>> Fix issues with task event (Jiri).
>> Fix issues with event inherit.
>> Fix mmap'ed events, i.e. perf test 4 (kernel test bot).
>>
>> Changes in v7:
>> Major rewrite to avoid allocating extra master event.
>> (cherry picked from commit 8b98afc0ec100b017a5e80c184809939d4f01f8d)
>>
>> fix perf_event_count()
>> ---
>> arch/x86/events/core.c | 7 +
>> include/linux/perf_event.h | 22 +-
>> kernel/events/core.c | 422 ++++++++++++++++++++++++++++++++++---
>> 3 files changed, 416 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 9e63ee50b19a..a245cac31a19 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -2345,6 +2345,11 @@ static int x86_pmu_aux_output_match(struct perf_event *event)
>> return 0;
>> }
>>
>> +static void x86_copy_hw_config(struct perf_event *old, struct perf_event *new)
>> +{
>> + new->hw.idx = old->hw.idx;
>> +}
>> +
>> static struct pmu pmu = {
>> .pmu_enable = x86_pmu_enable,
>> .pmu_disable = x86_pmu_disable,
>> @@ -2373,6 +2378,8 @@ static struct pmu pmu = {
>> .check_period = x86_pmu_check_period,
>>
>> .aux_output_match = x86_pmu_aux_output_match,
>> +
>> + .copy_hw_config = x86_copy_hw_config,
>> };
>>
>> void arch_perf_update_userpage(struct perf_event *event,
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 87e21681759c..1306fbd920fd 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -503,6 +503,13 @@ struct pmu {
>> * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
>> */
>> int (*check_period) (struct perf_event *event, u64 value); /* optional */
>> +
>> + /*
>> + * Copy hw configuration from one event to another. This is used
>> + * to make switching master faster in PMC sharing.
>> + */
>> + void (*copy_hw_config) (struct perf_event *old,
>> + struct perf_event *new); /* optional */
>> };
>>
>> enum perf_addr_filter_action_t {
>> @@ -552,6 +559,10 @@ struct perf_addr_filter_range {
>>
>> /**
>> * enum perf_event_state - the states of an event:
>> + *
>> + * PERF_EVENT_STATE_ENABLED: Special state for PMC sharing: the hw PMC
>> + * is enabled, but this event is not counting.
>> + * See perf_event_init_dup_master().
>> */
>> enum perf_event_state {
>> PERF_EVENT_STATE_DEAD = -4,
>> @@ -559,7 +570,8 @@ enum perf_event_state {
>> PERF_EVENT_STATE_ERROR = -2,
>> PERF_EVENT_STATE_OFF = -1,
>> PERF_EVENT_STATE_INACTIVE = 0,
>> - PERF_EVENT_STATE_ACTIVE = 1,
>> + PERF_EVENT_STATE_ENABLED = 1,
>> + PERF_EVENT_STATE_ACTIVE = 2,
>> };
>>
>> struct file;
>> @@ -645,6 +657,7 @@ struct perf_event {
>> int group_caps;
>>
>> struct perf_event *group_leader;
>> + struct perf_event *dup_master; /* for PMU sharing */
>> struct pmu *pmu;
>> void *pmu_private;
>>
>> @@ -762,6 +775,13 @@ struct perf_event {
>> void *security;
>> #endif
>> struct list_head sb_list;
>> +
>> + int dup_active;
>> + /* See event_pmu_read_dup() */
>> + local64_t dup_count;
>> + /* See perf_event_init_dup_master() */
>> + local64_t master_count;
>> + struct list_head dup_active_list;
>> #endif /* CONFIG_PERF_EVENTS */
>> };
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 206dd29c3c6c..40c8b15407b7 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -1002,7 +1002,6 @@ perf_cgroup_event_enable(struct perf_event *event, struct perf_event_context *ct
>> */
>> cpuctx = container_of(ctx, struct perf_cpu_context, ctx);
>>
>> - pr_info("%s cpu %d\n", __func__, smp_processor_id());
>> /*
>> * Since setting cpuctx->cgrp is conditional on the current @cgrp
>> * matching the event's cgroup, we must do this for every new event,
>> @@ -1012,8 +1011,6 @@ perf_cgroup_event_enable(struct perf_event *event, struct perf_event_context *ct
>> if (ctx->is_active && !cpuctx->cgrp) {
>> struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
>>
>> - pr_info("%s cpu %d cgrp %px\n", __func__, smp_processor_id(),
>> - cgrp);
>
> Hi Song Liu,
>
> I had problem in trying to apply this patch on latest upstream. Example, the `pr_info` in perf_cgroup_event_enable. Looks like there were some local changes on top of which this patch was created. Could you please confirm to make sure there are no other changes missing.

Sorry for the confusion. You are right that those two pr_info() are from a local
commit. I pushed latest version to

https://git.kernel.org/pub/scm/linux/kernel/git/song/linux.git/log/?h=pmu-share

Please let me know if you have problem with it.

Thanks,
Song