Re: [PATCH] perf/core: Optimize event reschedule for a PMU
From: Namhyung Kim
Date: Tue Aug 06 2024 - 02:26:13 EST

Hi Mingwei,

On Mon, Aug 5, 2024 at 9:57 AM Mingwei Zhang <mizhang@xxxxxxxxxx> wrote:
>
> On Tue, Jul 30, 2024 at 12:19 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> >
> > Current ctx_resched() reschedules every event in all PMUs in the
> > context even if it only needs to do it for a single event. This is the
> > case when it opens a new event or enables an existing one. What we
> > want is to reschedule events in the PMU only. Also perf_pmu_resched()
> > currently calls ctx_resched() without PMU information.
> >
> > Let's add __perf_pmu_resched() to do the work for the given PMU only.
> > The context time should be updated by ctx_sched_{out,in}(EVENT_TIME)
> > outside of it. Also change __pmu_ctx_sched_in() to be symmetrical
> > to the _sched_out() for its arguments so that it can be called easily
> > in the __perf_pmu_resched().
> >
> > Note that __perf_install_in_context() should call ctx_resched() for the
> > very first event in the context in order to set ctx->is_active. Later
> > events can be handled by __perf_pmu_resched().
> >
> > Care should be taken when it installs a task event for a PMU and
> > there's no CPU event for the PMU. __perf_pmu_resched() will ask the
> > CPU PMU context to schedule any events in it according to the group
> > info. But as the PMU context was not activated, it didn't set the
> > event context pointer. So I added new NULL checks in the
> > __pmu_ctx_sched_{in,out}.
> >
> > With this change I get a 4x speedup (though it is actually proportional
> > to the number of uncore PMU events) on a 2-socket Intel EMR machine in
> > opening and closing a perf event for the core PMU in a loop while there
> > are a bunch of uncore PMU events active on the CPU. The test code
> > (stress-pmu) follows below.
> >
> > Before)
> > # ./stress-pmu
> > delta: 0.087068 sec (870 usec/op)
>
> Hi Namhyung,
>
> I wonder how I could test the performance boost in a virtualized
> environment. I assume this improves performance by reducing the
> number of wrmsrs to event selectors and counters?

Right.

>
> I wonder if I need to run multiple instances of stress-pmu to increase
> the number of PMU context switches?

Yep, I think that would work. Basically, anything that opens more events
in different PMUs should do. But make sure the vcpu thread is running on
the affected CPU (60 in my test).
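
For illustration, here is a minimal sketch of an open/close loop of the
kind described above. It is not the stress-pmu program from the patch.
The helper names (usec_per_op, open_close_loop), the choice of a
CPU-wide cycles event (pid = -1), and the iteration count are all
assumptions made for this example. Opening a CPU-wide event usually
needs CAP_PERFMON or a permissive perf_event_paranoid setting, so the
loop reports failure instead of aborting.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* glibc has no wrapper for perf_event_open(2), so go through syscall(2). */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			   int group_fd, unsigned long flags)
{
	return (int)syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Average cost of one open+close iteration, in microseconds. */
static double usec_per_op(double delta_sec, int iters)
{
	assert(iters > 0);
	return delta_sec * 1e6 / iters;
}

/*
 * Hypothetical helper: open and close a CPU-wide cycles event on @cpu
 * @iters times.  Returns 0 and stores the elapsed seconds in *delta_sec,
 * or -1 if perf_event_open() fails (e.g. insufficient privileges or a
 * CPU number that does not exist).
 */
static int open_close_loop(int cpu, int iters, double *delta_sec)
{
	struct perf_event_attr attr;
	struct timespec t0, t1;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;		/* core PMU event (assumed) */
	attr.config = PERF_COUNT_HW_CPU_CYCLES;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++) {
		/* pid = -1, cpu = @cpu: count everything on that CPU */
		int fd = perf_event_open(&attr, -1, cpu, -1, 0);

		if (fd < 0)
			return -1;
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	*delta_sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	return 0;
}
```

A caller would run open_close_loop(60, iters, &delta) for the CPU
mentioned above and print usec_per_op(delta, iters). As a sanity check
on the arithmetic, a delta of 0.087068 sec over an assumed 100
iterations works out to 870.68 usec/op.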

Thanks,
Namhyung