Re: [RFC][PATCH] perf: Rewrite core context handling

From: Alexey Budankov
Date: Tue Oct 16 2018 - 02:39:11 EST


Hi,

On 15.10.2018 21:31, Stephane Eranian wrote:
> Hi,
>
> On Mon, Oct 15, 2018 at 10:29 AM Alexey Budankov
> <alexey.budankov@xxxxxxxxxxxxxxx> wrote:
>>
>>
>> Hi,
>> On 15.10.2018 11:34, Peter Zijlstra wrote:
>>> On Mon, Oct 15, 2018 at 10:26:06AM +0300, Alexey Budankov wrote:
>>>> Hi,
>>>>
>>>> On 10.10.2018 13:45, Peter Zijlstra wrote:
>>>>> Hi all,
>>>>>
>>>>> There have been various issues and limitations with the way perf uses
>>>>> (task) contexts to track events. Most notable is the single hardware PMU
>>>>> task context, which has resulted in a number of yucky things (both
>>>>> proposed and merged).
>>>>>
>>>>> Notably:
>>>>>
>>>>> - HW breakpoint PMU
>>>>> - ARM big.little PMU
>>>>> - Intel Branch Monitoring PMU
>>>>>
>>>>> Since we now track the events in RB trees, we can 'simply' add a pmu
>>>>> order to them and have them grouped that way, reducing to a single
>>>>> context. Of course, reality never quite works out that simple, and below
>>>>> ends up adding an intermediate data structure to bridge the context ->
>>>>> pmu mapping.
>>>>>
>>>>> Something a little like:
>>>>>
>>>>>          ,-----------------------[1:n]------------------------.
>>>>>          V                                                     V
>>>>> perf_event_context <-[1:n]-> perf_event_pmu_context <--- perf_event
>>>>>          ^                       ^    |                        |
>>>>>          `---------[1:n]---------'    `-[n:1]-> pmu <-[1:n]---'
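
(Just to check that I read the diagram right: the relations seem to boil down
to something like the sketch below. The field names are illustrative only and
not taken from the patch.)

struct perf_event_pmu_context {
	struct pmu			*pmu;		/* [n:1] several pmu contexts may use one pmu */
	struct perf_event_context	*ctx;		/* back-pointer to the owning context */
	struct list_head		pmu_ctx_entry;	/* link in the context's [1:n] list */
	/* per-pmu bookkeeping: nr_events, nr_active, active lists, ... */
};

struct perf_event_context {
	struct list_head		pmu_ctx_list;	/* [1:n] pmu contexts of this context */
	/* event groups of all pmus, ordered so one pmu's events are adjacent */
};

struct perf_event {
	struct perf_event_pmu_context	*pmu_ctx;	/* [n:1] bridge from the event to its pmu */
	struct perf_event_context	*ctx;		/* [1:n] context the event belongs to */
	/* ... */
};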
>>>>>
>>>>> This patch builds (provided you disable CGROUP_PERF), boots and survives
>>>>> perf-top without the machine catching fire.
>>>>>
>>>>> There's still a fair bit of loose ends (look for XXX), but I think this
>>>>> is the direction we should be going.
>>>>>
>>>>> Comments?
>>>>>
>>>>> Not-Quite-Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>>>>> ---
>>>>> arch/powerpc/perf/core-book3s.c |    4
>>>>> arch/x86/events/core.c          |    4
>>>>> arch/x86/events/intel/core.c    |    6
>>>>> arch/x86/events/intel/ds.c      |    6
>>>>> arch/x86/events/intel/lbr.c     |   16
>>>>> arch/x86/events/perf_event.h    |    6
>>>>> include/linux/perf_event.h      |   80 +-
>>>>> include/linux/sched.h           |    2
>>>>> kernel/events/core.c            | 1412 ++++++++++++++++++++--------------------
>>>>> 9 files changed, 815 insertions(+), 721 deletions(-)
>>>>
>>>> The rewrite is impressive; however, as it stands it doesn't result in a reduction of the code base.
>>>
>>> Yeah.. that seems to be nature of these things ..
>>>
>>>> Nonetheless, there is a clear demand for per-PMU event group tracking and rotation
>>>> within a single CPU context (HW breakpoints, ARM big.little, Intel LBRs), and the
>>>> group ordering on the RB-tree supplies the means to implement it.
>>>>
>>>> This might be driven into the kernel either by new perf features built on top of
>>>> that RB-tree group ordering, or by refactoring the existing code, but in a way
>>>> that results in an overall code base reduction and thus lowers the support cost.
>>>
>>> Do you have a concrete suggestion on how to reduce the complexity? I tried,
>>> but couldn't find any (without breaking something).
>>
>> Could some of those PMUs (HW breakpoints, ARM big.little, Intel LBRs)
>> or other perf-related code be adjusted now so that the overall subsystem
>> code base would shrink?
>>
> I have always had a hard time understanding the role of all these structs in
> the generic code. This is still very confusing and very hard to follow.
>
> In my mind, you have per-task and per-cpu perf_events contexts.
> And for each you can have multiple PMUs, some hw, some sw.
> Each PMU has its own list of events maintained in an RB tree. There are
> never any interactions between PMUs.
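
As far as I understand the proposal, the events of all PMUs of a context share
a single group tree, with the pmu made part of the sort key so that one pmu's
events stay adjacent and can still be iterated per pmu. Roughly like the sketch
below (illustrative only, this is not the comparator from the patch):

static int group_key_cmp(const struct perf_event *a,
			 const struct perf_event *b)
{
	/* order events first by cpu ... */
	if (a->cpu != b->cpu)
		return a->cpu < b->cpu ? -1 : 1;

	/* ... then by pmu, so one pmu's groups are contiguous ... */
	if (a->pmu != b->pmu)
		return a->pmu < b->pmu ? -1 : 1;

	/* ... then by insertion order, for fair rotation */
	if (a->group_index != b->group_index)
		return a->group_index < b->group_index ? -1 : 1;

	return 0;
}
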
>
> Maybe this is how it is done or proposed in your patches, but it
> certainly is not obvious.
>
> Also, the Intel LBR is not a PMU on its own. Maybe you are talking about
> the BTS in arch/x86/events/intel/bts.c.

I am referring to the Intel Branch Monitoring PMU mentioned in the description.
Thanks for the correction.

- Alexey
>
>
>>>
>>> The active lists and pmu_ctx_list could arguably be replaced with
>>> (slower) iterations over the RB tree, but you'll still need the per-pmu
>>> nr_events/nr_active counts to determine if rotation is required at all.
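
With those per-pmu counters in place the "does this pmu need rotation at all"
check indeed stays O(1); a rough sketch (hypothetical field and helper names,
not taken from the patch):

static bool pmu_ctx_needs_rotation(struct perf_event_pmu_context *epc)
{
	/*
	 * Rotation only helps when this pmu has more events than it was
	 * able to schedule, i.e. some of its groups are currently inactive.
	 */
	return epc->nr_events > epc->nr_active;
}
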
>>>
>>> And as you know, performance is quite important here too. I'd love to
>>> reduce complexity while maintaining or improving performance, but that
>>> rarely if ever happens :/
>>>
>