Re: [PATCH 7/8] perf/hw_breakpoint: Optimize task_bp_pinned() if CPU-independent

From: Dmitry Vyukov
Date: Fri Jun 10 2022 - 05:15:26 EST


On Fri, 10 Jun 2022 at 10:25, Marco Elver <elver@xxxxxxxxxx> wrote:
>
> On Thu, 9 Jun 2022 at 17:00, 'Dmitry Vyukov' via kasan-dev
> <kasan-dev@xxxxxxxxxxxxxxxx> wrote:
> >
> > On Thu, 9 Jun 2022 at 13:31, Marco Elver <elver@xxxxxxxxxx> wrote:
> > >
> > > Running the perf benchmark with (note: more aggressive parameters vs.
> > > preceding changes, but same host with 256 CPUs):
> > >
> > > | $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
> > > | # Running 'breakpoint/thread' benchmark:
> > > | # Created/joined 100 threads with 4 breakpoints and 128 parallelism
> > > | Total time: 1.953 [sec]
> > > |
> > > | 38.146289 usecs/op
> > > | 4882.725000 usecs/op/cpu
> > >
> > > 16.29% [kernel] [k] rhashtable_jhash2
> > > 16.19% [kernel] [k] osq_lock
> > > 14.22% [kernel] [k] queued_spin_lock_slowpath
> > > 8.58% [kernel] [k] task_bp_pinned
> > > 8.30% [kernel] [k] mutex_spin_on_owner
> > > 4.03% [kernel] [k] smp_cfm_core_cond
> > > 2.97% [kernel] [k] toggle_bp_slot
> > > 2.94% [kernel] [k] bcmp
> > >
> > > We can see that a majority of the time is now spent hashing task
> > > pointers to index into task_bps_ht in task_bp_pinned().
> > >
> > > However, if task_bp_pinned()'s computation is independent of any CPU,
> > > i.e. always `iter->cpu < 0`, the result for each invocation will be
> > > identical. With increasing CPU counts, this redundant recomputation worsens.
> > >
> > > Instead, identify if every call to task_bp_pinned() is CPU-independent,
> > > and cache the result. Use the cached result instead of a call to
> > > task_bp_pinned(), now __task_bp_pinned(), with task_bp_pinned() deciding
> > > if the cached result can be used.
> > >
> > > After this optimization:
> > >
> > > 21.96% [kernel] [k] queued_spin_lock_slowpath
> > > 16.39% [kernel] [k] osq_lock
> > > 9.82% [kernel] [k] toggle_bp_slot
> > > 9.81% [kernel] [k] find_next_bit
> > > 4.93% [kernel] [k] mutex_spin_on_owner
> > > 4.71% [kernel] [k] smp_cfm_core_cond
> > > 4.30% [kernel] [k] __reserve_bp_slot
> > > 2.65% [kernel] [k] cpumask_next
> > >
> > > This shows that the time spent hashing task pointers has become insignificant.
> > >
> > > With the given benchmark parameters, however, we see no statistically
> > > significant improvement in performance on the test system with 256 CPUs.
> > > This is very likely due to the benchmark parameters being too aggressive
> > > and contention elsewhere becoming dominant.
> > >
> > > Indeed, when using the less aggressive parameters from the preceding
> > > changes, we now observe:
> > >
> > > | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
> > > | # Running 'breakpoint/thread' benchmark:
> > > | # Created/joined 30 threads with 4 breakpoints and 64 parallelism
> > > | Total time: 0.071 [sec]
> > > |
> > > | 37.134896 usecs/op
> > > | 2376.633333 usecs/op/cpu
> > >
> > > This is an improvement of 12% compared to the same benchmark without
> > > this optimization (baseline is 42 usecs/op). It is now only 5% slower
> > > than the theoretical ideal (constraints disabled), and 18% slower than
> > > having no breakpoints at all.
> > >
> > > [ While we're here, swap task_bp_pinned()'s bp and cpu arguments to be
> > > more consistent with other functions (which have bp first, before the
> > > cpu argument). ]
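
To make the later references to cached_tbp_pinned concrete: as I read
the description, the caching has roughly this shape (sketch only, not
the actual diff; where cached_tbp_pinned lives and how it is
invalidated is glossed over here, a negative value simply stands in
for "cache not valid"):

        static int task_bp_pinned(struct perf_event *bp, int cpu,
                                  enum bp_type_idx type)
        {
                /*
                 * If all of the task's breakpoints are CPU-independent
                 * (cpu == -1), the count is identical for every CPU, so the
                 * cached value can be reused.
                 */
                if (cached_tbp_pinned >= 0)
                        return cached_tbp_pinned;

                /* Otherwise fall back to the real per-CPU computation. */
                return __task_bp_pinned(bp, cpu, type);
        }

i.e. when all of the task's breakpoints are CPU-independent, the
rhashtable lookup is done once per reservation instead of once per CPU.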
> >
> > There are 3 main cases:
> > 1. Per-cpu bp.
>
> Yes, CPU-target breakpoint on just 1 CPU.
>
> > 2. Per-task and per-cpu bp.
>
> Task-target breakpoint but pinned to 1 CPU.
>
> > 3. Per-task bp (on all cpus)
>
> Task-target breakpoint that can run on all CPUs.
>
> > right?
> >
> > For case 1 we still seem to do lots of unnecessary work in
> > fetch_bp_busy_slots() by iterating over all CPUs. We are going to bump
> > only the CPU's cpu_pinned, so that's the only CPU we need to
> > fetch/check.
>
> It'll just do 1 iteration, because cpumask_of_bp() will return a mask
> with just the event's target CPU in it.

You are right. I missed the use of cpumask_of_bp().
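
For the record, the helper is roughly this (quoting
kernel/events/hw_breakpoint.c from memory):

        static const struct cpumask *cpumask_of_bp(struct perf_event *bp)
        {
                /* CPU-target or CPU-pinned breakpoint: just that one CPU. */
                if (bp->cpu >= 0)
                        return cpumask_of(bp->cpu);
                /* Task breakpoint that can run anywhere: all possible CPUs. */
                return cpu_possible_mask;
        }

so for a CPU-target breakpoint (case 1) or a CPU-pinned task breakpoint
(case 2), the for_each_cpu() loop in fetch_bp_busy_slots() really does
visit only bp->cpu.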

> > For case 2 we also do lots of unnecessary work: again we only need to
> > check 1 CPU (and don't need cached_tbp_pinned). We also don't need to
> > do the atomic dec/inc on all other CPUs (they dec/inc the same variable).
>
> Same as above, just 1 iteration because cpumask_of_bp() does the right
> thing. cached_tbp_pinned may still be used if all existing task
> breakpoints are CPU-independent (i.e. cpu==-1; granted, doing
> task_bp_pinned() once or twice probably is irrelevant in this case).
>
> > Case 3 is the only one where we need to check all CPUs and where
> > cached_tbp_pinned may be useful.
> > But I wonder if we could instead add a per-task
> > has_per_cpu_breakpoints flag. Then if the flag is set, we check all
> > CPUs as we do now (don't need cached_tbp_pinned). And if it's not set,
> > then we could optimize the code even more by making it O(1) instead of
> > O(N).
>
> > Namely, we add a global tsk_pinned for tasks that don't have
> > per-cpu breakpoints, and update only that tsk_pinned instead of
> > iterating over all CPUs.
>
> That seems reasonable.
>
> > I think this will require adding cpu_pinned as well (similar to
> > tsk_pinned but aggregated over all CPUs).
>
> > Then the fast path capacity check can become just:
> >
> > if (bp->hw.target && !bp->hw.target->has_per_cpu_breakpoints && bp->cpu < 0) {
> >         if (max_cpu_bp_pinned(type) + task_bp_pinned(-1 /*cpu*/, bp, type) +
> >             hw_breakpoint_weight(bp) > nr_slots[type])
> >                 return -ENOSPC;
> > }
> >
> > Does it make any sense?
>
> Yes, I think this might work. I'll see if I can make it work.

Actually!
This is somewhat orthogonal to the optimizations you are doing, but
the most interesting case for us is inherited events, and it seems
that an inherited event can't possibly overflow the capacity:
inherited events are a subset of the parent's events, all parent
events have already passed validation, and the child can't have any
events of its own at the point where the inherited events are created.
So couldn't we detect that reserve_bp_slot() is called from
inherit_event() and skip fetch_bp_busy_slots() altogether? Perhaps
that can be detected by looking at bp->attr.inherit and the presence
of a parent context? Capacity validation could be kept as a
debug-only check.
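
E.g. something like this (completely untested, and whether these are
the right fields to look at is exactly what I'm not sure about; the
helper name is made up):

        /*
         * True if @bp was created by inherit_event(): it mirrors an event
         * that already passed reserve_bp_slot() in the parent, so the
         * capacity check could be skipped (or demoted to a debug-only
         * check) for it.
         */
        static bool hw_breakpoint_is_inherited(struct perf_event *bp)
        {
                return bp->attr.inherit && bp->parent;
        }

and then __reserve_bp_slot() would only do the busy-slots computation
when this returns false.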