Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
From: Marcelo Tosatti
Date: Fri Jun 28 2024 - 14:48:21 EST
On Mon, Jun 24, 2024 at 11:57:57PM -0300, Leonardo Bras wrote:
> On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote:
> > On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> > > Hi,
> > >
> > > you've included tglx, which is great, but there's also LOCKING PRIMITIVES
> > > section in MAINTAINERS so I've added folks from there in my reply.
> >
> > Thanks!
> >
> > > Link to full series:
> > > https://lore.kernel.org/all/20240622035815.569665-1-leobras@xxxxxxxxxx/
> > >
> >
> > And apologies to Leonardo... I think this is a follow-up of:
> >
> > https://lpc.events/event/17/contributions/1484/
> >
> > and I did remember we had a quick chat after that which I suggested it's
> > better to change to a different name, sorry that I never found time to
> > write a proper rely to your previous seriese [1] as promised.
> >
> > [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@xxxxxxxxxx/
>
> That's correct, I commented about this in the end of above presentation.
> Don't worry, and thanks for suggesting the per-cpu naming, it was very
> helpful on designing this solution.
>
> >
> > > On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > > > The problem:
> > > > Some places in the kernel implement a parallel programming strategy
> > > > consisting on local_locks() for most of the work, and some rare remote
> > > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > > kernels, even though the very few remote operations will be expensive due
> > > > to scheduling overhead.
> > > >
> > > > On the other hand, for RT workloads this can represent a problem: getting
> > > > an important workload scheduled out to deal with remote requests is
> > > > sure to introduce unexpected deadline misses.
> > > >
> > > > The idea:
> > > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > > In this case, instead of scheduling work on a remote cpu, it should
> > > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > > work locally. Tha major cost, which is un/locking in every local function,
> > > > already happens in PREEMPT_RT.
> > >
> > > I've also noticed this a while ago (likely in the context of rewriting SLUB
> > > to use local_lock) and asked about it on IRC, and IIRC tglx wasn't fond of
> > > the idea. But I forgot the details about why, so I'll let the the locking
> > > experts reply...
> > >
> >
> > I think it's a good idea, especially the new name is less confusing ;-)
> > So I wonder Thomas' thoughts as well.
>
> Thanks!
>
> >
> > And I think a few (micro-)benchmark numbers will help.
>
> Last year I got some numbers on how replacing local_locks with
> spinlocks would impact memcontrol.c cache operations:
>
> https://lore.kernel.org/all/20230125073502.743446-1-leobras@xxxxxxxxxx/
>
> tl;dr: It increased clocks spent in the most common this_cpu operations,
> while reducing clocks spent in remote operations (drain_all_stock).
>
> In RT case, since local locks are already spinlocks, this cost is
> already paid, so we can get results like these:
>
> drain_all_stock
> cpus Upstream Patched Diff (cycles) Diff(%)
> 1 44331.10831 38978.03581 -5353.072507 -12.07520567
> 8 43992.96512 39026.76654 -4966.198572 -11.2886198
> 128 156274.6634 58053.87421 -98220.78915 -62.85138425
>
> Upstream: Clocks to schedule work on remote CPU (performing not accounted)
> Patched: Clocks to grab remote cpu's spinlock and perform the needed work
> locally.
>
> Do you have other suggestions to use as (micro-) benchmarking?
>
> Thanks!
> Leo
One improvement which was noted when mm/page_alloc.c was converted to
spinlock + remote drain was that, it can bypass waiting for kwork
to be scheduled (on heavily loaded CPUs).
commit 443c2accd1b6679a1320167f8f56eed6536b806e
Author: Nicolas Saenz Julienne <nsaenzju@xxxxxxxxxx>
Date: Fri Jun 24 13:54:22 2022 +0100
mm/page_alloc: remotely drain per-cpu lists
Some setups, notably NOHZ_FULL CPUs, are too busy to handle the per-cpu
drain work queued by __drain_all_pages(). So introduce a new mechanism to
remotely drain the per-cpu lists. It is made possible by remotely locking
'struct per_cpu_pages' new per-cpu spinlocks. A benefit of this new
scheme is that drain operations are now migration safe.
There was no observed performance degradation vs. the previous scheme.
Both netperf and hackbench were run in parallel to triggering the
__drain_all_pages(NULL, true) code path around ~100 times per second. The
new scheme performs a bit better (~5%), although the important point here
is there are no performance regressions vs. the previous mechanism.
Per-cpu lists draining happens only in slow paths.
Minchan Kim tested an earlier version and reported;
My workload is not NOHZ CPUs but run apps under heavy memory
pressure so they goes to direct reclaim and be stuck on
drain_all_pages until work on workqueue run.
unit: nanosecond
max(dur) avg(dur) count(dur)
166713013 487511.77786438033 1283
From traces, system encountered the drain_all_pages 1283 times and
worst case was 166ms and avg was 487us.
The other problem was alloc_contig_range in CMA. The PCP draining
takes several hundred millisecond sometimes though there is no
memory pressure or a few of pages to be migrated out but CPU were
fully booked.
Your patch perfectly removed those wasted time.