Re: [PATCH v4 0/4] percpu: partial chunk depopulation

From: Dennis Zhou
Date: Tue Apr 20 2021 - 10:39:10 EST


On Tue, Apr 20, 2021 at 04:37:02PM +0530, Pratik Sampat wrote:
>
> On 20/04/21 4:27 am, Dennis Zhou wrote:
> > On Mon, Apr 19, 2021 at 10:50:43PM +0000, Dennis Zhou wrote:
> > > Hello,
> > >
> > > This series is a continuation of Roman's series in [1]. It aims to solve
> > > chunks holding onto free pages by adding a reclaim process to the percpu
> > > balance work item.
> > >
> > > The main difference is that the nr_empty_pop_pages is now managed at
> > > time of isolation instead of intermixed. This helps with deciding which
> > > chunks to free instead of having to interleave returning chunks to
> > > active duty.
> > >
> > > The allocation priority is as follows:
> > > 1) appropriate chunk slot increasing until fit
> > > 2) sidelined chunks
> > > 3) full free chunks
> > >
> > > The last slot for to_depopulate is never used for allocations.
> > >
> > > A big thanks to Roman for initiating the work and being available for
> > > iterating on these ideas.
> > >
> > > This patchset contains the following 4 patches:
> > > 0001-percpu-factor-out-pcpu_check_block_hint.patch
> > > 0002-percpu-use-pcpu_free_slot-instead-of-pcpu_nr_slots-1.patch
> > > 0003-percpu-implement-partial-chunk-depopulation.patch
> > > 0004-percpu-use-reclaim-threshold-instead-of-running-for-.patch
> > >
> > > 0001 and 0002 are clean ups. 0003 implement partial chunk depopulation
> > > initially from Roman. 0004 adds a reclaim threshold so we do not need to
> > > schedule for every page freed.
> > >
> > > This series is on top of percpu$for-5.14 67c2669d69fb.
> > >
> > > diffstats below:
> > >
> > > Dennis Zhou (2):
> > > percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
> > > percpu: use reclaim threshold instead of running for every page
> > >
> > > Roman Gushchin (2):
> > > percpu: factor out pcpu_check_block_hint()
> > > percpu: implement partial chunk depopulation
> > >
> > > mm/percpu-internal.h | 5 +
> > > mm/percpu-km.c | 5 +
> > > mm/percpu-stats.c | 20 ++--
> > > mm/percpu-vm.c | 30 ++++++
> > > mm/percpu.c | 252 ++++++++++++++++++++++++++++++++++++++-----
> > > 5 files changed, 278 insertions(+), 34 deletions(-)
> > >
> > > Thanks,
> > > Dennis
> > Hello Pratik,
> >
> > Do you mind testing this series again on POWER9? The base is available
> > here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu.git/log/?h=for-5.14
> >
> > Thanks,
> > Dennis
>
> Hello Dennis, I have tested this patchset on POWER9.
>
> I have tried variations of the percpu_test in the top level and nested cgroups
> creation as the test with 1000:10 didn't show any benefits.

This is most likely because keeping 1 in every 11 allocations still pins
every page, while keeping 1 in 50 does not. Can you try the patch below on
top? I think it may show slightly better perf as well. If it doesn't, I'll
just drop it.
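
To illustrate the pinning argument, here is a rough userspace sketch, not
percpu code. It assumes equally sized objects packed back to back, which is
a simplification of the real first-fit placement inside chunks. The helper
name pinned_pages() and all sizes are hypothetical, picked only for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Count the pages that still hold at least one live object after every
 * object except each keep_every-th one has been freed.
 *
 * Assumption: objects are packed back to back, objs_per_page per page.
 * The real percpu allocator places allocations first-fit within chunks,
 * so this only models the general effect.
 */
size_t pinned_pages(size_t n_objs, size_t objs_per_page, size_t keep_every)
{
	assert(objs_per_page > 0 && keep_every > 0);

	size_t n_pages = (n_objs + objs_per_page - 1) / objs_per_page;
	bool *pinned = calloc(n_pages ? n_pages : 1, sizeof(*pinned));
	size_t count = 0;

	assert(pinned);

	/* Survivors sit at indices 0, keep_every, 2 * keep_every, ... */
	for (size_t i = 0; i < n_objs; i += keep_every)
		pinned[i / objs_per_page] = true;

	for (size_t p = 0; p < n_pages; p++)
		count += pinned[p];

	free(pinned);
	return count;
}
```

With a hypothetical 32 objects per page (say 2 KiB objects on 64 KiB pages)
and 32000 objects, pinned_pages(32000, 32, 11) reports all 1000 pages still
pinned, while pinned_pages(32000, 32, 50) reports 640, so 360 pages become
eligible for depopulation.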

>
> The following example shows more consistent benefits with the de-allocation
> strategy.
> Outer: 1000
> Inner: 50
> # ./percpu_test.sh
> Percpu: 6912 kB
> Percpu: 532736 kB
> Percpu: 278784 kB
>
> I believe it could be a result of bulk freeing within "free_unref_page_commit",
> where pages are only free'd if pcp->count >= pcp->high. As POWER has a larger
> page size it would end up creating lesser number of pages but with the
> effects of fragmentation.

This is unrelated to per cpu pages in slab/slub. Percpu is a separate
memory allocator.

>
> Having said that, the patchset and its behavior does look good to me.

Thanks, can I throw the following on the appropriate patches? In the
future it's good to be explicit about this because some prefer to credit
different emails.

Tested-by: Pratik Sampat <psampat@xxxxxxxxxxxxx>

Thanks,
Dennis

The following may do a little better on POWER9:
---