Re: [PATCH RFC v2 00/10] SLUB percpu sheaves

From: Suren Baghdasaryan
Date: Sun Feb 23 2025 - 20:43:39 EST

Next message: Jason Wang: "Re: [PATCH v6 4/6] vhost: introduce worker ops to support multiple thread models"
Previous message: Dapeng Mi: "[PATCH 2/2] perf tools/tests: Fix topdown groups test on hybrid platforms"
In reply to: Suren Baghdasaryan: "Re: [PATCH RFC v2 00/10] SLUB percpu sheaves"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sun, Feb 23, 2025 at 5:36 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet
> > <kent.overstreet@xxxxxxxxx> wrote:
> > >
> > > On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > > > - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> > > > after Patch 5 it's preempt_disable() and no atomic operations. Same for
> > > > > freeing, which is normally a local double cmpxchg only for short-term
> > > > > allocations (so the same slab is still active on the same cpu when
> > > > freeing the object) and a more costly locked double cmpxchg otherwise.
> > > > The downside is the lack of NUMA locality guarantees for the allocated
> > > > objects.
> > >
> > > Is that really cheaper than a local non locked double cmpxchg?
> >
> > Don't know about this particular part but testing sheaves with maple
> > node cache and stress testing mmap/munmap syscalls shows performance
> > benefits as long as there is some delay to let kfree_rcu() do its job.
> > I'm still gathering results and will most likely post them tomorrow.
>
> Here are the promised test results:
>
> First I ran an Android app cycle test comparing the baseline against sheaves
> used for maple tree nodes (as this patchset implements). I measured about a
> 3% improvement in app launch times, indicating an improvement in mmap syscall
> performance.
> Next I ran an mmap stress test which maps five 1-page readable file-backed
> areas, faults them in and finally unmaps them, timing the mmap syscalls.

I forgot to mention that I also added a 500us delay after each cycle
described above to give kfree_rcu() a chance to run.

> Each run repeats that cycle 200000 times and reports the total time. The
> average of 10 such runs is used as the final result.
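
For anyone who wants to try something similar, the cycle described above can
be sketched in user space roughly as follows. This is a minimal sketch and not
the actual test program: the names mmap_cycle() and mmap_stress(), the use of
MAP_PRIVATE, mapping offset 0 each time, and timing with CLOCK_MONOTONIC are
all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NR_AREAS 5 /* five 1-page areas per cycle, as in the description */

/* Nanoseconds elapsed between two CLOCK_MONOTONIC samples. */
static int64_t elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (int64_t)(b->tv_nsec - a->tv_nsec);
}

/*
 * One cycle: map NR_AREAS read-only file-backed pages, fault each one in,
 * then unmap them. Returns the time spent in mmap() only (ns), or -1 on
 * error. fd must refer to a file that is at least one page long.
 * Assumption: every area maps offset 0 of the same file.
 */
static int64_t mmap_cycle(int fd, size_t page)
{
    void *area[NR_AREAS];
    struct timespec t0, t1;
    volatile unsigned char sink = 0;
    int64_t total = 0;
    int mapped = 0;

    for (int i = 0; i < NR_AREAS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        area[i] = mmap(NULL, page, PROT_READ, MAP_PRIVATE, fd, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (area[i] == MAP_FAILED)
            break;
        mapped++;
        total += elapsed_ns(&t0, &t1);
    }
    for (int i = 0; i < mapped; i++)
        sink = *(volatile unsigned char *)area[i]; /* read fault */
    for (int i = 0; i < mapped; i++)
        munmap(area[i], page);
    (void)sink;
    return mapped == NR_AREAS ? total : -1;
}

/*
 * Run `cycles` cycles, sleeping delay_us microseconds after each one so
 * that deferred frees get a chance to run. Returns total mmap() time in
 * ns, or -1 on error.
 */
static int64_t mmap_stress(int fd, long cycles, long delay_us)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct timespec pause = { delay_us / 1000000, (delay_us % 1000000) * 1000 };
    int64_t total = 0;

    for (long c = 0; c < cycles; c++) {
        int64_t ns = mmap_cycle(fd, page);
        if (ns < 0)
            return -1;
        total += ns;
        nanosleep(&pause, NULL);
    }
    return total;
}
```

With these helpers, mmap_stress(fd, 200000, 500) would correspond to one run
with the cycle count and delay mentioned in this thread; averaging over 10
runs is left to the caller.
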
> 3 configurations were tested:
>
> 1. Sheaves used for maple tree nodes only (this patchset).
>
> 2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
> The patchset in [1] avoids allocating an additional vm_lock structure on
> each mmap syscall and uses TYPESAFE_BY_RCU for the vm_area_struct cache.
>
> 3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
> to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
> TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.
>
> The values represent the total time it took to perform mmap syscalls; lower
> is better.
>
> (1)            baseline    control
> Little core    7.58327     6.614939 (-12.77%)
> Medium core    2.125315    1.428702 (-32.78%)
> Big core       0.514673    0.422948 (-17.82%)
>
> (2)            baseline    control
> Little core    7.58327     5.141478 (-32.20%)
> Medium core    2.125315    0.427692 (-79.88%)
> Big core       0.514673    0.046642 (-90.94%)
>
> (3)            baseline    control
> Little core    7.58327     4.779624 (-36.97%)
> Medium core    2.125315    0.450368 (-78.81%)
> Big core       0.514673    0.037776 (-92.66%)
>
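
The percentages in parentheses are consistent with the relative change of the
control time against the baseline. A small sketch of that arithmetic follows;
the helper name pct_change() is made up for illustration and is not part of
any test tooling mentioned here.

```c
/*
 * Relative change of `control` versus `baseline`, in percent.
 * A negative result means the control run took less time.
 */
static double pct_change(double baseline, double control)
{
    return (control - baseline) / baseline * 100.0;
}
```

For example, pct_change(7.58327, 6.614939) comes out at about -12.77, which
matches the Little core row of configuration (1).
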
> Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> yields slightly better averages. I noticed that this was mostly because
> the sheaves results lacked the occasional spikes that worsened the
> TYPESAFE_BY_RCU averages (the results seemed more stable with
> sheaves).
>
> [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@xxxxxxxxxx/
>
> >
> > >
> > > Especially if you now have to use pushf/popf...
> > >
> > > > - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> > > > separate percpu sheaf and only submit the whole sheaf to call_rcu()
> > > > when full. After the grace period, the sheaf can be used for
> > > > allocations, which is more efficient than freeing and reallocating
> > > > individual slab objects (even with the batching done by kfree_rcu()
> > > > implementation itself). In case only some cpus are allowed to handle rcu
> > > > callbacks, the sheaf can still be made available to other cpus on the
> > > > same node via the shared barn. The maple_node cache uses kfree_rcu() and
> > > > thus can benefit from this.
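
To make the batching idea in the quoted paragraph concrete, here is a toy
user-space model of it. It is only a sketch under stated assumptions and is
not the code from the patchset: struct toy_sheaf, toy_kfree_rcu(),
toy_grace_period_done(), toy_alloc() and the capacity of 4 are all
hypothetical, and the grace period is simulated by an explicit function call
instead of call_rcu().

```c
#include <stddef.h>

#define SHEAF_CAPACITY 4 /* hypothetical size, chosen only for the example */

enum sheaf_state { SHEAF_FILLING, SHEAF_WAITING_GP, SHEAF_READY };

struct toy_sheaf {
    enum sheaf_state state;
    size_t size;
    void *objects[SHEAF_CAPACITY];
};

/*
 * Toy kfree_rcu(): stash obj in the sheaf instead of queueing it on its own.
 * Returns 1 if this put filled the sheaf, in which case the whole sheaf is
 * "submitted" (state WAITING_GP); 0 if there is still room; -1 if the sheaf
 * is not accepting objects.
 */
static int toy_kfree_rcu(struct toy_sheaf *s, void *obj)
{
    if (s->state != SHEAF_FILLING)
        return -1;
    s->objects[s->size++] = obj;
    if (s->size < SHEAF_CAPACITY)
        return 0;
    s->state = SHEAF_WAITING_GP; /* stands in for one submission per sheaf */
    return 1;
}

/* Stands in for the callback that runs once the grace period has elapsed. */
static void toy_grace_period_done(struct toy_sheaf *s)
{
    if (s->state == SHEAF_WAITING_GP)
        s->state = SHEAF_READY;
}

/* Toy allocation: reuse objects straight from a sheaf whose grace period ended. */
static void *toy_alloc(struct toy_sheaf *s)
{
    if (s->state != SHEAF_READY || s->size == 0)
        return NULL;
    return s->objects[--s->size];
}
```

The point of the model is that objects are neither freed nor reallocated
individually: a full sheaf goes through the grace period as one unit and is
then handed back to the allocation path as is.
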
> > >
> > > Have you looked at fs/bcachefs/rcu_pending.c?
