Re: [PATCH 4/4] mm/filemap: optimize filemap folio adding

From: Kairui Song
Date: Thu Mar 21 2024 - 14:36:21 EST


On Wed, Mar 20, 2024 at 5:06 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Wed, Mar 20, 2024 at 6:20 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Tue, Mar 19, 2024 at 05:27:33PM +0800, Kairui Song wrote:
> > > From: Kairui Song <kasong@xxxxxxxxxxx>
> > >
> > > Instead of doing multiple tree walks, do one optimism range check
> > > with lock hold, and exit if raced with another insertion. If a shadow
> > > exists, check it with a new xas_get_order helper before releasing the
> > > lock to avoid redundant tree walks for getting its order.
> > >
> > > Drop the lock and do the allocation only if a split is needed.
> > >
> > > In the best case, it only need to walk the tree once. If it needs
> > > to alloc and split, 3 walks are issued (One for first ranced
> > > conflict check and order retrieving, one for the second check after
> > > allocation, one for the insert after split).
> > >
> > > Testing with 4k pages, in an 8G cgroup, with 20G brd as block device:
> > >
> > > fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> > > --buffered=1 --ioengine=mmap --rw=randread --time_based \
> > > --ramp_time=30s --runtime=5m --group_reporting
> > >
> > > Before:
> > > bw ( MiB/s): min= 790, max= 3665, per=100.00%, avg=2499.17, stdev=20.64, samples=8698
> > > iops : min=202295, max=938417, avg=639785.81, stdev=5284.08, samples=8698
> > >
> > > After (+4%):
> > > bw ( MiB/s): min= 451, max= 3868, per=100.00%, avg=2599.83, stdev=23.39, samples=8653
> > > iops : min=115596, max=990364, avg=665556.34, stdev=5988.20, samples=8653
> > >
> > > Test result with THP (do a THP randread then switch to 4K page in hope it
> > > issues a lot of splitting):
> > >
> > > fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> > > --buffered=1 --ioengine mmap -thp=1 --readonly \
> > > --rw=randread --random_distribution=random \
> > > --time_based --runtime=5m --group_reporting
> > >
> > > fio -name=cached --numjobs=16 --filename=/mnt/test.img \
> > > --buffered=1 --ioengine mmap --readonly \
> > > --rw=randread --random_distribution=random \
> > > --time_based --runtime=5s --group_reporting
> > >
> > > Before:
> > > bw ( KiB/s): min=28071, max=62359, per=100.00%, avg=53542.44, stdev=179.77, samples=9520
> > > iops : min= 7012, max=15586, avg=13379.39, stdev=44.94, samples=9520
> > > bw ( MiB/s): min= 2457, max= 6193, per=100.00%, avg=3923.21, stdev=82.48, samples=144
> > > iops : min=629220, max=1585642, avg=1004340.78, stdev=21116.07, samples=144
> > >
> > > After (+-0.0%):
> > > bw ( KiB/s): min=30561, max=63064, per=100.00%, avg=53635.82, stdev=177.21, samples=9520
> > > iops : min= 7636, max=15762, avg=13402.82, stdev=44.29, samples=9520
> > > bw ( MiB/s): min= 2449, max= 6145, per=100.00%, avg=3914.68, stdev=81.15, samples=144
> > > iops : min=627106, max=1573156, avg=1002158.11, stdev=20774.77, samples=144
> > >
> > > The performance is better (+4%) for 4K cached read and unchanged for THP.
> > >
> > > Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> > > ---
> > > mm/filemap.c | 127 ++++++++++++++++++++++++++++++---------------------
> > > 1 file changed, 76 insertions(+), 51 deletions(-)
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 6bbec8783793..c1484bcdbddb 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -848,12 +848,77 @@ void replace_page_cache_folio(struct folio *old, struct folio *new)
> > > }
> > > EXPORT_SYMBOL_GPL(replace_page_cache_folio);
> > >
> > > +static int __split_add_folio_locked(struct xa_state *xas, struct folio *folio,
> > > + pgoff_t index, gfp_t gfp, void **shadowp)
> >
>
> Thanks for the very helpful review!
>
> > I don't love the name of this function. Splitting is a rare thing that
> > it does. I'd suggest it's more filemap_store().
>
> Yes, the function name is a bit misleading indeed, I can rename it as
> you suggested, eg. __filemap_store_locked ?
>
> >
> > > +{
> > > + void *entry, *shadow, *alloced_shadow = NULL;
> > > + int order, alloced_order = 0;
> > > +
> > > + gfp &= GFP_RECLAIM_MASK;
> > > + for (;;) {
> > > + shadow = NULL;
> > > + order = 0;
> > > +
> > > + xas_for_each_conflict(xas, entry) {
> > > + if (!xa_is_value(entry))
> > > + return -EEXIST;
> > > + shadow = entry;
> > > + }
> > > +
> > > + if (shadow) {
> > > + if (shadow == xas_reload(xas)) {
> >
> > Why do you need the xas_reload here?
>
> This part of code works based on the guarantee that If there is a
> larger entry, it will be the first and only entry iterated by
> xas_for_each_conflict/xas_find_conflict. I added an xas_reload is here
> to ensure that. But on second thought, this seems not needed indeed.
>
> Will it be better if I write this part this way?
>
> + shadow = NULL;
> + order = -1;
> + xas_for_each_conflict(xas, entry) {
> + if (!xa_is_value(entry))
> + return -EEXIST;

I noticed this should release potential alloced xas data, will fix this in v2.

> + /*
> + * If there is a larger entry, it will be the first
> + * and only entry iterated.
> + */
> + if (order == -1)
> + order = xas_get_order(xas);
> + shadow = entry;
> + }
> +
> + if (shadow) {
> + /* check if alloc & split need, or if previous alloc is
> still valid */
> + if (order > 0 && order > folio_order(folio)) {
> + if (shadow != alloced_shadow || order != alloced_order)
> + goto unlock;
> + xas_split(xas, shadow, order);
> + xas_reset(xas);
> + }
> + order = -1;
> + if (shadowp)
> + *shadowp = shadow;
> + }
>

Besides this, this should be OK? I think I can add more tests for
xas_for_each_conflict and xas_get_order to ensure this works, need to
export xas_get_order as GPL symbol for that.

>
> If there is a larger slot, xas_for_each_conflict and check above
> should catch that?
>
> >
> > > + if (shadowp)
> > > + *shadowp = shadow;
> > > + }
> > > +
> > > + xas_store(xas, folio);
> > > + /* Success, return with mapping locked */
> > > + if (!xas_error(xas))
> > > + return 0;
> > > +unlock:
> > > + /*
> > > + * Unlock path, if errored, return unlocked.
> > > + * If allocation needed, alloc and retry.
> > > + */
> > > + xas_unlock_irq(xas);
> > > + if (order) {
> > > + if (unlikely(alloced_order))
> > > + xas_destroy(xas);
> > > + xas_split_alloc(xas, shadow, order, gfp);
> > > + if (!xas_error(xas)) {
> > > + alloced_shadow = shadow;
> > > + alloced_order = order;
> > > + }
> > > + goto next;
> > > + }
> > > + /* xas_nomem result checked by xas_error below */
> > > + xas_nomem(xas, gfp);
> > > +next:
> > > + xas_lock_irq(xas);
> > > + if (xas_error(xas))
> > > + return xas_error(xas);
> > > +
> > > + xas_reset(xas);
> > > + }
> > > +}
> >
> > Splitting this out into a different function while changing the logic
> > really makes this hard to review ;-(
>
> Sorry about this :(
>
> This patch basically rewrites the logic of __filemap_add_folio and the
> function is getting long, so I thought it would be easier to
> understand if we split it out.
> I initially updated the code in place but that change diff seems more
> messy to me.
>
> >
> > I don't object to splitting the function, but maybe two patches; one
> > to move the logic and a second to change it?
> >
>
> I can keep it in place in V2.