Re: [PATCH v2 0/4] mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE

From: SeongJae Park
Date: Tue Apr 08 2025 - 16:23:42 EST


On Tue, 8 Apr 2025 14:44:40 +0100 Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote:

> On Fri, Apr 04, 2025 at 02:06:56PM -0700, SeongJae Park wrote:
> > When process_madvise() is called to do MADV_DONTNEED[_LOCKED] or
> > MADV_FREE with multiple address ranges, tlb flushes happen for each of
> > the given address ranges. Because such tlb flushes are for same
>
> Nit: for _the_ same.

Thank you for kindly finding these mistakes and suggesting fixes. I will
apply your suggestions here and below.

[...]
> > Similar optimizations might be applicable to other madvise behaviros
>
> Typo: behaviros -> behavior (or 'behaviors', but since behavior is already plural
> probably unnecessary).
>
> > such as MADV_COLD and MADV_PAGEOUT. Those are simply out of the scope
> > of this patch series, though.
>
> Well well, for now :)

Yes. Hopefully we will have another chance to further improve those cases.

[...]
> > Test Results
> > ============
> >
> > I measured the latency to apply MADV_DONTNEED advice to 256 MiB memory
> > using multiple process_madvise() calls. I apply the advice in 4 KiB
> > sized regions granularity, but with varying batch size per
> > process_madvise() call (vlen) from 1 to 1024. The source code for the
> > measurement is available at GitHub[1]. To reduce measurement errors, I
> > did the measurement five times.
>
> It would be interesting to see how this behaves with mTHP sizing too! But
> that is probably a bit out of scope.

Obviously we have much more room to explore and have fun :)
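
For readers who have not looked at the measurement program, the batching
pattern described above can be sketched roughly as below. This is only an
illustration, not the code at [1]: fill_batch(), dontneed_batched(),
REGION_SZ and MAX_BATCH are hypothetical names, the fallback syscall number
440 is an assumption for architectures using the common syscall table, and
the sketch assumes a kernel where process_madvise() accepts MADV_DONTNEED
for the target process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef SYS_process_madvise
#define SYS_process_madvise 440	/* assumption: common syscall table number */
#endif

#define REGION_SZ 4096UL	/* hypothetical: 4 KiB per advised region */
#define MAX_BATCH 1024UL	/* hypothetical: largest vlen used here */

/* Describe up to sz_batch consecutive 4 KiB regions; returns the vlen. */
size_t fill_batch(struct iovec *vec, char *start, size_t nr_left,
		  size_t sz_batch)
{
	size_t vlen = nr_left < sz_batch ? nr_left : sz_batch;
	size_t i;

	for (i = 0; i < vlen; i++) {
		vec[i].iov_base = start + i * REGION_SZ;
		vec[i].iov_len = REGION_SZ;
	}
	return vlen;
}

/*
 * Apply MADV_DONTNEED to [buf, buf + len) with sz_batch regions per
 * process_madvise() call.  A partially handled batch (return value
 * smaller than the requested bytes) is not retried in this sketch.
 */
int dontneed_batched(int pidfd, char *buf, size_t len, size_t sz_batch)
{
	struct iovec vec[MAX_BATCH];
	size_t nr_regions = len / REGION_SZ;
	size_t done = 0;

	assert(sz_batch >= 1 && sz_batch <= MAX_BATCH);
	while (done < nr_regions) {
		size_t vlen = fill_batch(vec, buf + done * REGION_SZ,
					 nr_regions - done, sz_batch);

		if (syscall(SYS_process_madvise, pidfd, vec, vlen,
			    MADV_DONTNEED, 0) < 0)
			return -1;
		done += vlen;
	}
	return 0;
}
```

With sz_batch 1 this makes one system call per 4 KiB region, while with
sz_batch 1024 it makes one call per 4 MiB, which is the knob varied in the
measurement.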

>
> >
> > The measurement results are as below. The 'sz_batch' column shows the
> > batch size of process_madvise() calls. The 'Before' and 'After' columns
> > show the average latencies in nanoseconds, measured five times on
> > kernels built without and with the tlb flush batching of this series
> > (patches 3 and 4), respectively. For the baseline, the mm-new tree of
> > 2025-04-04[2] has been used. The 'B-stdev' and 'A-stdev' columns show
> > the ratio of the standard deviation of the latency measurements to
> > their average, in percent, for 'Before' and 'After', respectively.
> > 'Latency_reduction' shows the reduction of the latency that 'After'
> > has achieved compared to 'Before', in percent. Higher
> > 'Latency_reduction' values mean larger efficiency improvements.
> >
> > sz_batch  Before       B-stdev  After        A-stdev  Latency_reduction
> > 1         110948138.2  5.55     109476402.8  4.28     1.33
> > 2         75678535.6   1.67     70470722.2   3.55     6.88
> > 4         59530647.6   4.77     51735606.6   3.44     13.09
> > 8         50013051.6   4.39     44377029.8   5.20     11.27
> > 16        48657878.2   9.32     37291600.4   3.39     23.36
> > 32        43614180.2   6.06     34127428     3.75     21.75
> > 64        42466694.2   5.70     26737935.2   2.54     37.04
> > 128       42977970     6.99     25050444.2   4.06     41.71
> > 256       41549546     1.88     24644375.8   3.77     40.69
> > 512       42162698.6   6.17     24068224.8   2.87     42.92
> > 1024      40978574     5.44     23872024.2   3.65     41.75
>
> Very nice! Great work.
>
> >
> > As expected, tlb flush batching provides a latency reduction that
> > grows with the batch size. The efficiency gain ranges from about 6.88
> > percent with batch size 2 to about 41.71 percent with batch size 128,
> > and stays between about 40 and 43 percent for larger batch sizes.
> >
> > Please note that this is a very simple microbenchmark, so the
> > efficiency gain on real workloads could be very different.
>
> Indeed, accepted, but it makes a great deal of sense to batch these operations,
> especially when we get to the point of actually increasing the process_madvise()
> iov size.

Cannot agree more.
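
For anyone reproducing the numbers, the 'Latency_reduction' column in the
table above is consistent with the usual relative difference,
(Before - After) / Before * 100. A minimal sketch of that arithmetic follows;
latency_reduction_pct() is a hypothetical helper name used only for
illustration, not something taken from the measurement program.

```c
#include <assert.h>

/* Relative latency reduction of 'after' compared to 'before', in percent. */
double latency_reduction_pct(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

For example, latency_reduction_pct(75678535.6, 70470722.2) gives about 6.88,
which matches the sz_batch 2 row.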

Thank you for your kind review and great suggestions for this patchset. I
will post the next spin with the suggested changes soon.


Thanks,
SJ

[...]