Re: [PATCH] mm: clear 1G pages with streaming stores on x86

From: Kirill A. Shutemov
Date: Mon Mar 16 2020 - 08:20:02 EST


On Mon, Mar 16, 2020 at 11:18:56AM +0100, Michal Hocko wrote:
> On Wed 11-03-20 03:54:47, Kirill A. Shutemov wrote:
> > On Tue, Mar 10, 2020 at 05:21:30PM -0700, Cannon Matthews wrote:
> > > On Mon, Mar 9, 2020 at 11:37 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Mar 09, 2020 at 08:38:31AM -0700, Andi Kleen wrote:
> > > > > > Gigantic huge pages are a bit different. They are much less dynamic from
> > > > > > the usage POV in my experience. Micro-optimizations for the first access
> > > > > > tends to not matter at all as it is usually pre-allocation scenario. On
> > > > > > the other hand, speeding up the initialization sounds like a good thing
> > > > > > in general. It will be a single time benefit but if the additional code
> > > > > > is not hard to maintain then I would be inclined to take it even with
> > > > > > "artificial" numbers state above. There really shouldn't be other downsides
> > > > > > except for the code maintenance, right?
> > > > >
> > > > > There's a cautious tale of the old crappy RAID5 XOR assembler functions which
> > > > > were optimized a long time ago for the Pentium1, and stayed around,
> > > > > even though the compiler could actually do a better job.
> > > > >
> > > > > String instructions are constantly improving in performance (Broadwell is
> > > > > very old at this point) Most likely over time (and maybe even today
> > > > > on newer CPUs) you would need much more sophisticated unrolled MOVNTI variants
> > > > > (or maybe even AVX-*) to be competitive.
> > > >
> > > > Presumably you have access to current and maybe even some unreleased
> > > > CPUs ... I mean, he's posted the patches, so you can test this hypothesis.
> > >
> > > I don't have the data at hand, but could reproduce it if strongly
> > > desired, but I've also tested this on skylake and cascade lake, and
> > > we've had success running with this for a while now.
> > >
> > > When developing this originally, I tested all of this compared with
> > > AVX-* instructions as well as the string ops, they all seemed to be
> > > functionally equivalent, and all were beat out by this MOVNTI thing for
> > > large regions of 1G pages.
> > >
> > > There is probably room to further optimize the MOVNTI stuff with better
> > > loop unrolling or optimizations, if anyone has specific suggestions I'm
> > > happy to try to incorporate them, but this has shown to be effective as
> > > written so far, and I think I lack that assembly expertise to micro
> > > optimize further on my own.
> >
> > Andi's point is that string instructions might be a better bet in a long
> > run. You may win something with MOVNTI on current CPUs, but it may become
> > a burden on newer microarchitectures when string instructions improves.
> > Nobody realistically would re-validate if MOVNTI microoptimazation still
> > make sense for every new microarchitecture.
>
> While this might be true, isn't that easily solveable by the existing
> ALTERNATIVE and cpu features framework. Can we have a feature bit to
> tell that movnti is worthwile for large data copy routines. Probably
> something for x86 maintainers.

It still need somody to test which approach is better for the CPU.
See X86_FEATURE_REP_GOOD.

--
Kirill A. Shutemov