Re: [PATCH 5/7] mm: move zone iterator outside of deferred_init_maxorder()

From: Alexander Duyck
Date: Thu May 07 2020 - 17:18:57 EST


On Thu, May 7, 2020 at 1:20 PM Daniel Jordan <daniel.m.jordan@xxxxxxxxxx> wrote:
>
> On Thu, May 07, 2020 at 08:26:26AM -0700, Alexander Duyck wrote:
> > On Wed, May 6, 2020 at 3:39 PM Daniel Jordan <daniel.m.jordan@xxxxxxxxxx> wrote:
> > > On Tue, May 05, 2020 at 08:27:52AM -0700, Alexander Duyck wrote:
> > > > > Maybe it's better to leave deferred_init_maxorder alone and adapt the
> > > > > multithreading to the existing implementation. That'd mean dealing with the
> > > > > pesky opaque index somehow, so deferred_init_mem_pfn_range_in_zone() could be
> > >
> > > I should have been explicit, was thinking of @i from
> > > () when mentioning the opaque index.
> >
> > Okay, that makes sense. However in reality you don't need to split
> > that piece out. All you really are doing is splitting up the
> > first_init_pfn value over multiple threads so you just need to make
> > use of deferred_init_mem_pfn_range_in_zone() to initialize it.
>
> Ok, I assume you mean that each thread should use
> deferred_init_mem_pfn_range_in_zone. Yes, that's what I meant when saying that
> function could be generalized, though not sure we should opt for this.

Yes, that is what I meant.

> > > > > generalized to find it in the thread function based on the start/end range, or
> > > > > it could be maintained as part of the range that padata passes to the thread
> > > > > function.
> > > >
> > > > You may be better off just implementing your threads to operate like
> > > > deferred_grow_zone does. All your worker thread really needs then is
> > > > to know where to start performing the page initialization and then it
> > > > could go through and process an entire section worth of pages. The
> > > > other bit that would have to be changed is patch 6 so that you combine
> > > > any ranges that might span a single section instead of just splitting
> > > > the work up based on the ranges.
> > >
> > > How are you thinking of combining them? I don't see a way to do it without
> > > storing an arbitrary number of ranges somewhere for each thread.
> >
> > So when you are putting together your data you are storing a starting
> > value and a length. All you end up having to do is make certain that
> > the size + start pfn is section aligned. Then if you jump to a new
> > section you have the option of either adding to the size of your
> > current section or submitting the range and starting with a new start
> > pfn in a new section. All you are really doing is breaking up the
> > first_deferred_pfn over multiple sections. What I would do is section
> > align end_pfn, and then check the next range from the zone. If the
> > start_pfn of the next range is less than end_pfn you merge the two
> > ranges by just increasing the size, otherwise you could start a new
> > range.
> >
> > The idea is that you just want to define what the valid range of PFNs
> > are, and if there are sizable holes you skip over them. You would
> > leave most of the lifting for identifying exactly what PFNs to
> > initialize to the pfn_range_in_zone iterators since they would all be
> > read-only accesses anyway.
>
> Ok, I follow you. My assumption is that there are generally few free pfn
> ranges relative to the total number of pfns being initialized so that it's
> efficient to parallelize over a single pfn range from the zone iterator. On
> the systems I tested, there were about 20 tiny ranges and one enormous range
> per node so that firing off a job per range kept things simple without
> affecting performance. If that assumption holds, I'm not sure it's worth it to
> merge ranges.

The idea behind merging ranges is to address possible cases where a
range is broken up such that there is a hole in a max order block as a
result. By combining the ranges if they both span the same section we
can guarantee that the entire section will be initialized as a block
and not potentially have partially initialized sections floating
around. Without that mo_pfn logic I had in there I was getting panics
every so often when booting up one of my systems as I recall.
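
To make the merging idea concrete, here is a rough userspace sketch. It
is not the kernel code: struct pfn_range, merge_ranges_by_section() and
the PAGES_PER_SECTION value are all made up for illustration. It assumes
the input ranges are sorted and non-overlapping. Each work item has its
end rounded up to a section boundary, and a following range is folded
into the current item if it starts before that aligned end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical value for illustration; the kernel derives this per arch. */
#define PAGES_PER_SECTION 32768UL

/* Hypothetical work item: half-open PFN range [start, end). */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Round pfn up to the next section boundary (power-of-two section size). */
static unsigned long section_align_up(unsigned long pfn)
{
	return (pfn + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1);
}

/*
 * Coalesce sorted, non-overlapping input ranges into work items whose
 * end is section aligned. A range that starts before the aligned end of
 * the current item shares a section with it, so it is merged by growing
 * the item; otherwise a new item is started. Holes inside an item are
 * left for the per-thread iterator to skip.
 *
 * Returns the number of items written to out (at most n).
 */
static size_t merge_ranges_by_section(const struct pfn_range *in, size_t n,
				      struct pfn_range *out)
{
	size_t count = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long end = section_align_up(in[i].end);

		assert(in[i].start < in[i].end);

		if (count && in[i].start < out[count - 1].end) {
			/* Same section as the current item: extend it. */
			if (end > out[count - 1].end)
				out[count - 1].end = end;
		} else {
			/* Starts in a new section: open a new item. */
			out[count].start = in[i].start;
			out[count].end = end;
			count++;
		}
	}

	return count;
}
```

With that, no two work items ever cover the same section, so no section
is split between two threads.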

Also the iterator itself is cheap. It is basically just walking a
read-only list so it scales efficiently as well. One of the reasons
why I arranged the code the way I did is that it also allowed me to
get rid of an extra check in the code as the previous code was having
to verify if the pfn belonged to the node. That is all handled
directly through the for_each_free_mem_pfn_range_in_zone[_from] call
now.

> With the series as it stands plus leaving in the section alignment check in
> deferred_grow_zone (which I think could be relaxed to a maxorder alignment
> check) so it doesn't stop mid-max-order-block, threads simply deal with a
> start/end range and deferred_init_maxorder becomes shorter and simpler too.

I still think we are better off initializing complete sections since
the pageblock_flags are fully initialized that way as well. What
guarantee do you have that all of the memory ranges will be max order
aligned? The problem is we have to guarantee all pages are initialized
before we start freeing the pages in a max order page. If we just
process each block as-is I believe we can end up with some
architectures trying to access uninitialized memory in the buddy
allocator as a result. That is why the deferred_init_maxorder function
will walk through the iterator, using the _from version to avoid
unnecessary iteration, the first time initializing the pages it needs
to cross that max order boundary, and then again to free the max order
block of pages that have been initialized. The iterator itself is
fairly cheap and only has to get you through the smaller ranges before
you end up at the one big range that it just kind of sits at while it
is working on getting it processed.
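
As a simplified userspace model of that two-pass walk (not the actual
deferred_init_maxorder(); the arrays, the struct, the function name and
the MAX_ORDER_NR_PAGES value here are invented for illustration, and the
ranges are assumed sorted and non-overlapping), the idea looks roughly
like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sizes for the model; not the kernel's values. */
#define MAX_ORDER_NR_PAGES 1024UL
#define NR_PFNS 4096UL

/* Hypothetical half-open PFN range [start, end). */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Stand-ins for struct page state and the buddy allocator. */
static bool page_initialized[NR_PFNS];
static bool page_freed[NR_PFNS];

/*
 * Process one max order block. *idx and *spfn give the current range
 * and the position inside it; both are advanced past the block.
 * Returns the number of pages initialized.
 */
static unsigned long init_one_maxorder_block(const struct pfn_range *r,
					     size_t n, size_t *idx,
					     unsigned long *spfn)
{
	/* First max order boundary strictly above *spfn. */
	unsigned long mo_pfn = (*spfn + MAX_ORDER_NR_PAGES) &
			       ~(MAX_ORDER_NR_PAGES - 1);
	unsigned long nr_pages = 0, pfn, s, e;
	size_t i;

	/* Pass 1: initialize every valid page below the boundary. */
	for (i = *idx; i < n && r[i].start < mo_pfn; i++) {
		s = (i == *idx) ? *spfn : r[i].start;
		e = r[i].end < mo_pfn ? r[i].end : mo_pfn;
		assert(e <= NR_PFNS);
		for (pfn = s; pfn < e; pfn++) {
			page_initialized[pfn] = true;
			nr_pages++;
		}
	}

	/* Pass 2: only now hand the same pages to the allocator. */
	for (i = *idx; i < n && r[i].start < mo_pfn; i++) {
		s = (i == *idx) ? *spfn : r[i].start;
		e = r[i].end < mo_pfn ? r[i].end : mo_pfn;
		for (pfn = s; pfn < e; pfn++) {
			assert(page_initialized[pfn]);
			page_freed[pfn] = true;
		}
		if (r[i].end > mo_pfn) {
			/* Range continues past the boundary: resume there. */
			*idx = i;
			*spfn = mo_pfn;
			return nr_pages;
		}
	}

	/* All ranges below the boundary are consumed. */
	*idx = i;
	if (i < n)
		*spfn = r[i].start;
	return nr_pages;
}
```

The point of the model is just the ordering: nothing in a max order
block is freed until every valid page in that block has been
initialized, even when the block is stitched together from several
ranges.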