Re: [PATCH 5/7] mm: move zone iterator outside of deferred_init_maxorder()

From: Daniel Jordan
Date: Thu May 07 2020 - 16:21:33 EST


On Thu, May 07, 2020 at 08:26:26AM -0700, Alexander Duyck wrote:
> On Wed, May 6, 2020 at 3:39 PM Daniel Jordan <daniel.m.jordan@xxxxxxxxxx> wrote:
> > On Tue, May 05, 2020 at 08:27:52AM -0700, Alexander Duyck wrote:
> > > > Maybe it's better to leave deferred_init_maxorder alone and adapt the
> > > > multithreading to the existing implementation. That'd mean dealing with the
> > > > pesky opaque index somehow, so deferred_init_mem_pfn_range_in_zone() could be
> >
> > I should have been explicit, was thinking of @i from
> > deferred_init_mem_pfn_range_in_zone() when mentioning the opaque index.
>
> Okay, that makes sense. However in reality you don't need to split
> that piece out. All you really are doing is splitting up the
> first_init_pfn value over multiple threads so you just need to make
> use of deferred_init_mem_pfn_range_in_zone() to initialize it.

Ok, I assume you mean that each thread should use
deferred_init_mem_pfn_range_in_zone().  Yes, that's what I meant when saying
that function could be generalized, though I'm not sure we should opt for this.
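
For concreteness, the per-thread side of that could look something like the
sketch below.  It leans on the existing deferred_init_mem_pfn_range_in_zone()
and deferred_init_maxorder() helpers; the function name and the
(start, end, arg) signature are placeholders I'm assuming here, not
necessarily what the series would end up with.

static void __init
deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
                           void *arg)
{
        struct zone *zone = arg;
        unsigned long spfn, epfn;
        u64 i;

        /*
         * Each thread rediscovers its own iterator state from its start
         * pfn instead of inheriting @i from the parent.
         */
        if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
                                                 start_pfn))
                return;

        /* Initialize and free pages one max-order block at a time. */
        while (spfn < end_pfn) {
                deferred_init_maxorder(&i, zone, &spfn, &epfn);
                cond_resched();
        }
}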

> > > > generalized to find it in the thread function based on the start/end range, or
> > > > it could be maintained as part of the range that padata passes to the thread
> > > > function.
> > >
> > > You may be better off just implementing your threads to operate like
> > > deferred_grow_zone does. All your worker thread really needs then is
> > > to know where to start performing the page initialization and then it
> > > could go through and process an entire section worth of pages. The
> > > other bit that would have to be changed is patch 6 so that you combine
> > > any ranges that might span a single section instead of just splitting
> > > the work up based on the ranges.
> >
> > How are you thinking of combining them? I don't see a way to do it without
> > storing an arbitrary number of ranges somewhere for each thread.
>
> So when you are putting together your data you are storing a starting
> value and a length. All you end up having to do is make certain that
> the size + start pfn is section aligned. Then if you jump to a new
> section you have the option of either adding to the size of your
> current section or submitting the range and starting with a new start
> pfn in a new section. All you are really doing is breaking up the
> first_deferred_pfn over multiple sections. What I would do is section
> align end_pfn, and then check the next range from the zone. If the
> start_pfn of the next range is less than end_pfn you merge the two
> ranges by just increasing the size, otherwise you could start a new
> range.
>
> The idea is that you just want to define what the valid range of PFNs
> are, and if there are sizable holes you skip over them. You would
> leave most of the lifting for identifying exactly what PFNs to
> initialize to the pfn_range_in_zone iterators since they would all be
> read-only accesses anyway.

Ok, I follow you.  My assumption is that there are generally few free pfn
ranges relative to the total number of pfns being initialized, so it's
efficient to parallelize over a single pfn range from the zone iterator at a
time.  On the systems I tested, there were about 20 tiny ranges and one
enormous range per node, so firing off a job per range kept things simple
without affecting performance.  If that assumption holds, I'm not sure it's
worth it to merge ranges.
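
For reference, the merging I picture from your description is roughly the
below (just a sketch built on the existing
for_each_free_mem_pfn_range_in_zone() iterator; struct pfn_span and
submit_job() are invented names standing in for however the jobs actually get
queued):

struct pfn_span {
        unsigned long start_pfn;
        unsigned long nr_pages;
};

static void __init queue_deferred_ranges(struct zone *zone)
{
        struct pfn_span cur = { .nr_pages = 0 };
        unsigned long spfn, epfn;
        u64 i;

        for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
                /* Keep each job's end section aligned. */
                unsigned long end = ALIGN(epfn, PAGES_PER_SECTION);

                if (cur.nr_pages &&
                    spfn < cur.start_pfn + cur.nr_pages) {
                        /* Next range starts in a section we already cover. */
                        cur.nr_pages = end - cur.start_pfn;
                        continue;
                }

                /* Otherwise submit what we have and start a new range. */
                if (cur.nr_pages)
                        submit_job(zone, cur.start_pfn, cur.nr_pages);

                cur.start_pfn = spfn;
                cur.nr_pages = end - cur.start_pfn;
        }

        if (cur.nr_pages)
                submit_job(zone, cur.start_pfn, cur.nr_pages);
}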

With the series as it stands, plus the section alignment check left in
deferred_grow_zone so it doesn't stop mid-max-order block (I think that check
could be relaxed to a maxorder alignment check), threads simply deal with a
start/end range and deferred_init_maxorder becomes shorter and simpler too.
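
By relaxing the check I mean something along these lines in
deferred_grow_zone()'s loop, which today only stops along section boundaries
(sketch against my reading of the current loop, not a tested diff):

        while (spfn < epfn) {
                /* Remember where this block of work began. */
                first_deferred_pfn = spfn;

                nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
                touch_nmi_watchdog();

                /*
                 * Stopping on a max-order boundary rather than a section
                 * boundary (PAGES_PER_SECTION) would be enough to avoid
                 * stopping in the middle of a max-order block.
                 */
                if ((first_deferred_pfn ^ spfn) < MAX_ORDER_NR_PAGES)
                        continue;

                /* Stop once we've initialized what the caller needs. */
                if (nr_pages >= nr_pages_needed)
                        break;
        }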