Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists

From: Michal Hocko
Date: Fri Jul 14 2017 - 07:45:24 EST


On Fri 14-07-17 13:43:21, Michal Hocko wrote:
> On Fri 14-07-17 13:29:14, Vlastimil Babka wrote:
> > On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@xxxxxxxx>
> > >
> > > build_all_zonelists has been (ab)using stop_machine to make sure that
> > > zonelists do not change while somebody is looking at them. This is
> > > just a gross hack because a) it complicates the context from which
> > > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > > switch locking to a percpu rwsem")) and b) it is not really necessary
> > > especially after "mm, page_alloc: simplify zonelist initialization".
> > >
> > > Updates of the zonelists happen very seldom, basically only when a zone
> > > becomes populated during memory online or when it loses all the memory
> > > during offline. A racing iteration over zonelists could either miss a
> > > zone or try to work on one zone twice. Both of these are something we
> > > can live with occasionally because there will always be at least one
> > > zone visible so we are not likely to fail allocation too easily for
> > > example.
> >
> > Given the experience with cpusets and mempolicies, I would rather
> > avoid the risk of allocation not seeing the only zone(s) that are
> > allowed by its nodemask, and triggering premature OOM.
>
> I would argue those are a different beast because they are directly
> under the control of a not fully privileged user and change between
> the empty nodemask and cpusets very often. For this one to trigger we
> would have to online/offline the last memory block in the zone very
> often and that doesn't resemble a sensible use case even remotely.
>
> > So maybe the
> > updates could be done in a way to avoid that, e.g. first append a copy
> > of the old zonelist to the end, then overwrite and terminate with NULL.
> > But if this requires any barriers or something similar on the iteration
> > site, which is performance critical, then it's bad.
> > Maybe a seqcount, that the iteration side only starts checking in the
> > slowpath? Like we have with cpusets now.
> > I know that Mel noted that stop_machine() also never had such guarantees
> > to prevent this, but it could have made the chances smaller.
>
> I think we can come up with some scheme but is this really worth it
> considering how unlikely the whole thing is? Well, if somebody hits a
> premature OOM killer or allocation failures it would have to be along
> with heavy memory hotplug operations and then it would be quite easy
> to spot what is going on and try to fix it. I would rather not
> overcomplicate it, to be honest.

And one more thing, which Mel has already brought up in his response:
stop_machine is of very roughly the same strength wrt. a double zone
visit or a missed zone, because we do not restart zonelist iteration.
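
For illustration only, the seqcount idea Vlastimil mentions above could be
modelled in userspace roughly as follows. This is a minimal single-threaded
sketch, not kernel code: struct zonelist_model and the zonelist_* helpers
are hypothetical names made up for this example, zones are plain strings,
and the memory barriers a real implementation would need (the very cost
on the iteration side that the quoted mail worries about) are left out.

```c
#include <stddef.h>

#define MAX_ZONES 4

/* Hypothetical stand-in for a zonelist: a NULL-terminated array of names. */
struct zonelist_model {
	unsigned int seq;	/* even: stable, odd: rebuild in progress */
	const char *zones[MAX_ZONES + 1];
};

/* Writer side: bump the counter before and after the rebuild. */
static void zonelist_rebuild(struct zonelist_model *zl,
			     const char *const *new_zones, size_t n)
{
	size_t i;

	zl->seq++;		/* odd: readers must not trust the list */
	for (i = 0; i < n && i < MAX_ZONES; i++)
		zl->zones[i] = new_zones[i];
	zl->zones[i] = NULL;	/* keep the list NULL-terminated */
	zl->seq++;		/* even again: the list is stable */
}

/* Reader side: sample the counter before walking the list. */
static unsigned int zonelist_read_begin(const struct zonelist_model *zl)
{
	return zl->seq;
}

/*
 * Reader side, slowpath only: nonzero if the walk that started at
 * 'seq' may have raced with a rebuild and should be restarted.
 */
static int zonelist_read_retry(const struct zonelist_model *zl,
			       unsigned int seq)
{
	return (seq & 1) || zl->seq != seq;
}

/* The walk itself stays a plain loop with no counter checks. */
static size_t zonelist_count(const struct zonelist_model *zl)
{
	size_t n = 0;

	while (zl->zones[n])
		n++;
	return n;
}
```

In this model the fast path is just zonelist_count(); only a caller that
is about to give up (e.g. before declaring OOM) would call
zonelist_read_retry() and restart the walk if it returns nonzero.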
--
Michal Hocko
SUSE Labs