Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

From: Christoph Lameter
Date: Thu May 18 2017 - 13:07:32 EST


On Thu, 18 May 2017, Vlastimil Babka wrote:

> > The race is where? If you expand the node set during the move of the
> > application then you are safe in terms of the legacy apps that did not
> > include static bindings.
>
> No, that expand/shrink by itself doesn't work against parallel

Parallel? I think we are clear that ithis is inherently racy against the
app changing policies etc etc? There is a huge issue there already. The
app needs to be well behaved in some heretofore undefined way in order to
make moves clean.

> get_page_from_freelist going through a zonelist. Moving from node 0 to
> 1, with zonelist containing nodes 1 and 0 in that order:
>
> - mempolicy mask is 0
> - zonelist iteration checks node 1, it's not allowed, skip

There is an allocation from node 1? This is not allowed before the move.
So it should fail. Not skipping to another node.

> - mempolicy mask is 0,1 (expand)
> - mempolicy mask is 1 (shrink)
> - zonelist iteration checks node 0, it's not allowed, skip
> - OOM

Are you talking about a race here between zonelist scanning and the
moving? That has been there forever.

And frankly there are gazillions of these races. The best thing to do is
to get the cpuset moving logic out of the kernel and into user space.

Understand that this is a heuristic and maybe come up with a list of
restrictions that make an app safe. An safe app that can be moved must f.e

1. Not allocate new memory while its being moved
2. Not change memory policies after its initialization and while its being
moved.
3. Not save memory policy state in some variable (because the logic to
translate the memory policies for the new context cannot find it).

...

Again cpuset process migration is a huge mess that you do not want to
have in the kernel and AFAICT this is a corner case with difficult
semantics. Better have that in user space...