Possible fix approach
Cpuset updates will rebind the nodemasks only of those mempolicies that need it
wrt their relative nodes semantics (those created either with the
MPOL_F_RELATIVE_NODES flag, or with neither the RELATIVE nor the STATIC flag).
The others (created with the STATIC flag) we can leave untouched. For
mempolicies that we keep rebinding, adopt the approach of mbind(), which swaps
in an updated copy instead of making in-place changes. We can leave
get_page_from_freelist() as it is, and nodes will be filtered orthogonally by
the mempolicy nodemask and the cpuset check.
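
To illustrate the copy-and-swap idea, here is a minimal userspace sketch. It is
not kernel code: nodemask_t is reduced to a 64-bit mask, and struct policy,
current_policy, policy_rebind() and pick_node() are hypothetical names made up
for this example. Freeing the old copy safely (refcount or RCU in a real
implementation) is left to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel type: one bit per NUMA node. */
typedef uint64_t nodemask_t;

struct policy {
	nodemask_t nodes;	/* never modified after publication */
};

/* The pointer an allocating task would load once per allocation. */
static _Atomic(struct policy *) current_policy;

/*
 * Rebind by publishing a fresh copy instead of editing the mask in place.
 * On success, *old receives the previous policy (possibly NULL); this sketch
 * assumes the caller frees it once no reader can still hold it.
 */
static bool policy_rebind(nodemask_t new_nodes, struct policy **old)
{
	struct policy *copy = malloc(sizeof(*copy));

	if (!copy)
		return false;
	copy->nodes = new_nodes;
	*old = atomic_exchange(&current_policy, copy);
	return true;
}

/*
 * Node selection with two independent filters: the mempolicy snapshot taken
 * at the start of the allocation, and the cpuset mask checked per node.
 * Returns the first eligible node id, or -1 if there is none.
 */
static int pick_node(const struct policy *pol, nodemask_t mems_allowed)
{
	assert(pol != NULL);
	for (int nid = 0; nid < 64; nid++) {
		nodemask_t bit = (nodemask_t)1 << nid;

		if (!(pol->nodes & bit))
			continue;	/* rejected by the mempolicy */
		if (!(mems_allowed & bit))
			continue;	/* rejected by the cpuset check */
		return nid;
	}
	return -1;
}
```

A reader that loaded the pointer before a rebind keeps seeing the old mask for
the rest of its allocation attempt, which is the stability property the
approach is after.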

This will give us a stable nodemask throughout the whole allocation without the
need for an on-stack copy. The next question is what to do with
current->mems_allowed. Do we keep the parallel modifications with seqlock
protection, or e.g. try to go back to the synchronous copy approach?
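
For readers less familiar with the seqlock option, the read/retry pattern can be
sketched in userspace C11 as below. This is only an illustration under
simplifying assumptions (a single writer, sequentially consistent atomics, a
64-bit mask); struct task_mems and the mems_*() helpers are hypothetical names,
not the kernel's API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t nodemask_t;	/* hypothetical: one bit per node */

struct task_mems {
	atomic_uint seq;		/* even: stable, odd: update running */
	_Atomic nodemask_t allowed;	/* stands in for mems_allowed */
};

/* Start a read section: wait out any in-progress update, return the count. */
static unsigned int mems_read_begin(struct task_mems *t)
{
	unsigned int s;

	while ((s = atomic_load(&t->seq)) & 1)
		;	/* writer active, spin */
	return s;
}

/* End a read section: true means the mask changed and the caller must retry. */
static bool mems_read_retry(struct task_mems *t, unsigned int start)
{
	return atomic_load(&t->seq) != start;
}

/* Writer side (single writer assumed): bump, update, bump. */
static void mems_write(struct task_mems *t, nodemask_t new_allowed)
{
	atomic_fetch_add(&t->seq, 1);
	atomic_store(&t->allowed, new_allowed);
	atomic_fetch_add(&t->seq, 1);
}

/* Typical reader: repeat until a consistent snapshot was observed. */
static nodemask_t mems_snapshot(struct task_mems *t)
{
	unsigned int s;
	nodemask_t seen;

	do {
		s = mems_read_begin(t);
		seen = atomic_load(&t->allowed);
	} while (mems_read_retry(t, s));
	return seen;
}
```

The important property is that a reader only learns about a concurrent update
when it reaches its own retry check, which is what makes nested or separate
read scopes tricky.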

Related to that is a remaining corner case with alloc_pages_vma(), which has its
own seqlock-protected scope. There it calls policy_nodemask(), which might
detect that there's no intersection between the mempolicy and cpuset and return
a NULL nodemask. However, __alloc_pages_slowpath() has its own seqlock scope, so
if a modification to mems_allowed (resulting in no intersection with the
mempolicy) happens between the check in policy_nodemask() and reaching
__alloc_pages_slowpath(), the latter won't detect the modification and will
invoke OOM before it can return with a failed allocation to alloc_pages_vma()
and let it detect a seqlock update and retry. One solution, as shown in the RFC
patch [3], is to add another check for the cpuset/nodemask intersection before
OOM. That works, but it's a bit hacky and still produces an allocation failure
warning.
On the other hand, we might also want to make things more robust in general and
prevent spurious OOMs caused by no nodes being eligible for any other reason as
well, such as a buggy driver passing a wrong nodemask (which doesn't necessarily
come from a mempolicy).
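
The kind of check discussed above could look roughly like the following
userspace sketch. It is not the code from the RFC patch; the enum, the function
name slowpath_verdict() and the 64-bit nodemask_t are assumptions made for
illustration. A NULL nodemask is taken to mean "no nodemask restriction".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t nodemask_t;	/* hypothetical: one bit per node */

enum alloc_verdict {
	ALLOC_FAIL_NO_OOM,	/* no eligible node: fail and let the caller retry */
	ALLOC_INVOKE_OOM,	/* nodes are eligible but exhausted: OOM is justified */
};

/*
 * Decide, right before the OOM killer would be invoked, whether any node is
 * eligible at all. If the requested nodemask and the current cpuset mask do
 * not intersect, killing a task cannot help, so the allocation should fail
 * instead. This also covers an empty or wrong nodemask that did not come
 * from a mempolicy.
 */
static enum alloc_verdict slowpath_verdict(const nodemask_t *nodemask,
					   nodemask_t mems_allowed)
{
	if (nodemask != NULL && (*nodemask & mems_allowed) == 0)
		return ALLOC_FAIL_NO_OOM;
	return ALLOC_INVOKE_OOM;
}
```

In this sketch the failed allocation is returned to the caller, whose own
seqlock check can then notice the update and retry.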