Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node

From: Michal Hocko
Date: Fri Nov 06 2020 - 03:10:30 EST


On Fri 06-11-20 15:06:56, Feng Tang wrote:
> On Thu, Nov 05, 2020 at 05:16:12PM +0100, Michal Hocko wrote:
> > On Thu 05-11-20 21:43:05, Feng Tang wrote:
> > > On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> > > > On Thu 05-11-20 21:07:10, Feng Tang wrote:
> > > > [...]
> > > > > My debug traces show it is, and its gfp_mask is 'GFP_KERNEL'
> > > >
> > > > Can you provide the full information please? Which node has been
> > > > requested? Which cpuset does the calling process run in, and which
> > > > node did the allocation succeed from? A bare dump_stack without any
> > > > further context is not really helpful.
> > >
> > > I don't have the same platform as the original report, so I simulated
> > > a similar setup (with fakenuma and movablecore) which has 2 memory
> > > nodes: node 0 has DMA/DMA32/Movable zones, while node 1 has only a
> > > Movable zone. With it, I can get the same error and the same OOM
> > > callstack as the original report (as in the cover-letter).
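> > >
> > > (For reference, a layout like this can be set up by booting with
> > > something like "numa=fake=2 movablecore=2G"; the exact movablecore
> > > size here is illustrative, not necessarily the value used in this
> > > test.)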
> > >
> > > The test command is:
> > > # docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
> > >
> > > To debug, I only added some tracing in __alloc_pages_nodemask(), and
> > > here is the callstack which gets the page successfully:
> > >
> > > [ 567.510903] Call Trace:
> > > [ 567.510909] dump_stack+0x74/0x9a
> > > [ 567.510910] __alloc_pages_nodemask.cold+0x22/0xe5
> > > [ 567.510913] alloc_pages_current+0x87/0xe0
> > > [ 567.510914] __vmalloc_node_range+0x14c/0x240
> > > [ 567.510918] module_alloc+0x82/0xe0
> > > [ 567.510921] bpf_jit_alloc_exec+0xe/0x10
> > > [ 567.510922] bpf_jit_binary_alloc+0x7a/0x120
> > > [ 567.510925] bpf_int_jit_compile+0x145/0x424
> > > [ 567.510926] bpf_prog_select_runtime+0xac/0x130
> >
> > As already said, this doesn't really tell much without the additional
> > information.
> >
> > > The incoming nodemask parameter is NULL, so the function first tries
> > > the cpuset's nodemask (node 1 here), but the request's highest_zoneidx
> > > is capped at 2 (ZONE_NORMAL, since this is GFP_KERNEL), which makes the
> > > 'ac's preferred zoneref NULL. So it goes into __alloc_pages_slowpath(),
> > > for which the nodemask is first restored to NULL, and this time it gets
> > > a preferred zone: zone DMA32 on node 0, and the following
> > > get_page_from_freelist() allocates one page from that zone.
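> > >
> > > For reference, this matches the relevant parts of
> > > __alloc_pages_nodemask() (a condensed sketch paraphrased from
> > > mm/page_alloc.c around v5.9; the comments are mine and error handling
> > > is dropped):
> > >
> > > 	/* prepare_alloc_pages(): with no caller nodemask, the fast
> > > 	 * path falls back to the cpuset's mems_allowed (node 1 here)
> > > 	 */
> > > 	if (cpusets_enabled()) {
> > > 		*alloc_mask |= __GFP_HARDWALL;
> > > 		if (!in_interrupt() && !ac->nodemask)
> > > 			ac->nodemask = &cpuset_current_mems_allowed;
> > > 		else
> > > 			*alloc_flags |= ALLOC_CPUSET;
> > > 	}
> > >
> > > 	/* back in __alloc_pages_nodemask(): the caller's nodemask
> > > 	 * (NULL here) is restored before entering the slow path
> > > 	 */
> > > 	ac.nodemask = nodemask;
> > > 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);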
> >
> > I do not follow. Both hot and slow paths of the allocator set
> > ALLOC_CPUSET or emulate it by mems_allowed when cpusets are enabled
> > IIRC. This is later enforced in get_page_from_freelist. There are some
> > exceptions when the allocating process can run away from its cpusets -
> > e.g. IRQs, OOM victims and a few other cases, but definitely not a
> > random allocation. There might be some subtle details that have changed,
> > or that I might have forgotten, but...
>
> Yes, I was confused too. IIUC, the key check inside
> get_page_from_freelist() is
>
> 	if (cpusets_enabled() &&
> 		(alloc_flags & ALLOC_CPUSET) &&
> 		!__cpuset_zone_allowed(zone, gfp_mask))
>
> In our case (a kernel page got allocated), the first 2 conditions are
> true, and the one place where __cpuset_zone_allowed() can still return
> true is the check against the nearest hardwall ancestor cpuset's nodemask:
>
> 	cs = nearest_hardwall_ancestor(task_cs(current));
> 	allowed = node_isset(node, cs->mems_allowed);
>
> This effectively overrides the ALLOC_CPUSET check.

Yes, and this is OK because that is the defined hierarchical semantics of
cpusets, which applies to any !hardwalled allocation. Cpusets are quite
non-intuitive. Re-reading the previous discussion, I have realized that my
trying not to go into those details might have misled you. Let me try
again and clarify that now.
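
For the record, this is where that semantic lives, in
__cpuset_node_allowed() (a condensed sketch from kernel/cgroup/cpuset.c;
locking and some details dropped):

	bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
	{
		struct cpuset *cs;

		if (in_interrupt())
			return true;
		if (node_isset(node, current->mems_allowed))
			return true;
		/* OOM victims may allocate anywhere */
		if (unlikely(tsk_is_oom_victim(current)))
			return true;
		/* hardwalled requests stop at the cpuset boundary */
		if (gfp_mask & __GFP_HARDWALL)
			return false;
		/* a dying task may get memory anywhere */
		if (current->flags & PF_EXITING)
			return true;

		/* !hardwall: fall back to the nearest hardwalled ancestor */
		cs = nearest_hardwall_ancestor(task_cs(current));
		return node_isset(node, cs->mems_allowed);
	}

The last two lines are exactly the check you quoted above.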

I was talking in the context of the patch you are proposing, and that is a
clear violation of the cpuset isolation, especially for hardwalled setups,
because it allows allocations to spill over to other nodes. That shouldn't
be possible except for a few exceptions which shouldn't generate a lot of
allocations (e.g. an OOM victim exiting, IRQ context).

What I was not talking about, and should have been clearer about, is that
without hardwalled or exclusive nodes the isolation is best effort only
for most kernel allocation requests (or more specifically, those without
__GFP_HARDWALL). Your patch doesn't distinguish between those and other
non-movable allocations, and effectively allows a runaway even for
hardwalled allocations which are not movable. Those can be triggered by
userspace very easily.
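
To illustrate the difference, the hardwall bit is what separates the
user-facing gfp flavors from GFP_KERNEL (simplified from
include/linux/gfp.h):

	#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
	#define GFP_USER	(GFP_KERNEL | __GFP_HARDWALL)
	#define GFP_HIGHUSER_MOVABLE	(GFP_USER | __GFP_HIGHMEM | __GFP_MOVABLE)

User visible allocations are both hardwalled and movable; kernel internal
ones like the GFP_KERNEL allocation in your trace are neither.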

I hope this clarifies it a bit more, and sorry if I misled you.
--
Michal Hocko
SUSE Labs