Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node

From: Vlastimil Babka
Date: Thu Nov 05 2020 - 08:14:29 EST


On 11/5/20 1:58 PM, Michal Hocko wrote:
> On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
>> On 11/5/20 1:08 PM, Michal Hocko wrote:
>> > On Thu 05-11-20 09:40:28, Feng Tang wrote:
>> >
>> > > > Could you be more specific? This sounds like a bug. Allocations
>> > > > shouldn't spill over to a node which is not in the cpuset. There are few
>> > > > exceptions like IRQ context but that shouldn't happen regularly.
>> > >
>> > > I mean when the docker starts, it will spawn many processes which obey
>> > > the mem binding set, and they have some kernel page requests, which got
>> > > successfully allocated, like the following callstack:
>> > >
>> > > [ 567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G W I 5.9.0-rc8+ #6
>> > > [ 567.044956] Hardware name: /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
>> > > [ 567.044958] Call Trace:
>> > > [ 567.044972] dump_stack+0x74/0x9a
>> > > [ 567.044978] __alloc_pages_nodemask.cold+0x22/0xe5
>> > > [ 567.044986] alloc_pages_current+0x87/0xe0
>> > > [ 567.044991] allocate_slab+0x2e5/0x4f0
>> > > [ 567.044996] ___slab_alloc+0x380/0x5d0
>> > > [ 567.045021] __slab_alloc+0x20/0x40
>> > > [ 567.045025] kmem_cache_alloc+0x2a0/0x2e0
>> > > [ 567.045033] mqueue_alloc_inode+0x1a/0x30
>> > > [ 567.045041] alloc_inode+0x22/0xa0
>> > > [ 567.045045] new_inode_pseudo+0x12/0x60
>> > > [ 567.045049] new_inode+0x17/0x30
>> > > [ 567.045052] mqueue_get_inode+0x45/0x3b0
>> > > [ 567.045060] mqueue_fill_super+0x41/0x70
>> > > [ 567.045067] vfs_get_super+0x7f/0x100
>> > > [ 567.045074] get_tree_keyed+0x1d/0x20
>> > > [ 567.045080] mqueue_get_tree+0x1c/0x20
>> > > [ 567.045086] vfs_get_tree+0x2a/0xc0
>> > > [ 567.045092] fc_mount+0x13/0x50
>> > > [ 567.045099] mq_create_mount+0x92/0xe0
>> > > [ 567.045102] mq_init_ns+0x3b/0x50
>> > > [ 567.045106] copy_ipcs+0x10a/0x1b0
>> > > [ 567.045113] create_new_namespaces+0xa6/0x2b0
>> > > [ 567.045118] unshare_nsproxy_namespaces+0x5a/0xb0
>> > > [ 567.045124] ksys_unshare+0x19f/0x360
>> > > [ 567.045129] __x64_sys_unshare+0x12/0x20
>> > > [ 567.045135] do_syscall_64+0x38/0x90
>> > > [ 567.045143] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> > >
>> > > For it, the __alloc_pages_nodemask() will first try process's target
>> > > nodemask(unmovable node here), and there is no available zone, so it
>> > > goes with the NULL nodemask, and get a page in the slowpath.
>> >
>> > OK, I see your point now. I was not aware of the slab allocator not
>> > following cpusets. Sounds like a bug to me.

>> SLAB and SLUB seem to not care about cpusets in the fast path.
>
> Is a fallback to a different node which is outside of the cpuset
> possible?

AFAICS anything in the per-cpu cache will be allocated without looking at the cpuset, so it can be outside of the cpuset. In the SLUB slowpath, get_partial_node(), which looks for a fallback on the same node, will also not look at the cpuset. get_any_partial(), which looks for a fallback allocation on any node, does check cpuset_zone_allowed() and obeys it strictly. A fallback to the page allocator will obey whatever the page allocator obeys.

So if a process is not allowed to allocate from node X via cpuset *and* also cannot be executed on CPUs of node X via taskset, then AFAICS it effectively cannot violate the cpuset in SLUB, because it won't reach the percpu or per-node caches that don't check cpusets.
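
To make the lookup order above concrete, here is a toy userspace model, not kernel code. Everything prefixed toy_/TOY_ is hypothetical, the 4-CPU/2-node topology is made up, and it assumes each per-cpu slab is described only by the node backing it. It only mirrors which steps consult the cpuset, as far as I read the code:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS            4
#define TOY_NR_NODES           2
#define TOY_NO_NODE           (-1)
#define TOY_TO_PAGE_ALLOCATOR (-2)

/* Hypothetical topology: CPUs 0-1 sit on node 0, CPUs 2-3 on node 1. */
static const int toy_cpu_to_node[TOY_NR_CPUS] = { 0, 0, 1, 1 };

struct toy_slub {
	int  cpu_slab_node[TOY_NR_CPUS];  /* node backing each per-cpu slab, or TOY_NO_NODE */
	bool node_partial[TOY_NR_NODES];  /* true if the node's partial list is non-empty */
};

struct toy_task {
	int      cpu;           /* CPU the task runs on (what taskset controls) */
	unsigned mems_allowed;  /* cpuset, as a bitmask of allowed nodes */
};

/* Stand-in for the check that cpuset_zone_allowed() performs. */
static bool toy_cpuset_allows(const struct toy_task *t, int node)
{
	return (t->mems_allowed >> node) & 1u;
}

/* Returns the node the object comes from, or TOY_TO_PAGE_ALLOCATOR. */
static int toy_slab_alloc(const struct toy_slub *s, const struct toy_task *t)
{
	int local = toy_cpu_to_node[t->cpu];
	int node;

	/* Per-cpu cache: used without consulting the cpuset. */
	if (s->cpu_slab_node[t->cpu] != TOY_NO_NODE)
		return s->cpu_slab_node[t->cpu];

	/* get_partial_node() on the local node: no cpuset check either. */
	if (s->node_partial[local])
		return local;

	/* get_any_partial(): other nodes, but only those the cpuset allows. */
	for (node = 0; node < TOY_NR_NODES; node++)
		if (node != local && s->node_partial[node] &&
		    toy_cpuset_allows(t, node))
			return node;

	/* Nothing usable is cached: the page allocator decides from here. */
	return TOY_TO_PAGE_ALLOCATOR;
}
```

In this model, a task whose cpuset allows only node 1 but which runs on CPU 0 gets node 0 memory from either of the first two steps. The same task running on CPU 2 gets node 1 memory or falls through to the page allocator, and never gets node 0 memory from SLUB itself.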

>> But this
>> stack shows that it went all the way to the page allocator, so the cpusets
>> should have been obeyed there at least.
>
> Looking closer what is this dump_stack saying actually?

Yes, is that a dump of a successful allocation (one that violates cpusets?) or of a failing one?