Re: [PATCH] memcg, thp: do not invoke oom killer on thp charges

From: David Rientjes
Date: Thu Mar 22 2018 - 16:29:46 EST


On Thu, 22 Mar 2018, Michal Hocko wrote:

> > So now you're making a generalized policy change to the memcg charge path
> > to fix what is obviously only thp and caused by removing the __GFP_NORETRY
> > from thp allocations in commit 2516035499b9?
>
> Yes, because relying on __GFP_NORETRY for the oom handling has proven to
> be subtle and error prone. And as I've repeated few times already there
> is _no_ reason why the oom policy for the memcg charge should be any
> different from the allocator's one.
>

The PAGE_ALLOC_COSTLY_ORDER oom heuristic in the page allocator is about
the unlikelihood of freeing contiguous memory. It was added around the
same time I added the oom heuristic to prevent killing processes for
lowmem because of the unlikelihood of freeing lowmem.

If lowmem is accounted to a memcg, would you avoid oom kill in that
scenario too, just for the sake of matching the page allocator? Of course
not.

The argument is absurd, I'm sorry. Charging 32KB of memory allows the
memcg oom killer but magically 64KB of memory automatically fails and the
charging path has no ability to override that because you've made it part
of the charge path. There is no __GFP_RETRY. Making blanket claims that
high-order charges should never oom kill and that you know better than the
caller, and that you don't think the caller knows what they are doing wrt
__GFP_NORETRY, is more dangerous imo than the potential benefit from what
you are proposing.

> They simply cannot because kmalloc performs the change under the cover.
> So you would have to use kmalloc(gfp|__GFP_NORETRY) to be absolutely
> sure to not trigger _any_ oom killer. This is just wrong thing to do.
>

Examples of where this isn't already done? It certainly wasn't a problem
before __GFP_NORETRY was dropped in commit 2516035499b9 but you suspect
it's a problem now.

> > You're diverging from it because the memcg charge path has never had this
> > heuristic.
>
> Which is arguably a bug which just didn't matter because we do not
> have costly order oom eligible charges in general and THP was subtly
> different and turned out to be error prone.
>

It was inadvertently dropped from commit 2516035499b9. There were no
high-order charge oom kill problems before this commit. People know how
to use __GFP_NORETRY or leave it off, which you don't trust them to do
because you're hardcoding a heuristic in the charge path. You also acked
the commit that introduced this "error prone" problem. Before you start
to advertise that you know better than what previously worked just fine,
let's fix the issue that was introduced by that commit and then you can
propose a follow-up patch that changes the charge heuristic and it can
stand on its own merits.

> > I'm somewhat stunned this has to be repeated:
> > PAGE_ALLOC_COSTLY_ORDER is about the ability to allocate _contiguous_
> > memory, it's not about the _amount_ of memory. Changing the memcg charge
> > path to factor order into oom kill decisions is new, and should be
> > proposed as a follow-up patch to my bug fix to describe what else is being
> > impacted by your patch and what is fixed by it.
> >
> > Yours is a heuristic change, mine is a bug fix.
>
> Nobody is really arguing about this. I have just pointed out my
> reservation that your bug fix is adding more special casing and a more
> generic solution is due.

Dude, it's setting a bit that the problem commit dropped. That's it. I'm
setting a bit.

> If you absolutely believe that your bugfix is
> so important to make it to rc7 I will not object. It is however strange
> that we haven't seen a _single_ bug report in last two years about this
> being a problem. So I am not really sure the urgency is due.
>

You're not really sure about the urgency but you were "tempted to mark
this for stable" for your heuristic "fix"?

> > Your change is broken and I wouldn't push it to Linus for rc7 if my life
> > depended on it. What is the response when someone complains that they
> > start getting a ton of MEMCG_OOM notifications for every thp fallback?
> > They will, because yours is a broken implementation.
>
> I fail to see what is broken. Could you be more specific?
>

I said MEMCG_OOM notifications on thp fallback. You modified
mem_cgroup_oom(). What is called before mem_cgroup_oom()?
mem_cgroup_event(mem_over_limit, MEMCG_OOM). That increments the
MEMCG_OOM event and anybody waiting on the events file gets notified it
changed. They read a MEMCG_OOM event. It's thp fallback, it's not memcg
oom.

Really, I can't continue to write 100 emails in this thread. I'm sorry,
but there are only so many hours in a day. I can't read 20 emails about
why Tetsuo shouldn't emit a stack trace when the oom reaper fails, which
happens 0.04% of the time on production workloads with data I provided. I
can't continue to reiterate why we added the PAGE_ALLOC_COSTLY_ORDER
heuristic to the allocator. I'm done.

> > Respectfully, allow the bugfix to fix what was obviously left off from
> > commit 2516035499b9.
>
> I haven't nacked the patch AFAIR so nothing really prevents it from
> being merged.
>

It should be merged for rc7. Please send any follow-up policy change wrt
to high-order charges as a separate patch for 4.17.