Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

From: Michal Hocko
Date: Thu Sep 06 2018 - 01:58:00 EST


On Thu 06-09-18 10:00:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 05-09-18 22:53:33, Tetsuo Handa wrote:
> > > On 2018/09/05 22:40, Michal Hocko wrote:
> > > > Changelog said
> > > >
> > > > "Although this is possible in principle let's wait for it to actually
> > > > happen in real life before we make the locking more complex again."
> > > >
> > > > So what is the real life workload that hits it? The log you have pasted
> > > > below doesn't tell much.
> > >
> > > Nothing special. I just ran a multi-threaded memory eater on a CONFIG_PREEMPT=y kernel.
> >
> > I strongly suspect that your test doesn't really represent or simulate
> > any real and useful workload. Sure it triggers a rare race and we kill
> > another oom victim. Does this warrant making the code more complex?
> > Well, I am not convinced, as I've said countless times.
>
> Yes. Below is an example from a machine running Apache Web server/Tomcat AP server/PostgreSQL DB server.
> A memory eater needlessly killed Tomcat due to this race.

What prevents you from modifying your mem eater in a way that keeps Tomcat
or the others from being the primary oom victim choice? In other words,
yeah, it is not optimal to lose the race, but if it is rare enough then
this is something to live with because it can hardly be considered a
new DoS vector AFAICS. Remember that this is always going to be racy
land and we are not going to plug all possible races because this is
simply not viable. But I am pretty sure we have been through all this
many times already. Oh well...

> I assert that we should fix af5679fbc669f31f.

If you can come up with a reasonable patch which doesn't complicate the
code and is a clear win for both this particular workload and
others, then why not.
--
Michal Hocko
SUSE Labs