Re: [RFC 1/3] oom, sysrq: Skip over oom victims and killed tasks

From: Michal Hocko
Date: Wed Jan 20 2016 - 04:49:57 EST


On Tue 19-01-16 14:57:33, David Rientjes wrote:
> On Fri, 15 Jan 2016, Michal Hocko wrote:
>
> > > I think it's time to kill sysrq+F and I'll send those two patches
> > > unless there is a usecase I'm not aware of.
> >
> > I have described one in the part you haven't quoted here. Let me repeat:
> > : Your system might be thrashing to the point you are not able to log in
> > : and resolve the situation in a reasonable time yet you are still not
> > : OOM. sysrq+f is your only choice then.
> >
> > Could you clarify why it is better to ditch a potentially useful
> > emergency tool rather than to make it work reliably and predictably?
>
> I'm concerned about your usecase where the kernel requires admin
> intervention to resolve such an issue and there is nothing in the VM we
> can do to fix it.
>
> If you have a specific test that demonstrates when your usecase is needed,
> please provide it so we can address the issue that it triggers.

No, I do not have a specific load in mind. But let's be realistic. There
will _always_ be corner cases where the VM cannot react properly or in a
timely fashion.

> I'd prefer to fix the issue in the VM rather than require human
> intervention, especially when we try to keep a very large number of
> machines running in our datacenters.

It is always preferable to resolve the mm related issue automagically,
of course. We should strive for robustness as much as possible but that
doesn't mean we should take the only emergency tool out of the
administrator's hands.

To be honest I really fail to understand your line of argumentation
here. Just because you think that sysrq+f might not be helpful in the
large datacenters which you seem to care about doesn't mean that it is
not helpful in other setups.

Removing the functionality is out of the question IMHO, so can we please
start discussing how to make it more predictable?
--
Michal Hocko
SUSE Labs