Re: How to make warn_alloc() reliable?

From: Tetsuo Handa
Date: Thu Oct 20 2016 - 08:08:41 EST

Michal Hocko wrote:
> On Wed 19-10-16 20:27:53, Tetsuo Handa wrote:
> [...]
> > What I'm talking about is "why don't you stop playing whack-a-mole games
> > with missing warn_alloc() calls". I don't blame you for not having a good
> > idea, but I blame you for not having a reliable warn_alloc() mechanism.
> Look, it seems pretty clear that our priorities and views are quite
> different. While I believe that we should solve real issues in a
> reliable and robust way you seem to love to have as much reporting as
> possible. I do agree that reporting is important part of debugging of
> problems but as your previous attempts for the allocation watchdog show
> a proper and bullet proof reporting requires state tracking and is in
> general too complex for something that doesn't happen in most properly
> configured systems. Maybe there are other ways but my time is better
> spent on something more useful - like making the direct reclaim path
> more deterministic without any unbound loops.

Properly configured systems should not be bothered by low memory situations.
There are systems which are bothered by low memory situations. It is pointless
to refer to "properly configured systems" as a reason not to add a watchdog.
It is administrators who decide whether to use a watchdog.

> So let's agree to disagree about importance of the reliability
> warn_alloc. I see it as an improvement which doesn't really have to be
> perfect.

I don't expect kmallocwd alone to be perfect. I expect kmallocwd to serve
as a hook. For example, it will be possible to turn on collecting perf data
when kmallocwd finds a stalling thread and turn it off when kmallocwd finds
none. Since the necessary information is stored in the task struct, it will
be easy to include it in perf data. Likewise, it will be easy to
extract it using a script for /usr/bin/crash when an administrator
captures a vmcore image of a stalling KVM guest. Sending vmcore images
to support centers is difficult due to file size and security reasons.
It would be nice if we could get a clue just by reading the task list.
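
To illustrate the hook idea, here is a rough userspace sketch. It is not
taken from the kmallocwd patch: struct task_info, struct watchdog,
watchdog_scan() and example_hook() are all hypothetical names, and the
per-task state is simplified to a flag plus a start tick. The sketch only
shows that a periodic scan over per-task state can call a hook when
"some task is stalling" changes to "none is stalling" or back.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task record (assumption: stands in for state kept in the task struct). */
struct task_info {
	int pid;
	bool in_alloc;              /* currently inside an allocation request */
	unsigned long alloc_start;  /* tick at which that request started */
};

/* Hypothetical hook type: called only when the stalling state changes. */
typedef void (*stall_hook_t)(bool stalling);

struct watchdog {
	unsigned long timeout;  /* ticks before a request counts as stalled */
	bool was_stalling;      /* result of the previous scan */
	stall_hook_t hook;      /* e.g. start or stop collecting perf data */
};

/* Example hook that merely records what a real hook would act on. */
static bool perf_enabled;
static int hook_transitions;

static void example_hook(bool stalling)
{
	perf_enabled = stalling;
	hook_transitions++;
}

/* Scan the task list once and return the number of stalled tasks. */
static size_t watchdog_scan(struct watchdog *wd, const struct task_info *tasks,
			    size_t ntasks, unsigned long now)
{
	size_t stalled = 0;

	for (size_t i = 0; i < ntasks; i++) {
		if (tasks[i].in_alloc &&
		    now - tasks[i].alloc_start >= wd->timeout)
			stalled++;
	}

	bool stalling = stalled > 0;

	/* Fire the hook on transitions only, not on every scan. */
	if (stalling != wd->was_stalling) {
		wd->was_stalling = stalling;
		if (wd->hook)
			wd->hook(stalling);
	}
	return stalled;
}
```

Calling the hook only on transitions is a choice made for this sketch; it
keeps an expensive action such as perf collection from being restarted on
every scan while a stall persists.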

But warn_alloc() can't serve as a hook. I see kmallocwd as an improvement
which doesn't really have to be perfect.

By the way, regarding "making the direct reclaim path more deterministic"
part, I wish that we could

(1) introduce phased watermarks which vary based on the stage of the
reclaim operation (e.g. watermark_lower()/watermark_higher(), which
resemble preempt_disable()/preempt_enable() but are propagated to other
threads when delegating operations needed for reclaim to them).

(2) introduce dedicated kernel threads which each perform only a specific
reclaim operation, using the watermark propagated from other threads
which perform a different reclaim operation.

(3) remove direct reclaim, which bothers callers with managing the correct
GFP_NOIO / GFP_NOFS / GFP_KERNEL distinction. Then, normal
___GFP_DIRECT_RECLAIM callers can simply wait with
wait_event(get_pages_from_freelist() succeeds) rather than polling
with complicated short sleeps. This will save significant CPU
resources (especially when oom_lock is held) which are currently wasted
by multiple concurrent direct reclaim activities.
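
One possible reading of (1) can be sketched in userspace C as follows.
Only the names watermark_lower()/watermark_higher() come from the list
above. Everything else is an assumption made for illustration: struct
reclaim_ctx, watermark_propagate(), effective_watermark(), may_allocate()
and the WMARK_BASE / WMARK_STEP values are hypothetical, and the rule that
each nesting level unlocks a fixed amount of reserve is just one way the
phases could work.

```c
#include <stdbool.h>

/* Hypothetical per-thread reclaim context. */
struct reclaim_ctx {
	int wmark_depth;  /* nesting count, in the spirit of preempt_count */
};

#define WMARK_BASE 100UL  /* assumed normal watermark, in pages */
#define WMARK_STEP 25UL   /* assumed reserve unlocked per nesting level */

/* Enter a deeper reclaim stage: this context may dip further into reserves. */
static void watermark_lower(struct reclaim_ctx *ctx)
{
	ctx->wmark_depth++;
}

/* Leave the stage again; never drops below the outermost level. */
static void watermark_higher(struct reclaim_ctx *ctx)
{
	if (ctx->wmark_depth > 0)
		ctx->wmark_depth--;
}

/* Hand the current stage to a worker when delegating reclaim work to it. */
static void watermark_propagate(const struct reclaim_ctx *from,
				struct reclaim_ctx *to)
{
	to->wmark_depth = from->wmark_depth;
}

/* Watermark that applies to this context, clamped at zero. */
static unsigned long effective_watermark(const struct reclaim_ctx *ctx)
{
	unsigned long cut = (unsigned long)ctx->wmark_depth * WMARK_STEP;

	return cut >= WMARK_BASE ? 0 : WMARK_BASE - cut;
}

/* An allocation may proceed only while free pages exceed the watermark. */
static bool may_allocate(const struct reclaim_ctx *ctx,
			 unsigned long free_pages)
{
	return free_pages > effective_watermark(ctx);
}
```

In this sketch a worker that received a lowered watermark keeps it until
it calls watermark_higher() itself, even after the delegating thread has
restored its own level. Whether the real mechanism should behave that way
is left open here.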