Re: [PATCH v6] mm: Add memory allocation watchdog kernel thread.

From: Tetsuo Handa
Date: Tue Feb 21 2017 - 21:12:33 EST


Johannes, any questions/comments on this patch?
I'd like to send to linux-next because this patch will be useful for
examining problems like "Bug 192981 - page allocation stalls". I think
that this patch will remain be useful as long as we use direct reclaim.

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 25-01-17 13:11:50, Johannes Weiner wrote:
> > [...]
> > > >From 6420cae52cac8167bd5fb19f45feed2d540bc11d Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > Date: Wed, 25 Jan 2017 12:57:20 -0500
> > > Subject: [PATCH] mm: page_alloc: __GFP_NOWARN shouldn't suppress stall
> > > warnings
> > >
> > > __GFP_NOWARN, which is usually added to avoid warnings from callsites
> > > that expect to fail and have fallbacks, currently also suppresses
> > > allocation stall warnings. These trigger when an allocation is stuck
> > > inside the allocator for 10 seconds or longer.
> > >
> > > But there is no class of allocations that can get legitimately stuck
> > > in the allocator for this long. This always indicates a problem.
> > >
> > > Always emit stall warnings. Restrict __GFP_NOWARN to alloc failures.
> >
> > Tetsuo has already suggested something like this and I didn't really
>
> Yes, I already suggested it at
> http://lkml.kernel.org/r/1484132120-35288-1-git-send-email-penguin-kernel@xxxxxxxxxxxxxxxxxxx .
>
> > like it because it makes the semantic of the flag confusing. The mask
> > says to not warn while the kernel log might contain an allocation splat.
> > You are right that stalling for 10s seconds means a problem on its own
> > but on the other hand I can imagine somebody might really want to have
> > clean logs and the last thing we want is to have another gfp flag for
> > that purpose.
>
> I agree with Johannes about that __GFP_NOWARN should not suppress allocation
> stall warnings. But having another gfp flag for that purpose is not useful.
> Given that someone really wants not to have allocation stall warnings in the
> kernel logs, where is the switch for enabling/disabling allocation stall
> warnings (because gfp flags are constant determined at build time)? We will
> need to have either a kernel command line option or a sysctl (or sysfs)
> variable. khungtaskd uses sysctl variables for those who really wants not
> to have TASK_UNINTERRUPTIBLE warnings; so does kmallocwd.
>
> >
> > I also do not think that this change would make a big difference because
> > most allocations simply use this flag along with __GFP_NORETRY or
> > GFP_NOWAIT resp GFP_ATOMIC. Have we ever seen a stall with this
> > allocation requests?
>
> You are totally ignoring what I explained in the "[PATCH] mm: Ignore
> __GFP_NOWARN when reporting stalls" thread shown above.
>
> Majority of __GFP_DIRECT_RECLAIM allocation requests are tolerable with
> allocation failure (and they will be willing to give up upon SIGKILL if
> they are from syscall) and do not need to alarm the admin to do any action.
> If they are not tolerable with allocation failure, they will add __GFP_NOFAIL.
>
> Apart from the reality that they are not tested well because they are
> currently protected by the "too small to fail" memory-allocation rule,
> they are ready to add __GFP_NOWARN. And current behavior (i.e. !costly
> __GFP_DIRECT_RECLAIM allocation requests won't fail unless __GFP_NORETRY
> is set or TIF_MEMDIE is set after SIGKILL was delivered) keeps them away
> from adding __GFP_NOFAIL.
>
> >
> > I haven't nacked Tetsuo's patch AFAIR and will not nack this one either
> > I just do not think we should tweak __GFP_NOWARN.
>
> Leaving this proposal as it is is counterproductive. I already said
>
> The discussion at this stage should not be "whether we need such
> watchdog and debugging code" but should be "how we can make the impact
> of watchdog and debugging code as small as possible".
>
> at http://lkml.kernel.org/r/1462630604-23410-1-git-send-email-penguin-kernel@xxxxxxxxxxxxxxxxxxx .
> And there had been no response.
>
> Johannes Weiner wrote:
> > On Sun, Nov 06, 2016 at 04:15:01PM +0900, Tetsuo Handa wrote:
> > > +- Why need to use it?
> > > +
> > > +Currently, when something went wrong inside memory allocation request,
> > > +the system might stall without any kernel messages.
> > > +
> > > +Although there is khungtaskd kernel thread as an asynchronous monitoring
> > > +approach, khungtaskd kernel thread is not always helpful because memory
> > > +allocating tasks unlikely sleep in uninterruptible state for
> > > +/proc/sys/kernel/hung_task_timeout_secs seconds.
> > > +
> > > +Although there is warn_alloc() as a synchronous monitoring approach
> > > +which emits
> > > +
> > > + "%s: page allocation stalls for %ums, order:%u, mode:%#x(%pGg)\n"
> > > +
> > > +line, warn_alloc() is not bullet proof because allocating tasks can get
> > > +stuck before calling warn_alloc() and/or allocating tasks are using
> > > +__GFP_NOWARN flag and/or such lines are suppressed by ratelimiting and/or
> > > +such lines are corrupted due to collisions.
> >
> > I'm not fully convinced by this explanation. Do you have a real life
> > example where the warn_alloc() stall info is not enough? If yes, this
> > should be included here and in the changelog. If not, the extra code,
> > the task_struct overhead etc. don't seem justified.
>
> warn_alloc() stall info cannot provide overall analyses. I said
>
> If you meant (b), it is because synchronous watchdog is not reliable and
> cannot provide enough diagnostic information. Since allocation livelock
> involves several threads due to dependency, it is important to take a
> snapshot of possibly relevant threads. By using asynchronous watchdog,
> we can not only take a snapshot but also take more actions for obtaining
> diagnostic information (e.g. enabling tracepoints when allocation stalls
> are detected).
>
> in the same thread shown above. For example, the cause of allocation stall
> might be due to out of idle workqueue thread; e.g. commit 373ccbe5927034b5
> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
> any progress"). Without reporting all possibly relevant threads, we might
> fail to obtain enough diagnostic information. Changing warn_alloc() to
> also report workqueues/kswapd/locks etc. will be noisy and incomplete
> because warn_alloc() cares only current thread.
>
> I welcome suggestions for "how we can make the impact of watchdog and
> debugging code as small as possible". But I don't have environment for
> evaluating the task_struct overhead. I wonder whether this overhead matters
> because this is allocation slowpath which will consume a lot of CPU cycles
> for scanning and/or sleep for many jiffies waiting for I/O. It will be far
> cheaper than keeping mm related tracepoints enabled until something happens
> which might be uptime of months. Sending the v6 patch to linux-next and
> start evaluating the overhead will be the way to avoid leaving this proposal
> as it is.
>