Re: [patch v3] mm, page_alloc: reintroduce page allocation stall warning
From: David Rientjes
Date: Tue Mar 31 2026 - 12:54:49 EST
On Mon, 30 Mar 2026, Shakeel Butt wrote:
> > Previously, we had warnings when a single page allocation took longer
> > than reasonably expected. This was introduced in commit 63f53dea0c98
> > ("mm: warn about allocations which stall for too long").
> >
> > The warning was subsequently reverted in commit 400e22499dd9 ("mm: don't
> > warn about allocations which stall for too long") because it was possible
> > to generate memory pressure that would effectively stall further progress
> > through printk execution.
> >
> > Page allocation stalls in excess of 10 seconds are always useful to debug
> > because they can result in severe userspace unresponsiveness. The
> > warning provides an artifact that can be correlated with userspace going
> > out to lunch and used to understand the state of memory at the time.
> >
> > There should be a reasonable expectation that this warning will never
> > trigger, given that it is very passive: it will only be emitted when a page
> > allocation takes longer than 10 seconds. If it does trigger, this
> > reveals an issue that should be fixed: a single page allocation should
> > never loop for more than 10 seconds without oom killing to make memory
> > available.
> >
> > Unlike the original implementation, this implementation reports stalls
> > at most once every 10 seconds system-wide. Otherwise, many concurrent
> > reclaimers could spam the kernel log unnecessarily. Stalls are only
> > reported when calling into direct reclaim.
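
The system-wide, once-per-10-seconds reporting described above could look roughly like the sketch below. This is not the code from the patch. It is a userspace C11 illustration under assumed names (stall_warn_should_report(), next_report_ms, STALL_THRESHOLD_MS and STALL_REPORT_INTERVAL_MS are all hypothetical), with millisecond timestamps passed in by the caller instead of being read from jiffies. An in-kernel version would more plausibly use jiffies with time_after(), or the existing ratelimit helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants: 10 s stall threshold, 10 s system-wide interval. */
#define STALL_THRESHOLD_MS       10000u
#define STALL_REPORT_INTERVAL_MS 10000u

/* Hypothetical global: earliest time the next system-wide report may fire. */
static _Atomic uint64_t next_report_ms = 0;

/*
 * Hypothetical helper, meant to be called from the direct reclaim path with
 * the time the allocation started and the current monotonic time (both in
 * ms). Returns true only for the single caller that wins the right to print
 * the stall warning for the current interval.
 */
bool stall_warn_should_report(uint64_t alloc_start_ms, uint64_t now_ms)
{
	uint64_t next;

	/* Monotonic time: "now" can never precede the allocation start. */
	assert(now_ms >= alloc_start_ms);

	/* Only stalls strictly longer than the threshold are reported. */
	if (now_ms - alloc_start_ms <= STALL_THRESHOLD_MS)
		return false;

	/* Someone already reported within the current interval: stay quiet. */
	next = atomic_load(&next_report_ms);
	if (now_ms < next)
		return false;

	/*
	 * Claim the report by pushing the deadline forward. If a concurrent
	 * reclaimer advanced next_report_ms first, the compare-exchange fails
	 * and this caller does not report.
	 */
	return atomic_compare_exchange_strong(&next_report_ms, &next,
					      now_ms + STALL_REPORT_INTERVAL_MS);
}
```

Because the shared deadline is advanced with a single compare-and-exchange, many reclaimers crossing the threshold at the same moment produce one warning; the others observe the new deadline and return false until the interval expires.
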
> >
> > Acked-by: Vlastimil Babka (SUSE) <vbabka@xxxxxxxxxx>
> > Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>
>
> Reviewed-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
>
> I am hoping that the reason you are reintroducing these warnings is
> because you already are seeing such cases in your production
> environment. Do you have anything interesting to share?
>
We don't have this patch in our production environment (yet). We've
been stress testing allocations for page faults with lots of concurrent
skb allocations that can keep us persistently below the per-zone min
watermarks, and we hope that this patch will shed some light on the
unresponsiveness issues we've encountered if/when they happen again.