Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon

From: Vlastimil Babka
Date: Wed Apr 29 2020 - 03:51:44 EST

Next message: Lee Jones: "Re: [PATCH v1 1/2] dt-bindings: mfd: Document QTI I2C PMIC controller"
Previous message: SeongJae Park: "Re: Re: [PATCH v9 00/15] Introduce Data Access MONitor (DAMON)"
In reply to: Tetsuo Handa: "Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon"
Next in thread: Michal Hocko: "Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start failing soon"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 4/28/20 11:48 PM, David Rientjes wrote:
> On Tue, 28 Apr 2020, Vlastimil Babka wrote:
>
> Yes, order-0 reclaim capture is interesting since the issue being reported
> here is userspace going out to lunch because it loops for an unbounded
> amount of time trying to get above a watermark where it's allowed to
> allocate and other consumers are depleting that resource.
>
> We actually prefer to oom kill earlier rather than being put in a
> perpetual state of aggressive reclaim that affects all allocators and the
> unbounded nature of those allocations leads to very poor results for
> everybody.

Sure. My vague impression is that your kind of workloads (and those of similar
cloud companies) are designed to maximize machine utilization, so overshooting
and killing something as a result is no big deal. Then you are perhaps more
likely to hit this state, and on the other hand, even an occasional premature
oom kill is not a big deal?

My concern is workloads not designed in such a way, where a premature oom kill
due to temporarily higher reclaim activity together with a burst of incoming
network packets will result in e.g. killing an important database. There, the
tradeoff looks different.

> I'm happy to scope this solely to an order-0 reclaim capture. I'm not
> sure if I'm clear on whether this has been worked on before and patches
> existed in the past?

Andrew mentioned some. I don't recall any, so it might have been before my time.

> Somewhat related to what I described in the changelog: we lost the "page
> allocation stalls" artifacts in the kernel log for 4.15. The commit
> description references an asynchronous mechanism for getting this
> information; I don't know where this mechanism currently lives.
>
