Re: [PATCH] mm/page_alloc: try oom if reclaim is unable to make forward progress

From: Michal Hocko
Date: Fri Mar 26 2021 - 11:36:56 EST


On Fri 26-03-21 11:22:54, Aaron Tomlin wrote:
[...]
> > Both reclaim and compaction maintain their own retries counters as they
> > are targeting a different operation. Although the compaction really
> > depends on the reclaim to do some progress.
>
> Yes. Looking at should_compact_retry() if the last known compaction result
> was skipped i.e. suggesting there was not enough order-0 pages to support
> compaction, so assistance is needed from reclaim
> (see __compaction_suitable()).
>
> I noticed that the value of compaction_retries, compact_result and
> compact_priority was 0, COMPACT_SKIPPED and 1 i.e. COMPACT_PRIO_SYNC_LIGHT,
> respectively.
>
> > OK, this sound unexpected as it indicates that the reclaim is able to
> > make a forward progress but compaction doesn't want to give up and keeps
> > retrying. Are you able to reproduce this or could you find out which
> > specific condition keeps compaction retrying? I would expect that it is
> > one of the 3 conditions before the max_retries is checked.
>
> Unfortunately, I have been told it is not entirely reproducible.
> I suspect it is the following in should_compact_retry() - as I indicated
> above the last known value stored in compaction_retries was 0:
>
>
> if (order > PAGE_ALLOC_COSTLY_ORDER)
> max_retries /= 4;
> if (*compaction_retries <= max_retries) {
> ret = true;
> goto out;
> }

OK, I kinda expected this would be not easily reproducible. The reason I
dislike your patch is that it addes yet another criterion for oom while
we already do have 2 which doesn't make the resulting code easier to
reason about. We should be focusing on the compaction retry logic and
see whether we can have some "run away" scenarios there. Seeing so many
retries without compaction bailing out sounds like a bug in that retry
logic. Vlastimil is much more familiar with that.
--
Michal Hocko
SUSE Labs