Re: [PATCH] mm: page_alloc: unreserve highatomic page blocks before oom

From: Charan Teja Kalla
Date: Tue Oct 31 2023 - 09:14:22 EST

Next message: Jason Gunthorpe: "[GIT PULL] Please pull IOMMUFD subsystem changes"
Previous message: Richard Weinberger: "Re: linux-next: manual merge of the mtd tree with the vfs-brauner tree"
In reply to: Michal Hocko: "Re: [PATCH] mm: page_alloc: unreserve highatomic page blocks before oom"
Next in thread: Michal Hocko: "Re: [PATCH] mm: page_alloc: unreserve highatomic page blocks before oom"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Thanks Michal/Pavan!!

On 10/31/2023 1:44 PM, Michal Hocko wrote:
> On Mon 30-10-23 18:09:50, Charan Teja Kalla wrote:
>> __alloc_pages_direct_reclaim() is called from slowpath allocation where
>> high atomic reserves can be unreserved after there is a progress in
>> reclaim and yet no suitable page is found. Later should_reclaim_retry()
>> gets called from slow path allocation to decide if the reclaim needs to
>> be retried before OOM kill path is taken.
>>
>> should_reclaim_retry() checks the available(reclaimable + free pages)
>> memory against the min wmark levels of a zone and returns:
>> a) true, if it is above the min wmark so that slow path allocation will
>> do the reclaim retries.
>> b) false, thus slowpath allocation takes oom kill path.
>>
>> should_reclaim_retry() can also unreserves the high atomic reserves
>> **but only after all the reclaim retries are exhausted.**
>>
>> In a case where there are almost none reclaimable memory and free pages
>> contains mostly the high atomic reserves but allocation context can't
>> use these high atomic reserves, makes the available memory below min
>> wmark levels hence false is returned from should_reclaim_retry() leading
>> the allocation request to take OOM kill path. This is an early oom kill
>> because high atomic reserves are holding lot of free memory and
>> unreserving of them is not attempted.
>
> OK, I see. So we do not release those reserved pages because OOM hits
> too early.
>
>> (early)OOM is encountered on a machine in the below state(excerpt from
>> the oom kill logs):
>> [ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
>> high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
>> active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
>> present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
>> local_pcp:492kB free_cma:0kB
>> [ 295.998656] lowmem_reserve[]: 0 32
>> [ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
>> 33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
>> 0*4096kB = 7752kB
>
> OK, this is quite interesting as well. The system is really tiny and 8MB
> of reserved memory is indeed really high. How come those reservations
> have grown that high?

Actually it is a VM running on the Linux kernel.

Regarding the reservations, I think it is because of the 'max_managed'
calculation below:
static void reserve_highatomic_pageblock(struct page *page, ....) {
	....
	/*
	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
	 * Check is race-prone but harmless.
	 */
	max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;

	if (zone->nr_reserved_highatomic >= max_managed)
		goto out;

	zone->nr_reserved_highatomic += pageblock_nr_pages;
	set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
	move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
out:
}

Since 1% of the zone's managed page count is always added on top of
pageblock_nr_pages, the limit is always larger than one pageblock. And as
'nr_reserved_highatomic' is incremented/decremented in pageblock-size
granules, the effective limit ends up being at least 2 pageblocks.

In my case that is 8M out of ~50M, i.e. about 16%, which is high.

If the below looks fine to you, I can raise this as a separate change:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2a2536d..41441ced 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1886,7 +1886,9 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
 	 * Check is race-prone but harmless.
 	 */
-	max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
+	max_managed = max_t(unsigned long,
+			ALIGN(zone_managed_pages(zone) / 100, pageblock_nr_pages),
+			pageblock_nr_pages);
 	if (zone->nr_reserved_highatomic >= max_managed)
 		return;

>>
>> Per above log, the free memory of ~7MB exist in the high atomic
>> reserves is not freed up before falling back to oom kill path.
>>
>> This fix includes unreserving these atomic reserves in the OOM path
>> before going for a kill. The side effect of unreserving in oom kill path
>> is that these free pages are checked against the high wmark. If
>> unreserved from should_reclaim_retry()/__alloc_pages_direct_reclaim(),
>> they are checked against the min wmark levels.
>
> I do not like the fix much TBH. I think the logic should live in

Yeah, this code looks much cleaner to me. Let me know if I can raise a
V2 with the below, with you as Suggested-by.

I think another thing the system is missing here is draining the pcp lists:
min:804kB low:1004kB high:1204kB free_pcp:688kB

IIUC, drain_all_pages() is called in the reclaim path as below. In this
case, when did_some_progress = 0, the pcp drain is skipped as well.
struct page *__alloc_pages_direct_reclaim() {
	.....
	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
	if (unlikely(!(*did_some_progress)))
		goto out;
retry:
	page = get_page_from_freelist();
	if (!page && !drained) {
		drain_all_pages(NULL);
		drained = true;
		goto retry;
	}
out:
}

So, how about extending your code below for this case? Assuming that
did_some_progress > 0 means the draining has perhaps already been done in
__alloc_pages_direct_reclaim(), thus:
out:
	if (!ret) {
		ret = unreserve_highatomic_pageblock(ac, true);
		drain_all_pages(NULL);
	}
	return ret;

Please let me know if the above doesn't make sense. If it looks good, I
will raise a separate patch for this condition.
> should_reclaim_retry. One way to approach it is to unreserve at the end
> of the function, something like this:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 95546f376302..d04e14adf2c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3813,10 +3813,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> * Make sure we converge to OOM if we cannot make any progress
> * several times in the row.
> */
> - if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
> - /* Before OOM, exhaust highatomic_reserve */
> - return unreserve_highatomic_pageblock(ac, true);
> - }
> + if (*no_progress_loops > MAX_RECLAIM_RETRIES)
> + goto out;
>
> /*
> * Keep reclaiming pages while there is a chance this will lead
> @@ -3859,6 +3857,12 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> schedule_timeout_uninterruptible(1);
> else
> cond_resched();
> +
> +out:
> + /* Before OOM, exhaust highatomic_reserve */
> + if (!ret)
> + return unreserve_highatomic_pageblock(ac, true);
> +
> return ret;
> }
>
