Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality

From: Mike Kravetz
Date: Fri Sep 10 2021 - 20:11:34 EST


On 9/10/21 1:20 AM, Michal Hocko wrote:
> On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:
>>
>> I would say it's simply should_continue_reclaim() behaving similarly to
>> should_compact_retry(). We'll get compaction_suitable() returning
>> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
>> and compaction indicates there's not enough base pages to begin with to form
>> a high-order pages. Since the stolen pages will appear on inactive lru, it
>> seems to be worth continuing reclaim to make enough free base pages for
>> compaction to no longer be skipped, because "inactive_lru_pages >
>> pages_for_compaction" is true.
>>
>> So, both should_continue_reclaim() and should_compact_retry() are unable to
>> recognize that reclaimed pages are being stolen and limit the retries in
>> that case. The scenario seems to be uncommon, otherwise we'd be getting more
>> reports of that.
>
> Thanks this sounds like a viable explanation. For some reason I have
> thought that a single reclaim round can take that much time but reading
> again it seems like a cumulative thing.
>
> I do agree that we should consider the above scenario when bailing out.
> It is simply unreasonable to reclaim so much memory without any forward
> progress.

A very simple patch to bail early eliminated the LONG stalls and
decreased the time of a bulk allocation request.

Recall, the use case is converting 1G huge pages to the corresponding
amount of 2MB huge pages. I have been told that 50GB is not an uncommon
amount of pages to 'convert'. So, test case is free 50GB hugetlb pages
followed by allocate 25600 2MB hugetlb pages. I have not done enough
runs to get anything statistically valid, but here are some ballpark
numbers.

Unmodified baseline:
-------------------
Unexpected number of 2MB pages after demote
Expected 25600, have 19944

real 0m42.092s
user 0m0.008s
sys 0m41.467s

Retries capped by patch below:
------------------------------
Unexpected number of 2MB pages after demote
Expected 25600, have 19395

real 0m12.951s
user 0m0.010s
sys 0m12.840s


The time of the operation is certainly better, but do note that we
allocated 549 fewer pages. So, bailing early might cause some failures
if we continue trying. It is all a tradeoff. One of the reasons for
considering demote functionality is that the conversion would be done in
the hugetlb code without getting the page allocator, recalim,
compaction, etc involved.

> A more robust way to address this problem (which is not really new)
> would be to privatize reclaimed pages in the direct reclaim context and
> reuse them in the compaction so that it doesn't have to just
> pro-actively reclaim to keep watermarks good enough. A normal low order
> requests would benefit from that as well because the reclaim couldn't be
> stolen from them as well.

That does sound interesting. Perhaps a longer term project.

> Another approach would be to detect the situation and treat reclaim
> success which immediatelly leads to compaction deferral due to
> watermarks as a failure.

I'll look into detecting and responding to this.

Simple limit retries patch:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9c..16f3055 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4896,6 +4896,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
int no_progress_loops;
unsigned int cpuset_mems_cookie;
int reserve_flags;
+ unsigned long tries = 0, max_tries = 0;

/*
* We also sanity check to catch abuse of atomic reserves being used by
@@ -4904,6 +4905,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
gfp_mask &= ~__GFP_ATOMIC;
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ max_tries = 1Ul << order;

retry_cpuset:
compaction_retries = 0;
@@ -4996,6 +4999,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
}

retry:
+ tries++;
/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
if (alloc_flags & ALLOC_KSWAPD)
wake_all_kswapds(order, gfp_mask, ac);
@@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
if (did_some_progress > 0 &&
should_compact_retry(ac, order, alloc_flags,
compact_result, &compact_priority,
- &compaction_retries))
+ &compaction_retries)) {
+ /*
+ * In one pathological case, pages can be stolen immediately
+ * after reclaimed. It looks like we are making progress, and
+ * compaction_retries is not incremented. This could cause
+ * an indefinite number of retries. Cap the number of retries
+ * for costly orders.
+ */
+ if (max_tries && tries > max_tries)
+ goto nopage;
goto retry;
+ }


/* Deal with possible cpuset update races before we start OOM killing */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeae2f6..519e16e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2894,10 +2894,14 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
struct lruvec *target_lruvec;
bool reclaimable = false;
unsigned long file;
+ unsigned long tries = 0, max_tries = 0;

target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+ max_tries = 1UL << sc->order;

again:
+ tries++;
memset(&sc->nr, 0, sizeof(sc->nr));

nr_reclaimed = sc->nr_reclaimed;
@@ -3066,9 +3070,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
wait_iff_congested(BLK_RW_ASYNC, HZ/10);

if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
- sc))
+ sc)) {
+ /*
+ * In one pathological case, pages can be stolen immediately
+ * after being reclaimed. This can cause an indefinite number
+ * of retries. Limit the number of retries for costly orders.
+ */
+ if (!current_is_kswapd() &&
+ max_tries && tries > max_tries &&
+ sc->nr_reclaimed > sc->nr_to_reclaim)
+ goto done;
goto again;
+ }

+done:
/*
* Kswapd gives up on balancing particular nodes after too
* many failures to reclaim anything from them and goes to


--
Mike Kravetz