[RFC] RDMA/umem: pin_user_pages*() can temporarily fail due to migration glitches

From: John Hubbard
Date: Tue Apr 30 2024 - 20:31:35 EST


pin_user_pages*() occasionally fails due to migrate_pages() failures
that, in turn, are due to temporarily elevated folio refcounts. This
happens because a few years ago, pin_user_pages*() APIs were upgraded to
automatically migrate pages away from ZONE_MOVABLE, but the callers were
not upgraded to handle any migration failures. And in fact, they can't
easily do so anyway, because the migration return code was filtered out:
-EAGAIN failures from migration are squashed, along with any other
failure, into -ENOMEM, thus hiding details from the upper layer callers.

One failure case that we ran into recently looks like this:

pin_user_pages_fast()
internal_get_user_pages_fast()
__gup_longterm_locked()
check_and_migrate_movable_pages()
migrate_longterm_unpinnable_pages()
migrate_pages()
migrate_pages_batch(..., NR_MAX_MIGRATE_PAGES_RETRY==10)
migrate_folio_move()
move_to_new_folio()
migrate_folio()
migrate_folio_extra()
folio_migrate_mapping()
if (folio_ref_count(folio) != expected_count)
return -EAGAIN;
// ...and migrate_longterm_unpinnable_pages()
// translates -EAGAIN to -ENOMEM

Although so far I have not pinpointed the cause of such transient
refcount increases, these are sufficiently common (and expected by the
entire design) that I think we have enough information to proceed
directly to a fix. This patch shows my preferred solution, which does
the following:

a) Restore the -EAGAIN return code: pass it unchanged all the way back
to pin_user_pages*() callers.

b) Then, retry pin_user_pages_fast() from ib_umem_get(), because that IB
driver is displaying real failures in the field, and we've confirmed
that a retry at this location will fix those failures. Retrying at this
higher level (as compared to the pre-existing, low-level retry in
migrate_pages_batch()) allows more things to happen, and more time to
elapse, between retries.

Cc: Mike Marciniszyn <mike.marciniszyn@xxxxxxxxx>
Cc: Leon Romanovsky <leon@xxxxxxxxxx>
Cc: Artemy Kovalyov <artemyko@xxxxxxxxxx>
Cc: Michael Guralnik <michaelgur@xxxxxxxxxx>
Cc: Alistair Popple <apopple@xxxxxxxxxx>
Cc: Pak Markthub <pmarkthub@xxxxxxxxxx>
Cc: Jason Gunthorpe <jgg@xxxxxxxxxx>
Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx>
---
drivers/infiniband/core/umem.c | 33 ++++++++++++++++++++++++++-------
mm/gup.c | 14 +++++++++++---
2 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b699..7c691faacc8a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -210,15 +210,34 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,

while (npages) {
cond_resched();
- pinned = pin_user_pages_fast(cur_base,
- min_t(unsigned long, npages,
- PAGE_SIZE /
- sizeof(struct page *)),
- gup_flags, page_list);
- if (pinned < 0) {
+ pinned = -ENOMEM;
+ int attempts = 0;
+ /*
+ * pin_user_pages_fast() can return -EAGAIN, due to falling back
+ * to gup-slow and then failing to migrate pages out of
+ * ZONE_MOVABLE due to a transient elevated page refcount.
+ *
+ * One retry is enough to avoid this problem, so far, but let's
+ * use a slightly higher retry count just in case even larger
+ * systems have a longer-lasting transient refcount problem.
+ *
+ */
+ static const int MAX_ATTEMPTS = 3;
+
+ while (pinned == -EAGAIN && attempts < MAX_ATTEMPTS) {
+ pinned = pin_user_pages_fast(cur_base,
+ min_t(unsigned long,
+ npages, PAGE_SIZE /
+ sizeof(struct page *)),
+ gup_flags, page_list);
ret = pinned;
- goto umem_release;
+ attempts++;
+
+ if (pinned == -EAGAIN)
+ continue;
}
+ if (pinned < 0)
+ goto umem_release;

cur_base += pinned * PAGE_SIZE;
npages -= pinned;
diff --git a/mm/gup.c b/mm/gup.c
index 1611e73b1121..edb069f937cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2141,15 +2141,23 @@ static int migrate_longterm_unpinnable_pages(
}

if (!list_empty(movable_page_list)) {
+ int rc;
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
.gfp_mask = GFP_USER | __GFP_NOWARN,
};

- if (migrate_pages(movable_page_list, alloc_migration_target,
+ rc = migrate_pages(movable_page_list, alloc_migration_target,
NULL, (unsigned long)&mtc, MIGRATE_SYNC,
- MR_LONGTERM_PIN, NULL)) {
- ret = -ENOMEM;
+ MR_LONGTERM_PIN, NULL);
+ if (rc) {
+ /*
+ * Report any failure *except* -EAGAIN as "not enough
+ * memory". -EAGAIN is valuable because callers further
+ * up the call stack can benefit from a retry.
+ */
+ if (rc != -EAGAIN)
+ ret = -ENOMEM;
goto err;
}
}

base-commit: 18daea77cca626f590fb140fc11e3a43c5d41354
--
2.45.0