[PATCH 2/2] mm: set PG_dma_pinned on get_user_pages*()

From: john.hubbard
Date: Sat Jun 16 2018 - 21:25:53 EST


From: John Hubbard <jhubbard@xxxxxxxxxx>

This fixes a few problems that come up when using devices (NICs, GPUs,
for example) that want to have direct access to a chunk of system (CPU)
memory, so that they can DMA to/from that memory. Problems [1] come up
if that memory is backed by persistent storage; for example, an ext4
file system. I've been working on several customer bugs that are hitting
this, and this patchset fixes those bugs.

The bugs happen via the following sequence (sketched in code below):

1) get_user_pages() on some ext4-backed pages

2) device does DMA for a while to/from those pages

   a) Somewhere in here, some of the pages get disconnected from the
      file system, via try_to_unmap() and eventually drop_buffers()

3) device is all done, device driver calls set_page_dirty_lock(), then
   put_page()
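
For illustration, the driver-side pattern behind steps 1) through 3) looks
roughly like this. This is only a sketch: do_device_dma() is a hypothetical
stand-in for the driver's real DMA machinery, and most error handling is
omitted.

static void do_device_dma(struct page **pages, int nr); /* hypothetical */

static int pin_and_dma(unsigned long user_addr, int nr_pages)
{
	struct page **pages;
	int got, i;

	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* 1) Pin the user pages; they may be ext4-backed. */
	got = get_user_pages_fast(user_addr, nr_pages, 1, pages);
	if (got <= 0)
		goto out;

	/* 2) The device DMAs to/from the pinned pages for a while. */
	do_device_dma(pages, got);

	/* 3) All done: dirty each page, then release it. */
	for (i = 0; i < got; i++) {
		set_page_dirty_lock(pages[i]);
		put_page(pages[i]);
	}
out:
	kfree(pages);
	return got < 0 ? got : 0;
}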

And then at some point, we see this BUG():

kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
backtrace:
ext4_writepage
__writepage
write_cache_pages
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork

...which is due to the file system asserting, via the page_buffers()
macro, that there are still buffer heads attached:

#define page_buffers(page)					\
	({							\
		BUG_ON(!PagePrivate(page));		\
		((struct buffer_head *)page_private(page)); \
	})
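
Written out long-hand, that check is equivalent to the following function
(illustration only, not new code). The point is the ordering: by the time
writeback reaches ext4_writepage(), drop_buffers() in step 2a has already
detached the buffer heads and cleared PG_private, so the BUG_ON() fires.

static struct buffer_head *page_buffers_longhand(struct page *page)
{
	/* False here, because drop_buffers() already ran: */
	BUG_ON(!PagePrivate(page));
	return (struct buffer_head *)page_private(page);
}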

How to fix this:
----------------
Introduce a new page flag: PG_dma_pinned, and set this flag on
all pages that are returned by the get_user_pages*() family of
functions. Leave it set for nearly the page's entire lifetime: it is
only cleared when the page is freed and then reallocated (in
post_alloc_hook()).

Then, check this flag before attempting to unmap pages. This causes a
very early return from try_to_unmap_one(), which avoids, notably,
removing page buffers via drop_buffers().

This uses a new struct page flag, but only on 64-bit systems.

Obviously, this is heavy-handed, but given the long, broken history of
get_user_pages in combination with file-backed memory, and given the
problems with alternative designs, it's a reasonable fix for now: small,
simple, and easy to revert if and when a more comprehensive design solution
is chosen.

Some alternatives, and why they were not taken:

1. It would be better, if possible, to clear PG_dma_pinned once all
get_user_pages() callers have returned the page (via something more
specific than put_page()), but that would significantly change the usage
for get_user_pages() callers. That's too intrusive for such a widely used
and old API, so let's leave it alone.

Also, such a design would require a new counter that would be associated
with each page. There's no room in struct page, so it would require
separate tracking, which is not acceptable for general page management.
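
For comparison, that rejected design would imply a call pair roughly like
the following. These names are hypothetical, and nothing like this exists
in this patch:

/*
 * HYPOTHETICAL sketch of alternative 1 -- NOT part of this patch. The
 * pin count cannot live in struct page (no room), so it would require
 * separate, out-of-line tracking keyed by the page.
 */

/* Pin pages as get_user_pages() does, but also increment each page's
 * pin count after setting PG_dma_pinned. */
long get_user_pages_tracked(unsigned long start, unsigned long nr_pages,
			    unsigned int gup_flags, struct page **pages);

/* Counterpart to put_page(): decrement the page's pin count, and clear
 * PG_dma_pinned only when that count reaches zero. */
void put_user_page(struct page *page);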

2. There are other, more complicated approaches [2], but these depend on
solving very specific call paths that are, in the end, just downstream
effects of the root cause. As such, they did not actually fix the
customer bugs that I was working on.

References:

[1] https://lwn.net/Articles/753027/ : "The trouble with get_user_pages()"

[2] https://marc.info/?l=linux-mm&m=<20180521143830.GA25109@xxxxxxxxxxxxxxxxxxxxxx>
(Matthew Wilcox listed two ideas here)

Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx>
---
 include/linux/page-flags.h     |  9 +++++++++
 include/trace/events/mmflags.h |  9 ++++++++-
 mm/gup.c                       | 11 +++++++++--
 mm/page_alloc.c                |  1 +
 mm/rmap.c                      |  2 ++
 5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 901943e4754b..ad65a2af069a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -100,6 +100,9 @@ enum pageflags {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
 	PG_young,
 	PG_idle,
+#endif
+#if defined(CONFIG_64BIT)
+	PG_dma_pinned,
 #endif
 	__NR_PAGEFLAGS,
 
@@ -381,6 +384,12 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#if defined(CONFIG_64BIT)
+PAGEFLAG(DmaPinned, dma_pinned, PF_ANY)
+#else
+PAGEFLAG_FALSE(DmaPinned)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb76d89..f62fd150b0d4 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#if defined(CONFIG_64BIT)
+#define IF_HAVE_PG_DMA_PINNED(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_DMA_PINNED(flag,string)
+#endif
+
 #define __def_pageflag_names					\
 	{1UL << PG_locked,		"locked"	},	\
 	{1UL << PG_waiters,		"waiters"	},	\
@@ -104,7 +110,8 @@ IF_HAVE_PG_MLOCK(PG_mlocked,	"mlocked"	) \
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	) \
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	) \
 IF_HAVE_PG_IDLE(PG_young,		"young"		) \
-IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
+IF_HAVE_PG_IDLE(PG_idle,		"idle"		) \
+IF_HAVE_PG_DMA_PINNED(PG_dma_pinned,	"dma_pinned"	)
 
 #define show_page_flags(flags) \
 	(flags) ? __print_flags(flags, "|", \
diff --git a/mm/gup.c b/mm/gup.c
index 73f0b3316fa7..fd6c77f33c16 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -659,7 +659,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *nonblocking)
 {
-	long i = 0;
+	long i = 0, j;
 	int err = 0;
 	unsigned int page_mask;
 	struct vm_area_struct *vma = NULL;
@@ -764,6 +764,10 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 	} while (nr_pages);
 
 out:
+	if (pages)
+		for (j = 0; j < i; j++)
+			SetPageDmaPinned(pages[j]);
+
 	return i ? i : err;
 }
 
@@ -1843,7 +1847,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages)
 {
 	unsigned long addr, len, end;
-	int nr = 0, ret = 0;
+	int nr = 0, ret = 0, i;
 
 	start &= PAGE_MASK;
 	addr = start;
@@ -1864,6 +1868,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		ret = nr;
 	}
 
+	for (i = 0; i < nr; i++)
+		SetPageDmaPinned(pages[i]);
+
 	if (nr < nr_pages) {
 		/* Try to get the remaining pages with get_user_pages */
 		start += nr << PAGE_SHIFT;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100f1e63..a96a7b20037c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1898,6 +1898,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 {
 	set_page_private(page, 0);
 	set_page_refcounted(page);
+	ClearPageDmaPinned(page);
 
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
diff --git a/mm/rmap.c b/mm/rmap.c
index 6db729dc4c50..37576f0a4645 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1360,6 +1360,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				flags & TTU_SPLIT_FREEZE, page);
 	}
 
+	if (PageDmaPinned(page))
+		return false;
 	/*
 	 * We have to assume the worse case ie pmd for invalidation. Note that
 	 * the page can not be free in this function as call of try_to_unmap()
--
2.17.1