Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

From: John Hubbard
Date: Thu Dec 19 2019 - 16:16:50 EST


On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
On 12/19/19 5:26 AM, Leon Romanovsky wrote:
On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
Hi,

This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.

Hi John,

The patchset generates kernel panics in our IB testing. In our tests, we
allocated single memory block and registered multiple MRs using the single
block.

The possible bad flow is:
ib_umem_geti() ->
pin_user_pages_fast(FOLL_WRITE) ->
internal_get_user_pages_fast(FOLL_WRITE) ->
gup_pgd_range() ->
gup_huge_pd() ->
gup_hugepte() ->
try_grab_compound_head() ->

Hi Leon,

Thanks very much for the detailed report! So we're overflowing...

At first look, this seems likely to be hitting a weak point in the
GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
(there's a writeup in Documentation/core-api/pin_user_page.rst, lines
99-121). Basically it's pretty easy to overflow the page->_refcount
with huge pages if the pages have a *lot* of subpages.

We can only do about 7 pins on 1GB huge pages that use 4KB subpages.

Considering that establishing these pins is entirely under user
control, we can't have a limit here.

There's already a limit, it's just a much larger one. :) What does "no limit"
really mean, numerically, to you in this case?


If the number of allowed pins are exhausted then the
pin_user_pages_fast() must fail back to the user.


I'll poke around the IB call stack and see how much of that return path
is in place, if any. Because it's the same situation for get_user_pages_fast().
This code just added a warning on overflow so we could spot it early.


3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
test setup, but I have done only the tiniest bit of user space IB coding, so
if you have any test programs that aren't too hard to deal with that could
possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
that I'm not an advanced IB programmer. At all. :)

Clone this:

https://github.com/linux-rdma/rdma-core.git

Install all the required deps to build it (notably cython), see the README.md

$ ./build.sh
$ build/bin/run_tests.py

If you get things that far I think Leon can get a reproduction for you


OK, here goes.

thanks,
--
John Hubbard
NVIDIA