Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?

From: Andrea Arcangeli
Date: Thu May 12 2016 - 11:57:44 EST


Hello Nicolas,

On Thu, May 12, 2016 at 05:31:52PM +0200, Nicolas Morey-Chaisemartin wrote:
>
>
> Le 05/12/2016 à 03:52 PM, Jerome Glisse a écrit :
> > On Thu, May 12, 2016 at 03:30:24PM +0200, Nicolas Morey-Chaisemartin wrote:
> >> Le 05/12/2016 à 11:36 AM, Jerome Glisse a écrit :
> >>> On Thu, May 12, 2016 at 08:07:59AM +0200, Nicolas Morey-Chaisemartin wrote:
> [...]
> >>>> With transparent_hugepage=never I can't see the bug anymore.
> >>>>
> >>> Can you test https://patchwork.kernel.org/patch/9061351/ with 4.5
> >>> (does not apply to 3.10) and without transparent_hugepage=never
> >>>
> >>> Jérôme
> >> Fails with 4.5 + this patch and with 4.5 + this patch + yours
> >>
> > There must be some bug in your code, we have upstream user that works
> > fine with the above combination (see drivers/vfio/vfio_iommu_type1.c)
> > i suspect you might be releasing the page pin too early (put_page()).
> In my previous tests, I checked the page before calling put_page() and it had already changed.
> I also checked that there are never multiple transfers into a single page at once.
> So I doubt it's that.
> >
> > If you really believe it is bug upstream we would need a dumb kernel
> > module that does gup like you do and that shows the issue. Right now
> > looking at code (assuming above patches applied) i can't see anything
> > that can go wrong with THP.
>
> The issue is that I doubt I'll be able to do that. We have had code running in production for at least a year without this issue showing up, and now a single test exhibits it.
> And some tweaks to the test (meaning its memory footprint in user space) can make the problem disappear.
>
> Is there a way to track what is happening to the THP? From the looks of it, the refcounts are changed behind my back. Would kgdb with a watchpoint work on this?
> Is there a less painful way?

Do you use fork()?

If you have threads and your DMA I/O granularity is smaller than
PAGE_SIZE, and a thread of the application in the parent or child is
writing to another part of the page, the I/O can get lost (worse, it
doesn't really get lost: it goes to the child by mistake, instead of
sticking to the "mm" where you executed get_user_pages). This is
practically a bug in fork(), but it's a known one. It can affect any
app that uses get_user_pages/O_DIRECT and fork(), uses threads, and
has an I/O granularity smaller than PAGE_SIZE.

The same bug cannot happen with KSM or other things that can wrprotect
a page out of the app's control, because everything out of app control
checks that there are no page pins before wrprotecting the page. So
it's up to the app to control fork().

To fix it, you should do one of the following: 1) use MADV_DONTFORK on
the pinned region, 2) prevent fork() from running while you have pins
taken with get_user_pages, or in any case while get_user_pages may be
running concurrently, 3) use a PAGE_SIZE I/O granularity and/or
prevent the threads from writing to the other part of the page while
DMA is running.

I'm not aware of other issues that could screw with page pins with THP
on kernels <= 4.4; if there were, everything would fall apart,
including O_DIRECT and qemu cache=none. The only issue I'm aware of
that can cause DMA to get lost with page pins is the aforementioned
one.

To debug it further, I would suggest starting by searching for fork()
calls, and adding MADV_DONTFORK to the pinned region if there is any
fork() in your testcase.
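That search can be done statically over the sources, or at runtime;
a sketch (the "src/" path and "./testcase" name are placeholders for
your tree and binary):

```shell
# Search the application sources for fork()/vfork() call sites
# ("src/" is an example path):
grep -rnE --include='*.c' '\b(v)?fork[[:space:]]*\(' src/

# Or confirm at runtime whether the testcase creates processes at all
# ("./testcase" is a placeholder; %process traces fork/vfork/clone/exec):
# strace -f -e trace=%process ./testcase
```

Note that fork() may also hide behind system(), popen(), or a library,
which the runtime trace would catch but the grep would miss.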

Without being allowed to see the source, there's not much else we can
do, considering there's no sign of unknown bugs in this area in
kernels <= 4.4.

All there is is the known bug above, but apps that could be affected
by it actively avoid it by using MADV_DONTFORK, as qemu cache=none
does.

Thanks,
Andrea