Re: [Question] Missing data after DMA read transfer - mm issue with transparent huge page?

From: Nicolas Morey Chaisemartin
Date: Tue May 10 2016 - 07:15:25 EST




Le 05/10/2016 à 12:01 PM, Jerome Glisse a écrit :
> On Tue, May 10, 2016 at 09:04:36AM +0200, Nicolas Morey Chaisemartin wrote:
>> Le 05/03/2016 à 12:11 PM, Jerome Glisse a écrit :
>>> On Mon, May 02, 2016 at 09:04:02PM -0700, Hugh Dickins wrote:
>>>> On Fri, 29 Apr 2016, Nicolas Morey Chaisemartin wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> This is a repost from a different address as it seems the previous one ended in Gmail junk due to a domain error..
>>>> linux-kernel is a very high volume list which few are reading:
>>>> that also will account for your lack of response so far
>>>> (apart from the indefatigable Alan).
>>>>
>>>> I've added linux-mm, and some people from another thread regarding
>>>> THP and get_user_pages() pins which has been discussed in recent days.
>>>>
>>>> Make no mistake, the issue you're raising here is definitely not the
>>>> same as that one (which is specifically about the new THP refcounting
>>>> in v4.5+, whereas you're reporting a problem you've seen in both a
>>>> v3.10-based kernel and in v4.5). But I think their heads are in
>>>> gear, much more so than mine, and likely to spot something.
>>>>
>>>>> I added more info found while blindly debugging the issue.
>>>>>
>>>>> Short version:
>>>>> I'm having an issue with direct DMA transfer from a device to host memory.
>>>>> It seems some of the data is not transferring to the appropriate page.
>>>>>
>>>>> Some more details:
>>>>> I'm debugging a home made PCI driver for our board (Kalray), attached to a x86_64 host running centos7 (3.10.0-327.el7.x86_64)
>>>>>
>>>>> In the current case, a userland application transfers back and forth data through read/write operations on a file.
>>>>> On the kernel side, it triggers DMA transfers through the PCI to/from our board memory.
>>>>>
>>>>> We followed what pretty much all docs said about direct I/O to user buffers:
>>>>>
>>>>> 1) get_user_pages() (in the current case, it's at most 16 pages at once)
>>>>> 2) convert to a scatterlist
>>>>> 3) pci_map_sg
>>>>> 4) eventually coalesce sg (Intel IOMMU is enabled, so it's usually possible)
>>>>> 4) A lot of DMA engine handling code, using the dmaengine layer and virt-dma
>>>>> 5) wait for transfer complete, in the mean time, go back to (1) to schedule more work, if any
>>>>> 6) pci_unmap_sg
>>>>> 7) for read (card2host) transfer, set_page_dirty_lock
>>>>> 8) page_cache_release
>>>>>
>>>>> In 99,9999% it works perfectly.
>>>>> However, I have one userland application where a few pages are not written by a read (card2host) transfer.
>>>>> The buffer is memset them to a different value so I can check that nothing has overwritten them.
>>>>>
>>>>> I know (PCI protocol analyser) that the data left our board for the "right" address (the one set in the sg by pci_map_sg).
>>>>> I tried reading the data between the pci_unmap_sg and the set_page_dirty, using
>>>>> uint32_t *addr = page_address(trans->pages[0]);
>>>>> dev_warn(&pdata->pdev->dev, "val = %x\n", *addr);
>>>>> and it has the expected value.
>>>>> But if I try to copy_from_user (using the address coming from userland, the one passed to get_user_pages), the data has not been written and I see the memset value.
>>>>>
>>>>> New infos:
>>>>>
>>>>> The issue happens with IOMMU on or off.
>>>>> I compiled a kernel with DMA_API_DEBUG enabled and got no warnings or errors.
>>>>>
>>>>> I digged a little bit deeper with my very small understanding of linux mm and I discovered that:
>>>>> * we are using transparent huge pages
>>>>> * the page 'not transferred' are the last few of a huge page
>>>>> More precisely:
>>>>> - We have several transfer in flight from the same user buffer
>>>>> - Each transfer is 16 pages long
>>>>> - At one point in time, we start transferring from another huge page (transfers are still in flight from the previous one)
>>>>> - When a transfer from the previous huge page completes, I dumped at the mapcount of the pages from the previous transfers,
>>>>> they are all to 0. The pages are still mapped to dma at this point.
>>>>> - A get_user_page to the address of the completed transfer returns return a different struct page * then the on I had.
>>>>> But this is before I have unmapped/put_page them back. From my understanding this should not have happened.
>>>>>
>>>>> I tried the same code with a kernel 4.5 and encountered the same issue
>>>>>
>>>>> Disabling transparent huge pages makes the issue disapear
>>>>>
>>>>> Thanks in advance
>>>> It does look to me as if pages are being migrated, despite being pinned
>>>> by get_user_pages(): and that would be wrong. Originally I intended
>>>> to suggest that THP is probably merely the cause of compaction, with
>>>> compaction causing the page migration. But you posted very interesting
>>>> details in an earlier mail on 27th April from <nmorey@xxxxxxxxx>:
>>>>
>>>>> I ran some more tests:
>>>>>
>>>>> * Test is OK if transparent huge tlb are disabled
>>>>>
>>>>> * For all the page where data are not transfered, and only those pages, a call to get_user_page(user vaddr) just before dma_unmap_sg returns a different page from the original one.
>>>>> [436477.927279] mppa 0000:03:00.0: org_page= ffffea0009f60080 cur page = ffffea00074e0080
>>>>> [436477.927298] page:ffffea0009f60080 count:0 mapcount:1 mapping: (null) index:0x2
>>>>> [436477.927314] page flags: 0x2fffff00008000(tail)
>>>>> [436477.927354] page dumped because: org_page
>>>>> [436477.927369] page:ffffea00074e0080 count:0 mapcount:1 mapping: (null) index:0x2
>>>>> [436477.927382] page flags: 0x2fffff00008000(tail)
>>>>> [436477.927421] page dumped because: cur_page
>>>>>
>>>>> I'm not sure what to make of this...
>>>> That (on the older kernel I think) seems clearly to show that a THP
>>>> itself has been migrated: which makes me suspect NUMA migration of
>>>> mispaced THPs - migrate_misplaced_transhuge_page(). I'd hoped to
>>>> find something obviously wrong there, but haven't quite managed
>>>> to bring my brain fully to bear on it, and hope the others Cc'ed
>>>> will do so more quickly (or spot the error of your ways instead).
>>>>
>>>> I do find it suspect, how the migrate_page_copy() is done rather
>>>> early, while the old page is still mapped in the pagetable. And
>>>> odd how it inserts the new pmd for a moment, before checking old
>>>> page_count and backing out. But I don't see how either of those
>>>> would cause the trouble you see, where the migration goes ahead.
>>> So i do not think there is a bug migrate_misplaced_transhuge_page()
>>> but i think something is wrong in it see attached patch. I still
>>> want to convince myself i am not missing anything before posting
>>> that one.
>>>
>>>
>>> Now about this bug, dumb question but do you do get_user_pages with
>>> write = 1 because if your device is writting to the page then you
>>> must set write to 1.
>>>
>>> get_user_pages(vaddr, nrpages, 1, 0|1, pages, NULL|vmas);
>>>
>>>
>>> Cheers,
>>> Jérôme
>>>
>>> 0001-mm-numa-thp-fix-assumptions-of-migrate_misplaced_tra.patch
>>>
>>>
>>> From 9ded2a5da75a5e736fb36a2c4e2511d9516ecc37 Mon Sep 17 00:00:00 2001
>>> From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@xxxxxxxxxx>
>>> Date: Tue, 3 May 2016 11:53:24 +0200
>>> Subject: [PATCH] mm/numa/thp: fix assumptions of
>>> migrate_misplaced_transhuge_page()
>>> MIME-Version: 1.0
>>> Content-Type: text/plain; charset=UTF-8
>>> Content-Transfer-Encoding: 8bit
>>>
>>> Fix assumptions in migrate_misplaced_transhuge_page() which is only
>>> call by do_huge_pmd_numa_page() itself only call by __handle_mm_fault()
>>> for pmd with PROT_NONE. This means that if the pmd stays the same
>>> then there can be no concurrent get_user_pages / get_user_pages_fast
>>> (GUP/GUP_fast). More over because migrate_misplaced_transhuge_page()
>>> only do something is page is map once then there can be no GUP from
>>> a different process. Finaly, holding the pmd lock assure us that no
>>> other part of the kernel will take an extre reference on the page.
>>>
>>> In the end this means that the failure code path should never be
>>> taken unless something is horribly wrong, so convert it to BUG_ON().
>>>
>>> Signed-off-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
>>> Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx
>>> Cc: Mel Gorman <mgorman@xxxxxxx>
>>> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
>>> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
>>> ---
>>> mm/migrate.c | 31 +++++++++++++++++++++----------
>>> 1 file changed, 21 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 6c822a7..6315aac 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1757,6 +1757,14 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>> pmd_t orig_entry;
>>>
>>> /*
>>> + * What we do here is only valid if pmd_protnone(entry) is true and it
>>> + * is map in only one vma numamigrate_isolate_page() takes care of that
>>> + * check.
>>> + */
>>> + if (!pmd_protnone(entry))
>>> + goto out_unlock;
>>> +
>>> + /*
>>> * Rate-limit the amount of data that is being migrated to a node.
>>> * Optimal placement is no good if the memory bus is saturated and
>>> * all the time is being spent migrating!
>>> @@ -1797,7 +1805,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>>> mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>>> ptl = pmd_lock(mm, pmd);
>>> if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
>>> -fail_putback:
>>> spin_unlock(ptl);
>>> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>>>
>>> @@ -1819,7 +1826,12 @@ fail_putback:
>>> goto out_unlock;
>>> }
>>>
>>> - orig_entry = *pmd;
>>> + /*
>>> + * We are holding the lock so no one can set a new pmd and original pmd
>>> + * is PROT_NONE thus no one can get_user_pages or get_user_pages_fast
>>> + * (GUP or GUP_fast) from this point on we can not fail.
>>> + */
>>> + orig_entry = entry;
>>> entry = mk_pmd(new_page, vma->vm_page_prot);
>>> entry = pmd_mkhuge(entry);
>>> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>>> @@ -1837,14 +1849,13 @@ fail_putback:
>>> set_pmd_at(mm, mmun_start, pmd, entry);
>>> update_mmu_cache_pmd(vma, address, &entry);
>>>
>>> - if (page_count(page) != 2) {
>>> - set_pmd_at(mm, mmun_start, pmd, orig_entry);
>>> - flush_pmd_tlb_range(vma, mmun_start, mmun_end);
>>> - mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
>>> - update_mmu_cache_pmd(vma, address, &entry);
>>> - page_remove_rmap(new_page, true);
>>> - goto fail_putback;
>>> - }
>>> + /* As said above no one can get reference on the old page nor through
>>> + * get_user_pages or get_user_pages_fast (GUP/GUP_fast) or through
>>> + * any other means. To get reference on huge page you need to hold
>>> + * pmd_lock and we are already holding that lock here and the page
>>> + * is only mapped once.
>>> + */
>>> + BUG_ON(page_count(page) != 2);
>>>
>>> mlock_migrate_page(new_page, page);
>>> page_remove_rmap(page, true);
>> Hi,
>>
>> I backported the patch to 3.10 (had to copy paste pmd_protnone defitinition from 4.5) and it's working !
>> I'll open a ticket in Redhat tracker to try and get this fixed in RHEL7.
>>
>> I have a dumb question though: how can we end up in numa/misplaced memory code on a single socket system?
>>
> This patch is not a fix, do you see bug message in kernel log ? Because if
> you do that it means we have a bigger issue.
I don't see any on my 3.10. I have DMA_API_DEBUG enabled but I don't think it has an impact.
> You did not answer one of my previous question, do you set get_user_pages
> with write = 1 as a paremeter ?
For the read from the device, yes:
down_read(&current->mm->mmap_sem);
res = get_user_pages(
current,
current->mm,
(unsigned long) iov->host_addr,
page_count,
(write_mode == 0) ? 1 : 0, /* write */
0, /* force */
&trans->pages[sg_o],
NULL);
up_read(&current->mm->mmap_sem);

> Also it would be a lot easier if you were testing with lastest 4.6 or 4.5
> not RHEL kernel as they are far appart and what might looks like same issue
> on both might be totaly different bugs.
Is a RPM from elrepo ok?
http://elrepo.org/linux/kernel/el7/SRPMS/

I don't know why but a simple make install from the kernel source tree won't boot. I'm working remotely so it's hard to debug.
Rebuilding a kernel RPM seems to work though.
> If you only really care about RHEL kernel then open a bug with Red Hat and
> you can add me in bug-cc <jglisse@xxxxxxxxxx>
I opened one. I'll CC you

Thanks

Nicolas