Re: [PATCH 14/15] mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup
From: Dan Williams
Date: Fri Oct 02 2015 - 17:53:33 EST
On Fri, Oct 2, 2015 at 2:21 PM, Logan Gunthorpe <logang@xxxxxxxxxxxx> wrote:
> Hi Dan,
>
> We've been doing some experimenting and testing with this patchset.
> Specifically, we are trying to use you're ZONE_DEVICE work to enable
> peer to peer PCIe transfers. This is actually working pretty well
> (though we're still testing and working through some things).
Hmm, I didn't have peer-to-peer PCI-E in mind for this mechanism, but
the test report is welcome nonetheless. The definition of dma_addr_t
is the device view of host memory, not necessarily the device view of
a peer device's memory range, so I expect you'll run into issues with
IOMMUs and other parts of the kernel that assume this definition.
>
> However, we've found a couple of issues:
>
> On Wed, Sep 23, 2015 at 12:42:27AM -0400, Dan Williams wrote:
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 3d6baa7d4534..20097e7b679a 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -49,12 +49,16 @@ struct page {
>> * updated asynchronously */
>> union {
>> struct address_space *mapping; /* If low bit clear, points to
>> - * inode address_space, or NULL.
>> + * inode address_space, unless
>> + * the page is in ZONE_DEVICE
>> + * then it points to its parent
>> + * dev_pagemap, otherwise NULL.
>> * If page mapped as anonymous
>> * memory, low bit is set, and
>> * it points to anon_vma object:
>> * see PAGE_MAPPING_ANON below.
>> */
>> + struct dev_pagemap *pgmap;
>> void *s_mem; /* slab first object */
>> };
>
>
> When you add to this union and overide the mapping value, we see bugs
> in calls to set_page_dirty when it tries to dereference mapping. I believe
> a change to page_mapping is required such as the patch that's at the end of
> this email.
Yes, this location for dev_pagemap will not work. I've since moved it
to a union with the lru list_head since ZONE_DEVICE pages memory
should always have an elevated page count and never land on a slab
allocator lru.
>> diff --git a/mm/gup.c b/mm/gup.c
>> index a798293fc648..1064e9a489a4 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -98,7 +98,16 @@ retry:
>> }
>>
>> page = vm_normal_page(vma, address, pte);
>> - if (unlikely(!page)) {
>> + if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
>> + /*
>> + * Only return device mapping pages in the FOLL_GET case since
>> + * they are only valid while holding the pgmap reference.
>> + */
>> + if (get_dev_pagemap(pte_pfn(pte), NULL))
>> + page = pte_page(pte);
>> + else
>> + goto no_page;
>> + } else if (unlikely(!page)) {
>
> I've found that if a driver creates a ZONE_DEVICE mapping but doesn't
> create the pagemap (using devm_register_pagemap) then the get_user_pages code
> will go into an infinite loop. I'm not really sure if this as an issue or
> not but it seems a bit undesirable for a buggy driver to be able to cause this.
>
> My thoughts are that either devm_register_pagemap needs to be done by
> devm_memremap_pages so a driver cannot use one without the other,
> or the GUP code needs to return EFAULT if no pagemap was registered so
> it doesn't loop forever.
Exactly, we should fail (-EFAULT) get_user_pages() in that case since
we don't have a mechanism to pin down the mapping. I'll track down
what's causing the loop.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/