Re: [PATCH] memremap: Fix NULL pointer BUG in get_zone_device_page()

From: Dan Williams
Date: Wed Aug 24 2016 - 01:49:38 EST


On Tue, Aug 23, 2016 at 8:58 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> On Tue, Aug 23, 2016 at 7:53 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>> On Tue, Aug 23, 2016 at 6:29 PM, Kani, Toshimitsu <toshi.kani@xxxxxxx> wrote:
>>>> On Tue, Aug 23, 2016 at 4:47 PM, Kani, Toshimitsu <toshi.kani@xxxxxxx>
>>>> wrote:
>>>> > On Tue, 2016-08-23 at 15:32 -0700, Dan Williams wrote:
>>>> >> On Tue, Aug 23, 2016 at 11:43 AM, Toshi Kani <toshi.kani@xxxxxxx>
>>>> >> wrote:
>>>> > :
>>>> >> I'm not sure about this fix. The point of honoring
>>>> >> vmem_altmap_offset() is because a portion of the resource that is
>>>> >> passed to devm_memremap_pages() also contains the metadata info block
>>>> >> for the device. The offset says "use everything past this point for
>>>> >> pages". This may work for avoiding a crash, but it may corrupt info
>>>> >> block metadata in the process. Can you provide more information
>>>> >> about the failing scenario to be sure that we are not triggering a
>>>> >> fault on an address that is not meant to have a page mapping? I.e.
>>>> >> what is the host physical address of the page that caused this fault,
>>>> >> and is it valid?
>>>> >
>>>> > The fault address in question was the 2nd page of an NVDIMM range. I
>>>> > assumed this fault address was valid and needed to be handled. Here is
>>>> > some info about the base and patched cases. Let me know if you need
>>>> > more info.
>>>> >
>>>> > Base
>>>> > ====
>>>> >
>>>> > The following NVDIMM range was set to /dev/dax.
>>>>
>>>> With ndctl create-namespace or manually via sysfs? Specifically I'm
>>>> looking for what the 'align' attribute was set to when the
>>>> configuration was established. Can you provide a dump of the sysfs
>>>> attributes for the /dev/dax parent device?
>>>
>>> I used the ndctl command below.
>>> ndctl create-namespace -f -e namespace0.0 -m dax
>>>
>>> Here is additional info from my note for the base case.
>>>
>>> p {struct dev_pagemap} 0xffff88046d0453f0
>>> $3 = {
>>>   altmap = 0xffff88046d045410,
>>>   res = 0xffff88046d0453a8,
>>>   ref = 0xffff88046d0452f0,
>>>   dev = 0xffff880464790410
>>> }
>>>
>>> crash> p {struct vmem_altmap} 0xffff88046d045410
>>> $6 = {
>>>   base_pfn = 0x480000,
>>>   reserve = 0x2, // PHYS_PFN(SZ_8K)
>>>   free = 0x101fe,
>>>   align = 0x1fe,
>>>   alloc = 0x10000
>>> }
>>
>> Ah, so, on second look the 0x490200000 data offset looks correct. The
>> total size of the address range is 16GB which equates to 256MB needed
>> for struct page, plus 2MB more to re-align the data on the next 2MB
>> boundary.
>>
>> The question now is why is the guest faulting on an access to an
>> address less than 0x490200000?
>
> Does the attached patch fix this for you?

Sorry, it should be this much simpler patch, which also mirrors what
drivers/nvdimm/pmem.c is doing...

From 3369f0e825c12bb2f17c0f7d3ccecb7c60f645e0 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@xxxxxxxxx>
Date: Tue, 23 Aug 2016 19:59:31 -0700
Subject: [PATCH] dax: fix device-dax region base

The data offset for a dax region needs to account for an altmap
reservation in the resource range. Otherwise, device-dax is allowing
mappings directly into the memmap or device info-block area, with crash
signatures like the following:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff811ac851>] get_zone_device_page+0x11/0x30
Call Trace:
follow_devmap_pmd+0x298/0x2c0
follow_page_mask+0x275/0x530
__get_user_pages+0xe3/0x750
__gfn_to_pfn_memslot+0x1b2/0x450 [kvm]
? hrtimer_try_to_cancel+0x2c/0x120
? kvm_read_l1_tsc+0x55/0x60 [kvm]
try_async_pf+0x66/0x230 [kvm]
? kvm_host_page_size+0x90/0xa0 [kvm]
tdp_page_fault+0x130/0x280 [kvm]
kvm_mmu_page_fault+0x5f/0xf0 [kvm]
handle_ept_violation+0x94/0x180 [kvm_intel]
vmx_handle_exit+0x1d3/0x1440 [kvm_intel]
? atomic_switch_perf_msrs+0x6f/0xa0 [kvm_intel]
? vmx_vcpu_run+0x2d1/0x490 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x81d/0x16a0 [kvm]
? wake_up_q+0x44/0x80
kvm_vcpu_ioctl+0x33c/0x620 [kvm]
? __vfs_write+0x37/0x160
do_vfs_ioctl+0xa2/0x5d0
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x1a/0xa4

Cc: <stable@xxxxxxxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Fixes: ab68f2622136 ("/dev/dax, pmem: direct access to persistent memory")
Reported-by: Abhilash Kumar Mulumudi <m.abhilash-kumar@xxxxxxx>
Reported-by: Toshi Kani <toshi.kani@xxxxxxx>
Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
---
drivers/dax/pmem.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index dfb168568af1..1f01e98c83c7 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -116,6 +116,9 @@ static int dax_pmem_probe(struct device *dev)
 	if (rc)
 		return rc;
 
+	/* adjust the dax_region resource to the start of data */
+	res.start += le64_to_cpu(pfn_sb->dataoff);
+
 	nd_region = to_nd_region(dev->parent);
 	dax_region = alloc_dax_region(dev, nd_region->id, &res,
 			le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP);
--
2.5.5