Re: [PATCH v4] vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusive

From: Alex Williamson
Date: Wed Jun 29 2016 - 22:54:14 EST


On Thu, 30 Jun 2016 10:40:23 +0800
Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:

> Hi Alex,
>
> On 2016/6/30 4:03, Alex Williamson wrote:
>
> > On Tue, 28 Jun 2016 13:47:23 -0600
> > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> >
> >> On Tue, 28 Jun 2016 18:09:46 +0800
> >> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
> >>
> >>> Hi, Alex
> >>>
> >>> On 2016/6/25 0:43, Alex Williamson wrote:
> >>>
> >>>> On Fri, 24 Jun 2016 23:37:02 +0800
> >>>> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>>> Hi, Alex
> >>>>>
> >>>>> On 2016/6/24 11:37, Alex Williamson wrote:
> >>>>>
> >>>>>> On Fri, 24 Jun 2016 10:52:58 +0800
> >>>>>> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>>>> On 2016/6/24 0:12, Alex Williamson wrote:
> >>>>>>>> On Mon, 30 May 2016 21:06:37 +0800
> >>>>>>>> Yongji Xie <xyjxie@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>>>>>> +static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >>>>>>>>> +{
> >>>>>>>>> + struct resource *res;
> >>>>>>>>> + int bar;
> >>>>>>>>> + struct vfio_pci_dummy_resource *dummy_res;
> >>>>>>>>> +
> >>>>>>>>> + INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >>>>>>>>> +
> >>>>>>>>> + for (bar = PCI_STD_RESOURCES; bar <= PCI_STD_RESOURCE_END; bar++) {
> >>>>>>>>> + res = vdev->pdev->resource + bar;
> >>>>>>>>> +
> >>>>>>>>> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP))
> >>>>>>>>> + goto no_mmap;
> >>>>>>>>> +
> >>>>>>>>> + if (!(res->flags & IORESOURCE_MEM))
> >>>>>>>>> + goto no_mmap;
> >>>>>>>>> +
> >>>>>>>>> + /*
> >>>>>>>>> + * The PCI core shouldn't set up a resource with a
> >>>>>>>>> + * type but zero size. But there may be bugs that
> >>>>>>>>> + * cause us to do that.
> >>>>>>>>> + */
> >>>>>>>>> + if (!resource_size(res))
> >>>>>>>>> + goto no_mmap;
> >>>>>>>>> +
> >>>>>>>>> + if (resource_size(res) >= PAGE_SIZE) {
> >>>>>>>>> + vdev->bar_mmap_supported[bar] = true;
> >>>>>>>>> + continue;
> >>>>>>>>> + }
> >>>>>>>>> +
> >>>>>>>>> + if (!(res->start & ~PAGE_MASK)) {
> >>>>>>>>> + /*
> >>>>>>>>> + * Add a dummy resource to reserve the remainder
> >>>>>>>>> + * of the exclusive page in case that hot-add
> >>>>>>>>> + * device's bar is assigned into it.
> >>>>>>>>> + */
> >>>>>>>>> + dummy_res = kzalloc(sizeof(*dummy_res), GFP_KERNEL);
> >>>>>>>>> + if (dummy_res == NULL)
> >>>>>>>>> + goto no_mmap;
> >>>>>>>>> +
> >>>>>>>>> + dummy_res->resource.start = res->end + 1;
> >>>>>>>>> + dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> >>>>>>>>> + dummy_res->resource.flags = res->flags;
> >>>>>>>>> + if (request_resource(res->parent,
> >>>>>>>>> + &dummy_res->resource)) {
> >>>>>>>>> + kfree(dummy_res);
> >>>>>>>>> + goto no_mmap;
> >>>>>>>>> + }
> >>>>>>>> Isn't it true that request_resource() only tells us that at a given
> >>>>>>>> point in time, no other drivers have reserved that resource? It seems
> >>>>>>>> like it does not guarantee that the resource isn't routed to another
> >>>>>>>> device or that another driver won't at some point attempt to request
> >>>>>>>> that same resource. So for example if a user constructs their initrd
> >>>>>>>> to bind vfio-pci to devices before other modules load, this
> >>>>>>>> request_resource() may succeed, at the expense of drivers loaded later
> >>>>>>>> now failing. The behavior will depend on driver load order and we're
> >>>>>>>> not actually insuring that the overflow resource is unused, just that
> >>>>>>>> we got it first. Can we do better? Am I missing something that
> >>>>>>>> prevents this? Thanks,
> >>>>>>>>
> >>>>>>>> Alex
> >>>>>>> Couldn't PCI resources allocator prevent this, which will find a
> >>>>>>> empty slot in the resource tree firstly, then try to request that
> >>>>>>> resource in allocate_resource() when a PCI device is probed.
> >>>>>>> And I'd like to know why a PCI device driver would attempt to
> >>>>>>> call request_resource()? Should this be done in PCI enumeration?
> >>>>>> Hi Yongji,
> >>>>>>
> >>>>>> Looks like most pci drivers call pci_request_regions(). From there the
> >>>>>> call path is:
> >>>>>>
> >>>>>> pci_request_selected_regions
> >>>>>> __pci_request_selected_regions
> >>>>>> __pci_request_region
> >>>>>> __request_mem_region
> >>>>>> __request_region
> >>>>>> __request_resource
> >>>>>>
> >>>>>> We see this driver ordering issue sometimes with users attempting to
> >>>>>> blacklist native pci drivers, trying to leave a device free for use by
> >>>>>> vfio-pci. If the device is a graphics card, the generic vesa or uefi
> >>>>>> driver can request device resources causing a failure when vfio-pci
> >>>>>> tries to request those same resources. I expect that unless it's a
> >>>>>> boot device, like vga in my example, the resources are not enabled
> >>>>>> until the driver opens the device, therefore the request_resource() call
> >>>>>> doesn't occur until that point.
> >>>>>>
> >>>>>> For another trivial example, look at /proc/iomem as you load and unload
> >>>>>> a driver, on my laptop with e1000e unloaded I see:
> >>>>>>
> >>>>>> e1200000-e121ffff : 0000:00:19.0
> >>>>>> e123e000-e123efff : 0000:00:19.0
> >>>>>>
> >>>>>> When e1000e is loaded, each of these becomes claimed by the e1000e
> >>>>>> driver:
> >>>>>>
> >>>>>> e1200000-e121ffff : 0000:00:19.0
> >>>>>> e1200000-e121ffff : e1000e
> >>>>>> e123e000-e123efff : 0000:00:19.0
> >>>>>> e123e000-e123efff : e1000e
> >>>>>>
> >>>>>> Clearly pci core knows the resource is associated with the device, but
> >>>>>> I don't think we're tapping into that with request_resource(), we're
> >>>>>> just potentially stealing resources that another driver might have
> >>>>>> claimed otherwise as I described above. That's my suspicion at
> >>>>>> least, feel free to show otherwise if it's incorrect. Thanks,
> >>>>>>
> >>>>>> Alex
> >>>>>>
> >>>>> Thanks for your explanation. But I still have one question.
> >>>>> Shouldn't PCI core have claimed all PCI device's resources
> >>>>> after probing those devices. If so, request_resource() will fail
> >>>>> when vfio-pci try to steal resources that another driver might
> >>>>> request later. Anything I missed here? Some device resources
> >>>>> would not be claimed in PCI core?
> >>>> Hi Yongji,
> >>>>
> >>>> I don't know what to say, this is not how the interface currently
> >>>> works. request_resource() is a driver level interface that tries to
> >>>> prevent drivers from claiming conflicting resources. In this patch
> >>>> you're trying to use it to probe whether a resource maps to another
> >>>> device. This is not what it does. request_resource() will happily let
> >>>> you claim any resource you want, so long as nobody else claimed it
> >>>> first. So the only case where the assumptions in this patch are valid
> >>>> is if we can guarantee that any potentially conflicting device has a
> >>>> driver loaded that has claimed those resources. As it is here,
> >>>> vfio-pci will happily attempt to request resources that might overlap
> >>>> with another device and might break drivers that haven't yet had a
> >>>> chance to probe their devices. I don't think that's acceptable.
> >>>> Thanks,
> >>>>
> >>>> Alex
> >>>>
> >>> I'm trying to get your point. Let me give an example here.
> >>> I'm not sure whether my understanding is right. Please
> >>> point it out if I'm wrong.
> >>>
> >>> We assume that there are two PCI devices like this:
> >>>
> >>> 240000000000-24feffffffff : /pciex@3fffe40400000
> >>> 240000000000-2400ffffffff : PCI Bus 0002:01
> >>> 240000000000-240000007fff : 0002:01:00.0
> >>> 240000000000-240000007fff : vfio-pci
> >>> 240000008000-24000000ffff : 0002:01:01.0
> >>> 240000008000-24000000ffff : lpfc
> >>>
> >>> Do you mean vfio-pci driver will succeed in requesting
> >>> dummy_res: [240000008000-24000000ffff] (PAGE_SIZE is 64K)
> >>> if it is loaded before lpfc driver? Like this:
> >>>
> >>> 240000000000-24feffffffff : /pciex@3fffe40400000
> >>> 240000000000-2400ffffffff : PCI Bus 0002:01
> >>> 240000000000-240000007fff : 0002:01:00.0
> >>> 240000000000-240000007fff : vfio-pci
> >>> 240000008000-24000000ffff : 0002:01:01.0
> >>> 240000008000-24000000ffff : <BAD> --> vfio-pci call
> >>> request_resource()
> >>>
> >>> Then lpfc driver will fail when it attempts to call
> >>> pci_request_regions() later.
> >> Yes, that is my supposition.
> >>
> >>> But is it possible that the dummy_res become the child of
> >>> the res: 0002:01:01.0? Wouldn't request_resource() fail when
> >>> it found parent res: PCI Bus 0002:01 already have conflict
> >>> child res: 0002:01:01.0.
> >>>
> >>> And I think the case that request_resource() will succeed
> >>> should like this:
> >>>
> >>> 240000000000-24feffffffff : /pciex@3fffe40400000
> >>> 240000000000-2400ffffffff : PCI Bus 0002:01
> >>> 240000000000-240000007fff : 0002:01:00.0
> >>> 240000010000-240000017fff : 0002:01:01.0
> >>>
> >>> There is a mem hole: [240000008000-24000000ffff] after
> >>> PCI probing. After loading drivers, the resources tree
> >>> will be:
> >>>
> >>> 240000000000-24feffffffff : /pciex@3fffe40400000
> >>> 240000000000-2400ffffffff : PCI Bus 0002:01
> >>> 240000000000-240000007fff : 0002:01:00.0
> >>> 240000000000-240000007fff : vfio-pci
> >>> 240000008000-24000000ffff : <BAD> ---> vfio-pci call
> >>> request_resource()
> >>> 240000010000-240000017fff : 0002:01:01.0
> >>> 240000010000-240000017fff : lpfc
> >> Ok, let's stop guessing about this. I modified your patch as follows
> >> so I could easily test it on a 4k host:
> >>
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -110,6 +110,9 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> >> return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> >> }
> >>
> >> +#define VFIO_64K_PAGE_SIZE (64*1024)
> >> +#define VFIO_64K_PAGE_MASK (~(VFIO_64K_PAGE_SIZE-1))
> >> +
> >> static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >> {
> >> struct resource *res;
> >> @@ -135,12 +138,13 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >> if (!resource_size(res))
> >> goto no_mmap;
> >>
> >> - if (resource_size(res) >= PAGE_SIZE) {
> >> + if (resource_size(res) >= VFIO_64K_PAGE_SIZE) {
> >> vdev->bar_mmap_supported[bar] = true;
> >> continue;
> >> }
> >>
> >> - if (!(res->start & ~PAGE_MASK)) {
> >> + if (!(res->start & ~VFIO_64K_PAGE_MASK)) {
> >> + int ret;
> >> /*
> >> * Add a dummy resource to reserve the remainder
> >> * of the exclusive page in case that hot-add
> >> @@ -151,10 +155,12 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >> goto no_mmap;
> >>
> >> dummy_res->resource.start = res->end + 1;
> >> - dummy_res->resource.end = res->start + PAGE_SIZE - 1;
> >> + dummy_res->resource.end = res->start + VFIO_64K_PAGE_SIZE - 1;
> >> dummy_res->resource.flags = res->flags;
> >> - if (request_resource(res->parent,
> >> - &dummy_res->resource)) {
> >> + ret = request_resource(res->parent,
> >> + &dummy_res->resource);
> >> + if (ret) {
> >> +dev_info(&vdev->pdev->dev, "Failed to request_resource %lx-%lx (%d)\n", dummy_res->resource.start, dummy_res->resource.end, ret);
> >> kfree(dummy_res);
> >> goto no_mmap;
> >> }
> >>
> >> IOW, enforce 64k for mmap regardless of PAGE_SIZE. Then I find a
> >> system configuration to test it:
> >>
> >> ee400000-ef4fffff : PCI Bus 0000:07
> >> ef480000-ef49ffff : 0000:07:00.0
> >> ef480000-ef483fff : 0000:08:10.0
> >> ef484000-ef487fff : 0000:08:10.2
> >> ef488000-ef48bfff : 0000:08:10.4
> >> ef48c000-ef48ffff : 0000:08:10.6
> >> ef490000-ef493fff : 0000:08:11.0
> >> ef494000-ef497fff : 0000:08:11.2
> >> ef498000-ef49bfff : 0000:08:11.4
> >> ef4a0000-ef4bffff : 0000:07:00.0
> >> ef4a0000-ef4a3fff : 0000:08:10.0
> >> ef4a4000-ef4a7fff : 0000:08:10.2
> >> ef4a8000-ef4abfff : 0000:08:10.4
> >> ef4ac000-ef4affff : 0000:08:10.6
> >> ef4b0000-ef4b3fff : 0000:08:11.0
> >> ef4b4000-ef4b7fff : 0000:08:11.2
> >> ef4b8000-ef4bbfff : 0000:08:11.4
> >>
> >> This is an 82576 NIC where each VF has two 16k BARs (0 & 3), where all
> >> the VF BAR0s are in a contiguous range and all the VF BAR3s are in a
> >> separate contiguous range. The igbvf driver is not loaded, so the
> >> other resources are free to be claimed, what happens?
> >>
> >> It looks like you're right, the request_resource() fails with:
> >>
> >> vfio-pci 0000:08:10.0: Failed to request_resource ef4a4000-ef4affff (-16)
> >> vfio-pci 0000:08:10.0: Failed to request_resource ef484000-ef48ffff (-16)
> >>
> >> So we get back -EBUSY which means we hit a conflict. I would have
> >> expected that this means our res->parent that we're using for
> >> request_resource() is only, for instance, ef480000-ef483fff (ie. the
> >> BAR itself) thus our request for ef484000-ef48ffff exceeds the end of
> >> the parent, but adding the parent resource range to the dev_info(), it
> >> actually shows the range being ef480000-ef49ffff, so the parent is
> >> actually the 07:00.0 resource. In fact, we can't even use
> >> request_resource() like this to claim the BAR itself, which is why we
> >> use pci_request_selected_regions(), which allows conflicts, putting the
> >> requested resource at the leaf of the tree.
> >>
> >> So I guess I retract this concern about the use of request_resource(),
> >> it does seem to behave as intended. Unless I can spot anything else or
> >> other reviewers have comments, I'll queue this into my next branch for
> >> v4.8. Thanks,
> >
> > Ok, one more test, I found that I have access to the following USB
> > devices:
> >
> > 00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05) (prog-if 20 [EHCI])
> > Region 0: Memory at f7a08000 (32-bit, non-prefetchable) [size=1K]
> >
> > 00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05) (prog-if 20 [EHCI])
> > Region 0: Memory at f7a07000 (32-bit, non-prefetchable) [size=1K]
> >
> > These are nicely mapped such that vfio-pci can claim the residual space
> > from the page, which results in the following in /proc/iomem:
> >
> > f7a07000-f7a073ff : 0000:00:1d.0
> > f7a07000-f7a073ff : vfio
> > f7a07400-f7a07fff : <BAD>
> > f7a08000-f7a083ff : 0000:00:1a.0
> > f7a08000-f7a083ff : vfio
> > f7a08400-f7a08fff : <BAD>
> >
> > I should have looked more closely at your previous reply, I didn't
> > think that "<BAD>" was literally the owner of these dummy resources.
> > Please fix this to report something that isn't going frighten users
> > and make small children cry. Thanks,
>
> Yeah, I also noticed that:-). Now I'm trying to find a proper
> name. What do you think about "vfio-pci (dummy)"?

How about "vfio sub-page reserved"? Thanks,

Alex