Re: [PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()
From: Peter Xu
Date: Wed Dec 02 2020 - 10:46:44 EST
Hi, Stefan,
On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
> On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> > On Wed, Nov 25, 2020 at 01:05:25AM +0000, Justin He wrote:
> > > > I'd appreciate if you could explain why vfio needs to dma map some
> > > > PROT_NONE
> > >
> > > Virtiofs will map a PROT_NONE cache window region firstly, then remap the sub
> > > region of that cache window with read or write permission. I guess this might
> > > be an security concern. Just CC virtiofs expert Stefan to answer it more accurately.
> >
> > Yep. Since my previous sentence was cut off, I'll rephrase: I was thinking
> > whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> > PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.
>
> Userspace processes sometimes use PROT_NONE to reserve virtual address
> space. That way future mmap(NULL, ...) calls will not accidentally
> allocate an address from the reserved range.
>
> virtio-fs needs to do this because the DAX window mappings change at
> runtime. Initially the entire DAX window is just reserved using
> PROT_NONE. When it's time to mmap a portion of a file into the DAX
> window an mmap(fixed_addr, ...) call will be made.

Yes, I can understand the rationale for reserving the region. However, IMHO
the real question is why such reservation behavior should affect the qemu memory
layout, and even further the VFIO mappings.
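
To make sure we're talking about the same pattern, here is a minimal userspace
sketch of the reserve-then-remap scheme described above. It is not virtiofsd or
qemu code: reserve_window(), enable_subrange() and demo_window() are made-up
names, and a tmpfile() stands in for the file that would be mapped into the DAX
window.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve `len` bytes of address space with no access rights.
 * Returns MAP_FAILED on error. (Hypothetical helper.) */
static void *reserve_window(size_t len)
{
    return mmap(NULL, len, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/* Replace [win + off, win + off + len) with a read/write mapping of `fd`
 * starting at file offset 0. `off` and `len` must be page-aligned.
 * (Hypothetical helper.) */
static void *enable_subrange(void *win, size_t off, size_t len, int fd)
{
    return mmap((char *)win + off, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_FIXED, fd, 0);
}

/* Reserve a 16-page window, enable one page inside it, touch that page.
 * Returns 0 on success, -1 on failure. */
static int demo_window(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t win_len = 16 * page;
    int ret = -1;

    void *win = reserve_window(win_len);
    if (win == MAP_FAILED)
        return -1;

    FILE *f = tmpfile();            /* stand-in for the real backing file */
    if (f != NULL && ftruncate(fileno(f), (off_t)page) == 0) {
        char *p = enable_subrange(win, 4 * page, page, fileno(f));
        if (p != MAP_FAILED) {
            /* MAP_FIXED places the mapping exactly where we asked. */
            assert(p == (char *)win + 4 * page);
            p[0] = 'x';             /* only this sub-range is accessible */
            ret = (p[0] == 'x') ? 0 : -1;
        }
    }

    if (f != NULL)
        fclose(f);
    munmap(win, win_len);
    return ret;
}
```

In this sketch only the page replaced by enable_subrange() is accessible; the
other 15 pages of the window stay PROT_NONE.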

Note that PROT_NONE most likely means that there's no backing page at all in
this case. Since vfio pins all the pages before mapping them for DMA, the patch
is at least inefficient: when we map the whole PROT_NONE region we'll try to
fault in every single page of it, even though some of them may never be used.

So I still think this patch is not doing the right thing. Instead we should
somehow teach qemu that the virtiofs memory region should only cover the
enabled regions (those with PROT_READ|PROT_WRITE), rather than the whole
reserved PROT_NONE region.

Thanks,
--
Peter Xu