Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address

From: Michael S. Tsirkin
Date: Sun Dec 30 2018 - 13:30:29 EST

Next message: Sinan Kaya: "Re: [PATCH] dmaengine: qcom_hidma: Check for driver register failure"
Previous message: James Bottomley: "Re: [PATCH v8 01/25] scsi/atari_scsi: Don't select CONFIG_NVRAM"
In reply to: Jason Wang: "Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Dec 27, 2018 at 05:39:21PM +0800, Jason Wang wrote:
>
> On 2018/12/26 äå11:02, Michael S. Tsirkin wrote:
> > On Wed, Dec 26, 2018 at 11:57:32AM +0800, Jason Wang wrote:
> > > On 2018/12/25 äå8:50, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 25, 2018 at 06:05:25PM +0800, Jason Wang wrote:
> > > > > On 2018/12/25 äå2:10, Michael S. Tsirkin wrote:
> > > > > > On Mon, Dec 24, 2018 at 03:53:16PM +0800, Jason Wang wrote:
> > > > > > > On 2018/12/14 äå8:36, Michael S. Tsirkin wrote:
> > > > > > > > On Fri, Dec 14, 2018 at 11:57:35AM +0800, Jason Wang wrote:
> > > > > > > > > On 2018/12/13 äå11:44, Michael S. Tsirkin wrote:
> > > > > > > > > > On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> > > > > > > > > > > It was noticed that the copy_user() friends that was used to access
> > > > > > > > > > > virtqueue metdata tends to be very expensive for dataplane
> > > > > > > > > > > implementation like vhost since it involves lots of software check,
> > > > > > > > > > > speculation barrier, hardware feature toggling (e.g SMAP). The
> > > > > > > > > > > extra cost will be more obvious when transferring small packets.
> > > > > > > > > > >
> > > > > > > > > > > This patch tries to eliminate those overhead by pin vq metadata pages
> > > > > > > > > > > and access them through vmap(). During SET_VRING_ADDR, we will setup
> > > > > > > > > > > those mappings and memory accessors are modified to use pointers to
> > > > > > > > > > > access the metadata directly.
> > > > > > > > > > >
> > > > > > > > > > > Note, this was only done when device IOTLB is not enabled. We could
> > > > > > > > > > > use similar method to optimize it in the future.
> > > > > > > > > > >
> > > > > > > > > > > Tests shows about ~24% improvement on TX PPS when using virtio-user +
> > > > > > > > > > > vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
> > > > > > > > > > >
> > > > > > > > > > > Before: ~5.0Mpps
> > > > > > > > > > > After: ~6.1Mpps
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Jason Wang<jasowang@xxxxxxxxxx>
> > > > > > > > > > > ---
> > > > > > > > > > > drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > > drivers/vhost/vhost.h | 11 +++
> > > > > > > > > > > 2 files changed, 189 insertions(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > > > > > > > index bafe39d2e637..1bd24203afb6 100644
> > > > > > > > > > > --- a/drivers/vhost/vhost.c
> > > > > > > > > > > +++ b/drivers/vhost/vhost.c
> > > > > > > > > > > @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > > > > > > > > > vq->indirect = NULL;
> > > > > > > > > > > vq->heads = NULL;
> > > > > > > > > > > vq->dev = dev;
> > > > > > > > > > > + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> > > > > > > > > > > + memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> > > > > > > > > > > + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> > > > > > > > > > > mutex_init(&vq->mutex);
> > > > > > > > > > > vhost_vq_reset(dev, vq);
> > > > > > > > > > > if (vq->handle_kick)
> > > > > > > > > > > @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > > > > > > > > > > spin_unlock(&dev->iotlb_lock);
> > > > > > > > > > > }
> > > > > > > > > > > +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> > > > > > > > > > > + size_t size, int write)
> > > > > > > > > > > +{
> > > > > > > > > > > + struct page **pages;
> > > > > > > > > > > + int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> > > > > > > > > > > + int npinned;
> > > > > > > > > > > + void *vaddr;
> > > > > > > > > > > +
> > > > > > > > > > > + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> > > > > > > > > > > + if (!pages)
> > > > > > > > > > > + return -ENOMEM;
> > > > > > > > > > > +
> > > > > > > > > > > + npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > > > > > > > > > + if (npinned != npages)
> > > > > > > > > > > + goto err;
> > > > > > > > > > > +
> > > > > > > > > > As I said I have doubts about the whole approach, but this
> > > > > > > > > > implementation in particular isn't a good idea
> > > > > > > > > > as it keeps the page around forever.
> > > > > > > The pages wil be released during set features.
> > > > > > >
> > > > > > >
> > > > > > > > > > So no THP, no NUMA rebalancing,
> > > > > > > For THP, we will probably miss 2 or 4 pages, but does this really matter
> > > > > > > consider the gain we have?
> > > > > > We as in vhost? networking isn't the only thing guest does.
> > > > > > We don't even know if this guest does a lot of networking.
> > > > > > You don't
> > > > > > know what else is in this huge page. Can be something very important
> > > > > > that guest touches all the time.
> > > > > Well, the probability should be very small consider we usually give several
> > > > > gigabytes to guest. The rest of the pages that doesn't sit in the same
> > > > > hugepage with metadata can still be merged by THP.Â Anyway, I can test the
> > > > > differences.
> > > > Thanks!
> > > >
> > > > > > > For NUMA rebalancing, I'm even not quite sure if
> > > > > > > it can helps for the case of IPC (vhost). It looks to me the worst case it
> > > > > > > may cause page to be thrash between nodes if vhost and userspace are running
> > > > > > > in two nodes.
> > > > > > So again it's a gain for vhost but has a completely unpredictable effect on
> > > > > > other functionality of the guest.
> > > > > >
> > > > > > That's what bothers me with this approach.
> > > > > So:
> > > > >
> > > > > - The rest of the pages could still be balanced to other nodes, no?
> > > > >
> > > > > - try to balance metadata pages (belongs to co-operate processes) itself is
> > > > > still questionable
> > > > I am not sure why. It should be easy enough to force the VCPU and vhost
> > > > to move (e.g. start them pinned to 1 cpu, then pin them to another one).
> > > > Clearly sometimes this would be necessary for load balancing reasons.
> > >
> > > Yes, but it looks to me the part of motivation of auto NUMA is to avoid
> > > manual pinning.
> > ... of memory. Yes.
> >
> >
> > > > With autonuma after a while (could take seconds but it will happen) the
> > > > memory will migrate.
> > > >
> > > Yes. As you mentioned during the discuss, I wonder we could do it similarly
> > > through mmu notifier like APIC access page in commit c24ae0dcd3e ("kvm: x86:
> > > Unpin and remove kvm_arch->apic_access_page")
> > That would be a possible approach.
>
>
> Yes, this looks possible, and the conversion seems not hard. Let me have a
> try with this.
>
>
> [...]
>
>
> > > > > > > > I don't see how a kthread makes any difference. We do have a validation
> > > > > > > > step which makes some difference.
> > > > > > > The problem is not kthread but the address of userspace address. The
> > > > > > > addresses of vq metadata tends to be consistent for a while, and vhost knows
> > > > > > > they will be frequently. SMAP doesn't help too much in this case.
> > > > > > >
> > > > > > > Thanks.
> > > > > > It's true for a real life applications but a malicious one
> > > > > > can call the setup ioctls any number of times. And SMAP is
> > > > > > all about malcious applications.
> > > > > We don't do this in the path of ioctl, there's no context switch between
> > > > > userspace and kernel in the worker thread. SMAP is used to prevent kernel
> > > > > from accessing userspace pages unexpectedly which is not the case for
> > > > > metadata access.
> > > > >
> > > > > Thanks
> > > > OK let's forget smap for now.
> > >
> > > Some numbers I measured:
> > >
> > > On an old Sandy bridge machine without SMAP support. Remove speculation
> > > barrier boost the performance from 4.6Mpps to 5.1Mpps
> > >
> > > On a newer Broadwell machine with SMAP support. Remove speculation barrier
> > > only gives 2%-5% improvement, disable SMAP completely through Kconfig boost
> > > 57% performance from 4.8Mpps to 7.5Mpps. (Vmap gives 6Mpps - 6.1Mpps, it
> > > only bypass SMAP for metadata).
> > >
> > > So it looks like for recent machine, SMAP becomes pain point when the copy
> > > is short (e.g 64B) for high PPS.
> > >
> > > Thanks
> > Thanks a lot for looking into this!
> >
> > So first of all users can just boot with nosmap, right?
> > What's wrong with that?
>
>
> Nothing wrong, just realize we had this kernel parameter.
>
>
> > Yes it's not fine-grained but OTOH
> > it's easy to understand.
> >
> > And I guess this confirms that if we are going to worry
> > about smap enabled, we need to look into packet copies
> > too, not just meta-data.
>
>
> For packet copies, we can do batch copy which is pretty simple for the case
> of XDP. I've already had patches for this.
>
>
> >
> > Vaguely could see a module option (off by default)
> > where vhost basically does user_access_begin
> > when it starts running, then uses unsafe accesses
> > in vhost and tun and then user_access_end.
>
>
> Using user_access_begin() is more tricky than imaged. E.g it requires:
>
> - userspace address to be validated before through access_ok() [1]

This part is fine I think - addresses come from the memory
map and when userspace supplies the memory map
we validate everything with access_ok.
Well do we validate with the iotlb too? Don't see it right now
so maybe not but it's easy to add.

> - It doesn't support calling a function that does explicit schedule since
> SMAP/PAN state is not maintained through schedule() [2]
>
> [1] https://lwn.net/Articles/736348/
>
> [2] https://lkml.org/lkml/2018/11/23/430
>
> So calling user_access_begin() all the time when vhost is running seems
> pretty dangerous.

Yes it requires some rework e.g. to try getting memory with
GFP_ATOMIC. We could then do a slow path with GFP_KERNEL
if that fails.

> For a better batched datacopy, I tend to build not only XDP but also skb in
> vhost in the future.
>
> Thanks

Sure, why not.

>
> >
> >
> > > > > > > > > Packet or AF_XDP benefit from
> > > > > > > > > accessing metadata directly, we should do it as well.
> > > > > > > > >
> > > > > > > > > Thanks

Next message: Sinan Kaya: "Re: [PATCH] dmaengine: qcom_hidma: Check for driver register failure"
Previous message: James Bottomley: "Re: [PATCH v8 01/25] scsi/atari_scsi: Don't select CONFIG_NVRAM"
In reply to: Jason Wang: "Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]