Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address

From: Michael S. Tsirkin
Date: Mon Sep 09 2019 - 00:46:03 EST


On Mon, Sep 09, 2019 at 10:18:57AM +0800, Jason Wang wrote:
>
> > On 2019/9/8 7:05 PM, Michael S. Tsirkin wrote:
> > On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote:
> > > This is a rework on the commit 7f466032dc9e ("vhost: access vq
> > > metadata through kernel virtual address").
> > >
> > > It was noticed that the copy_to/from_user() friends that were used to
> > > access virtqueue metadata tend to be very expensive for dataplane
> > > implementations like vhost, since they involve lots of software checks,
> > > speculation barriers,
> > So if we drop the speculation barrier,
> > there's a problem here: the access will now be speculated.
> > This effectively disables the defence-in-depth effect of commit
> > b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd
> > ("x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec").
> >
> >
> > So now we need to sprinkle array_index_nospec or barrier_nospec over the
> > code whenever we use an index we got from userspace.
> > See below for some examples.
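
For reference, a minimal sketch of the kind of clamp meant here (not part of
the patch; the placement is only an illustration): idx is assumed to come
from the guest/userspace and to index a ring of vq->num entries.

	#include <linux/nospec.h>

	/* Mask idx into range, then clamp it with array_index_nospec() so
	 * a speculated out-of-range value cannot be used as an array index.
	 */
	idx = array_index_nospec(idx & (vq->num - 1), vq->num);
	head = avail->ring[idx];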
> >
> >
> > > and hardware feature toggling (e.g. SMAP). The
> > > extra cost is more obvious when transferring small packets, since
> > > the time spent on metadata access becomes more significant.
> > >
> > > This patch tries to eliminate those overheads by accessing the
> > > metadata through a direct mapping of those pages. Invalidation
> > > callbacks are implemented for co-operation with general VM management
> > > (swap, KSM, THP or NUMA balancing). We try to get the direct mapping
> > > of the vq metadata before each round of packet processing if it
> > > doesn't exist. If that fails, we simply fall back to the
> > > copy_to/from_user() friends.
> > >
> > > The invalidation, direct mapping access and setup are synchronized
> > > through a spinlock. This is a step back from the original commit
> > > 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > address"), which tried to use RCU, which is suspicious and hard to
> > > review. This won't perform as well as RCU because of the atomics;
> > > that could be addressed by future optimization.
> > >
> > > This method does not work for highmem pages, which require a
> > > temporary mapping, so we just fall back to the normal
> > > copy_to/from_user(), and it may not work for archs with virtually
> > > tagged caches, since extra cache flushing is needed to eliminate the
> > > alias. That would result in complex logic and bad performance. For
> > > those archs, this patch simply goes for the copy_to/from_user()
> > > friends. This is done by ruling out the kernel mapping code through
> > > ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
> > >
> > > Note that this is only done when device IOTLB is not enabled. We
> > > could use a similar method to optimize the IOTLB case in the future.
> > >
> > > Tests show at most about a 22% improvement in TX PPS when using
> > > virtio-user + vhost_net + xdp1 + TAP on a 4.0GHz Kaby Lake.
> > >
> > >         SMAP on | SMAP off
> > > Before: 4.9Mpps | 6.9Mpps
> > > After:  6.0Mpps | 7.5Mpps
> > >
> > > On an older Sandy Bridge CPU without SMAP support, TX PPS doesn't see
> > > any difference.
> > Why isn't Kaby Lake with SMAP off the same as Sandy Bridge?
>
>
> I don't know, I guess it was because the atomic is l
>
>
> >
> >
> > > Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> > > Cc: James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
> > > Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
> > > Cc: David Miller <davem@xxxxxxxxxxxxx>
> > > Cc: Jerome Glisse <jglisse@xxxxxxxxxx>
> > > Cc: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
> > > Cc: linux-mm@xxxxxxxxx
> > > Cc: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx
> > > Cc: linux-parisc@xxxxxxxxxxxxxxx
> > > Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx>
> > > Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> > > ---
> > > drivers/vhost/vhost.c | 551 +++++++++++++++++++++++++++++++++++++++++-
> > > drivers/vhost/vhost.h | 41 ++++
> > > 2 files changed, 589 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 791562e03fe0..f98155f28f02 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
> > > __vhost_vq_meta_reset(d->vqs[i]);
> > > }
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +static void vhost_map_unprefetch(struct vhost_map *map)
> > > +{
> > > + kfree(map->pages);
> > > + kfree(map);
> > > +}
> > > +
> > > +static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
> > > + struct vhost_map *map, int index)
> > > +{
> > > + struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > + int i;
> > > +
> > > + if (uaddr->write) {
> > > + for (i = 0; i < map->npages; i++)
> > > + set_page_dirty(map->pages[i]);
> > > + }
> > > +}
> > > +
> > > +static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
> > > +{
> > > + struct vhost_map *map[VHOST_NUM_ADDRS];
> > > + int i;
> > > +
> > > + spin_lock(&vq->mmu_lock);
> > > + for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> > > + map[i] = vq->maps[i];
> > > + if (map[i]) {
> > > + vhost_set_map_dirty(vq, map[i], i);
> > > + vq->maps[i] = NULL;
> > > + }
> > > + }
> > > + spin_unlock(&vq->mmu_lock);
> > > +
> > > + /* No need for synchronization since we are serialized with
> > > + * memory accessors (e.g vq mutex held).
> > > + */
> > > +
> > > + for (i = 0; i < VHOST_NUM_ADDRS; i++)
> > > + if (map[i])
> > > + vhost_map_unprefetch(map[i]);
> > > +
> > > +}
> > > +
> > > +static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
> > > +{
> > > + int i;
> > > +
> > > + vhost_uninit_vq_maps(vq);
> > > + for (i = 0; i < VHOST_NUM_ADDRS; i++)
> > > + vq->uaddrs[i].size = 0;
> > > +}
> > > +
> > > +static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
> > > + unsigned long start,
> > > + unsigned long end)
> > > +{
> > > + if (unlikely(!uaddr->size))
> > > + return false;
> > > +
> > > + return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
> > > +}
> > > +
> > > +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> > > +{
> > > + spin_lock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> > > +{
> > > + spin_unlock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> > > + int index,
> > > + unsigned long start,
> > > + unsigned long end,
> > > + bool blockable)
> > > +{
> > > + struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > + struct vhost_map *map;
> > > +
> > > + if (!vhost_map_range_overlap(uaddr, start, end))
> > > + return 0;
> > > + else if (!blockable)
> > > + return -EAGAIN;
> > > +
> > > + spin_lock(&vq->mmu_lock);
> > > + ++vq->invalidate_count;
> > > +
> > > + map = vq->maps[index];
> > > + if (map)
> > > + vq->maps[index] = NULL;
> > > + spin_unlock(&vq->mmu_lock);
> > > +
> > > + if (map) {
> > > + vhost_set_map_dirty(vq, map, index);
> > > + vhost_map_unprefetch(map);
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
> > > + int index,
> > > + unsigned long start,
> > > + unsigned long end)
> > > +{
> > > + if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end))
> > > + return;
> > > +
> > > + spin_lock(&vq->mmu_lock);
> > > + --vq->invalidate_count;
> > > + spin_unlock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static int vhost_invalidate_range_start(struct mmu_notifier *mn,
> > > + const struct mmu_notifier_range *range)
> > > +{
> > > + struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> > > + mmu_notifier);
> > > + bool blockable = mmu_notifier_range_blockable(range);
> > > + int i, j, ret;
> > > +
> > > + for (i = 0; i < dev->nvqs; i++) {
> > > + struct vhost_virtqueue *vq = dev->vqs[i];
> > > +
> > > + for (j = 0; j < VHOST_NUM_ADDRS; j++) {
> > > + ret = vhost_invalidate_vq_start(vq, j,
> > > + range->start,
> > > + range->end, blockable);
> > > + if (ret)
> > > + return ret;
> > > + }
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static void vhost_invalidate_range_end(struct mmu_notifier *mn,
> > > + const struct mmu_notifier_range *range)
> > > +{
> > > + struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> > > + mmu_notifier);
> > > + int i, j;
> > > +
> > > + for (i = 0; i < dev->nvqs; i++) {
> > > + struct vhost_virtqueue *vq = dev->vqs[i];
> > > +
> > > + for (j = 0; j < VHOST_NUM_ADDRS; j++)
> > > + vhost_invalidate_vq_end(vq, j,
> > > + range->start,
> > > + range->end);
> > > + }
> > > +}
> > > +
> > > +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> > > + .invalidate_range_start = vhost_invalidate_range_start,
> > > + .invalidate_range_end = vhost_invalidate_range_end,
> > > +};
> > > +
> > > +static void vhost_init_maps(struct vhost_dev *dev)
> > > +{
> > > + struct vhost_virtqueue *vq;
> > > + int i, j;
> > > +
> > > + dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
> > > +
> > > + for (i = 0; i < dev->nvqs; ++i) {
> > > + vq = dev->vqs[i];
> > > + for (j = 0; j < VHOST_NUM_ADDRS; j++)
> > > + vq->maps[j] = NULL;
> > > + }
> > > +}
> > > +#endif
> > > +
> > > static void vhost_vq_reset(struct vhost_dev *dev,
> > > struct vhost_virtqueue *vq)
> > > {
> > > @@ -326,7 +502,11 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> > > vq->busyloop_timeout = 0;
> > > vq->umem = NULL;
> > > vq->iotlb = NULL;
> > > + vq->invalidate_count = 0;
> > > __vhost_vq_meta_reset(vq);
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + vhost_reset_vq_maps(vq);
> > > +#endif
> > > }
> > > static int vhost_worker(void *data)
> > > @@ -471,12 +651,15 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > dev->iov_limit = iov_limit;
> > > dev->weight = weight;
> > > dev->byte_weight = byte_weight;
> > > + dev->has_notifier = false;
> > > init_llist_head(&dev->work_list);
> > > init_waitqueue_head(&dev->wait);
> > > INIT_LIST_HEAD(&dev->read_list);
> > > INIT_LIST_HEAD(&dev->pending_list);
> > > spin_lock_init(&dev->iotlb_lock);
> > > -
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + vhost_init_maps(dev);
> > > +#endif
> > > for (i = 0; i < dev->nvqs; ++i) {
> > > vq = dev->vqs[i];
> > > @@ -485,6 +668,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> > > vq->heads = NULL;
> > > vq->dev = dev;
> > > mutex_init(&vq->mutex);
> > > + spin_lock_init(&vq->mmu_lock);
> > > vhost_vq_reset(dev, vq);
> > > if (vq->handle_kick)
> > > vhost_poll_init(&vq->poll, vq->handle_kick,
> > > @@ -564,7 +748,19 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> > > if (err)
> > > goto err_cgroup;
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
> > > + if (err)
> > > + goto err_mmu_notifier;
> > > +#endif
> > > + dev->has_notifier = true;
> > > +
> > > return 0;
> > > +
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +err_mmu_notifier:
> > > + vhost_dev_free_iovecs(dev);
> > > +#endif
> > > err_cgroup:
> > > kthread_stop(worker);
> > > dev->worker = NULL;
> > > @@ -655,6 +851,107 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > > spin_unlock(&dev->iotlb_lock);
> > > }
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +static void vhost_setup_uaddr(struct vhost_virtqueue *vq,
> > > + int index, unsigned long uaddr,
> > > + size_t size, bool write)
> > > +{
> > > + struct vhost_uaddr *addr = &vq->uaddrs[index];
> > > +
> > > + addr->uaddr = uaddr;
> > > + addr->size = size;
> > > + addr->write = write;
> > > +}
> > > +
> > > +static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
> > > +{
> > > + vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
> > > + (unsigned long)vq->desc,
> > > + vhost_get_desc_size(vq, vq->num),
> > > + false);
> > > + vhost_setup_uaddr(vq, VHOST_ADDR_AVAIL,
> > > + (unsigned long)vq->avail,
> > > + vhost_get_avail_size(vq, vq->num),
> > > + false);
> > > + vhost_setup_uaddr(vq, VHOST_ADDR_USED,
> > > + (unsigned long)vq->used,
> > > + vhost_get_used_size(vq, vq->num),
> > > + true);
> > > +}
> > > +
> > > +static int vhost_map_prefetch(struct vhost_virtqueue *vq,
> > > + int index)
> > > +{
> > > + struct vhost_map *map;
> > > + struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > + struct page **pages;
> > > + int npages = DIV_ROUND_UP(uaddr->size, PAGE_SIZE);
> > > + int npinned;
> > > + void *vaddr, *v;
> > > + int err;
> > > + int i;
> > > +
> > > + spin_lock(&vq->mmu_lock);
> > > +
> > > + err = -EFAULT;
> > > + if (vq->invalidate_count)
> > > + goto err;
> > > +
> > > + err = -ENOMEM;
> > > + map = kmalloc(sizeof(*map), GFP_ATOMIC);
> > > + if (!map)
> > > + goto err;
> > > +
> > > + pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC);
> > > + if (!pages)
> > > + goto err_pages;
> > > +
> > > + err = -EFAULT;
> > > + npinned = __get_user_pages_fast(uaddr->uaddr, npages,
> > > + uaddr->write, pages);
> > > + if (npinned > 0)
> > > + release_pages(pages, npinned);
> > > + if (npinned != npages)
> > > + goto err_gup;
> > > +
> > > + for (i = 0; i < npinned; i++)
> > > + if (PageHighMem(pages[i]))
> > > + goto err_gup;
> > > +
> > > + vaddr = v = page_address(pages[0]);
> > > +
> > > + /* For simplicity, fall back to the userspace address if the VA is
> > > + * not contiguous.
> > > + */
> > > + for (i = 1; i < npinned; i++) {
> > > + v += PAGE_SIZE;
> > > + if (v != page_address(pages[i]))
> > > + goto err_gup;
> > > + }
> > > +
> > > + map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1));
> > > + map->npages = npages;
> > > + map->pages = pages;
> > > +
> > > + vq->maps[index] = map;
> > > + /* No need for a synchronize_rcu(). This function should be
> > > + * called by dev->worker so we are serialized with all
> > > + * readers.
> > > + */
> > > + spin_unlock(&vq->mmu_lock);
> > > +
> > > + return 0;
> > > +
> > > +err_gup:
> > > + kfree(pages);
> > > +err_pages:
> > > + kfree(map);
> > > +err:
> > > + spin_unlock(&vq->mmu_lock);
> > > + return err;
> > > +}
> > > +#endif
> > > +
> > > void vhost_dev_cleanup(struct vhost_dev *dev)
> > > {
> > > int i;
> > > @@ -684,8 +981,20 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> > > kthread_stop(dev->worker);
> > > dev->worker = NULL;
> > > }
> > > - if (dev->mm)
> > > + if (dev->mm) {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + if (dev->has_notifier) {
> > > + mmu_notifier_unregister(&dev->mmu_notifier,
> > > + dev->mm);
> > > + dev->has_notifier = false;
> > > + }
> > > +#endif
> > > mmput(dev->mm);
> > > + }
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + for (i = 0; i < dev->nvqs; i++)
> > > + vhost_uninit_vq_maps(dev->vqs[i]);
> > > +#endif
> > > dev->mm = NULL;
> > > }
> > > EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> > > @@ -914,6 +1223,26 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> > > static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> > > {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + struct vhost_map *map;
> > > + struct vring_used *used;
> > > +
> > > + if (!vq->iotlb) {
> > > + vhost_vq_access_map_begin(vq);
> > > +
> > > + map = vq->maps[VHOST_ADDR_USED];
> > > + if (likely(map)) {
> > > + used = map->addr;
> > > + *((__virtio16 *)&used->ring[vq->num]) =
> > > + cpu_to_vhost16(vq, vq->avail_idx);
> > > + vhost_vq_access_map_end(vq);
> > > + return 0;
> > > + }
> > > +
> > > + vhost_vq_access_map_end(vq);
> > > + }
> > > +#endif
> > > +
> > > return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> > > vhost_avail_event(vq));
> > > }
> > > @@ -922,6 +1251,27 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> > > struct vring_used_elem *head, int idx,
> > > int count)
> > > {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + struct vhost_map *map;
> > > + struct vring_used *used;
> > > + size_t size;
> > > +
> > > + if (!vq->iotlb) {
> > > + vhost_vq_access_map_begin(vq);
> > > +
> > > + map = vq->maps[VHOST_ADDR_USED];
> > > + if (likely(map)) {
> > > + used = map->addr;
> > > + size = count * sizeof(*head);
> > > + memcpy(used->ring + idx, head, size);
> > > + vhost_vq_access_map_end(vq);
> > > + return 0;
> > > + }
> > > +
> > > + vhost_vq_access_map_end(vq);
> > > + }
> > > +#endif
> > > +
> > > return vhost_copy_to_user(vq, vq->used->ring + idx, head,
> > > count * sizeof(*head));
> > > }
> > > @@ -929,6 +1279,25 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> > > static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> > > {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + struct vhost_map *map;
> > > + struct vring_used *used;
> > > +
> > > + if (!vq->iotlb) {
> > > + vhost_vq_access_map_begin(vq);
> > > +
> > > + map = vq->maps[VHOST_ADDR_USED];
> > > + if (likely(map)) {
> > > + used = map->addr;
> > > + used->flags = cpu_to_vhost16(vq, vq->used_flags);
> > > + vhost_vq_access_map_end(vq);
> > > + return 0;
> > > + }
> > > +
> > > + vhost_vq_access_map_end(vq);
> > > + }
> > > +#endif
> > > +
> > > return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> > > &vq->used->flags);
> > > }
> > > @@ -936,6 +1305,25 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> > > static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> > > {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + struct vhost_map *map;
> > > + struct vring_used *used;
> > > +
> > > + if (!vq->iotlb) {
> > > + vhost_vq_access_map_begin(vq);
> > > +
> > > + map = vq->maps[VHOST_ADDR_USED];
> > > + if (likely(map)) {
> > > + used = map->addr;
> > > + used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> > > + vhost_vq_access_map_end(vq);
> > > + return 0;
> > > + }
> > > +
> > > + vhost_vq_access_map_end(vq);
> > > + }
> > > +#endif
> > > +
> > > return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> > > &vq->used->idx);
> > > }
> > > @@ -981,12 +1369,50 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
> > > static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> > > __virtio16 *idx)
> > > {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + struct vhost_map *map;
> > > + struct vring_avail *avail;
> > > +
> > > + if (!vq->iotlb) {
> > > + vhost_vq_access_map_begin(vq);
> > > +
> > > + map = vq->maps[VHOST_ADDR_AVAIL];
> > > + if (likely(map)) {
> > > + avail = map->addr;
> > > + *idx = avail->idx;
> > index can now be speculated.
>
> [...]
>
>
> > + vhost_vq_access_map_begin(vq);
> > +
> > + map = vq->maps[VHOST_ADDR_AVAIL];
> > + if (likely(map)) {
> > + avail = map->addr;
> > + *head = avail->ring[idx & (vq->num - 1)];
> >
> > Since idx can be speculated, I guess we need array_index_nospec here?
>
>
> So we have
>
> ACQUIRE(mmu_lock)
>
> get idx
>
> RELEASE(mmu_lock)
>
> ACQUIRE(mmu_lock)
>
> read array[idx]
>
> RELEASE(mmu_lock)
>
> Then I think idx can't be speculated, considering we've passed a RELEASE +
> ACQUIRE?

I don't think memory barriers have anything to do with speculation;
they are architectural.
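
For illustration, here is a sketch (untested, and only an assumption about
where the clamp would go) of the vhost_get_avail_head() fast path with the
index clamped at the point of use; the spinlock around the access does not
stop the CPU from speculating the dependent load:

	map = vq->maps[VHOST_ADDR_AVAIL];
	if (likely(map)) {
		avail = map->addr;
		/* idx came from the guest; clamp it so a speculated
		 * out-of-range value cannot index past the ring.
		 */
		*head = avail->ring[array_index_nospec(idx & (vq->num - 1),
						       vq->num)];
		vhost_vq_access_map_end(vq);
		return 0;
	}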

>
> >
> >
> > > + vhost_vq_access_map_end(vq);
> > > + return 0;
> > > + }
> > > +
> > > + vhost_vq_access_map_end(vq);
> > > + }
> > > +#endif
> > > +
> > > return vhost_get_avail(vq, *head,
> > > &vq->avail->ring[idx & (vq->num - 1)]);
> > > }
> > > @@ -994,24 +1420,98 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> > > static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > > __virtio16 *flags)
> > > {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + struct vhost_map *map;
> > > + struct vring_avail *avail;
> > > +
> > > + if (!vq->iotlb) {
> > > + vhost_vq_access_map_begin(vq);
> > > +
> > > + map = vq->maps[VHOST_ADDR_AVAIL];
> > > + if (likely(map)) {
> > > + avail = map->addr;
> > > + *flags = avail->flags;
> > > + vhost_vq_access_map_end(vq);
> > > + return 0;
> > > + }
> > > +
> > > + vhost_vq_access_map_end(vq);
> > > + }
> > > +#endif
> > > +
> > > return vhost_get_avail(vq, *flags, &vq->avail->flags);
> > > }
> > > static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> > > __virtio16 *event)
> > > {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + struct vhost_map *map;
> > > + struct vring_avail *avail;
> > > +
> > > + if (!vq->iotlb) {
> > > + vhost_vq_access_map_begin(vq);
> > > + map = vq->maps[VHOST_ADDR_AVAIL];
> > > + if (likely(map)) {
> > > + avail = map->addr;
> > > + *event = (__virtio16)avail->ring[vq->num];
> > > + vhost_vq_access_map_end(vq);
> > > + return 0;
> > > + }
> > > + vhost_vq_access_map_end(vq);
> > > + }
> > > +#endif
> > > +
> > > return vhost_get_avail(vq, *event, vhost_used_event(vq));
> > > }
> > > static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> > > __virtio16 *idx)
> > > {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + struct vhost_map *map;
> > > + struct vring_used *used;
> > > +
> > > + if (!vq->iotlb) {
> > > + vhost_vq_access_map_begin(vq);
> > > +
> > > + map = vq->maps[VHOST_ADDR_USED];
> > > + if (likely(map)) {
> > > + used = map->addr;
> > > + *idx = used->idx;
> > > + vhost_vq_access_map_end(vq);
> > > + return 0;
> > > + }
> > > +
> > > + vhost_vq_access_map_end(vq);
> > > + }
> > > +#endif
> > > +
> > > return vhost_get_used(vq, *idx, &vq->used->idx);
> > > }
> >
> > This seems to be used during init. Why do we bother
> > accelerating this?
>
>
> Ok, I can remove this part in next version.
>
>
> >
> >
> > > static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> > > struct vring_desc *desc, int idx)
> > > {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > + struct vhost_map *map;
> > > + struct vring_desc *d;
> > > +
> > > + if (!vq->iotlb) {
> > > + vhost_vq_access_map_begin(vq);
> > > +
> > > + map = vq->maps[VHOST_ADDR_DESC];
> > > + if (likely(map)) {
> > > + d = map->addr;
> > > + *desc = *(d + idx);
> >
> > Since idx can be speculated, I guess we need array_index_nospec here?
>
>
> This is similar to the above avail idx case.
>
>
> >
> >
> > > + vhost_vq_access_map_end(vq);
> > > + return 0;
> > > + }
> > > +
> > > + vhost_vq_access_map_end(vq);
> > > + }
> > > +#endif
> > > +
> > > return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
> > > }
> > I also wonder about the userspace address we get eventually.
> > It would seem that we need to prevent that from being speculated -
> > and that seems like a good idea even if this patch isn't
> > applied. As you are playing with micro-benchmarks, maybe
> > you could try the below patch?
>
>
> Let me test it.
>
> Thanks
>
>
> > It's unfortunately untested.
> > Thanks a lot in advance!
> >
> > ===>
> > vhost: block speculation of translated descriptors
> >
> > iovec addresses coming from vhost are assumed to be
> > pre-validated, but in fact can be speculated to a value
> > out of range.
> >
> > Userspace addresses are later validated with array_index_nospec() so we can
> > be sure kernel info does not leak through these addresses, but vhost
> > must also not leak userspace info outside the allowed memory table to
> > guests.
> >
> > Following the defence-in-depth principle, make sure
> > the address cannot be speculated to a value outside the node range.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> >
> > ---
> >
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 5dc174ac8cac..863e25011ef6 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2072,7 +2076,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> > size = node->size - addr + node->start;
> > _iov->iov_len = min((u64)len - s, size);
> > _iov->iov_base = (void __user *)(unsigned long)
> > - (node->userspace_addr + addr - node->start);
> > + (node->userspace_addr +
> > + array_index_nospec(addr - node->start,
> > + node->size));
> > s += size;
> > addr += size;
> > ++ret;