Re: [PATCH] virtio_ring: Fix the stale index in available ring

From: Michael S. Tsirkin
Date: Wed Mar 20 2024 - 03:15:11 EST


On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> On 3/20/24 10:49, Michael S. Tsirkin wrote:>
> > I think you are wasting the time with these tests. Even if it helps what
> > does this tell us? Try setting a flag as I suggested elsewhere.
> > Then check it in vhost.
> > Or here's another idea - possibly easier. Copy the high bits from index
> > into ring itself. Then vhost can check that head is synchronized with
> > index.
> >
> > Warning: completely untested, not even compiled. But should give you
> > the idea. If this works btw we should consider making this official in
> > the spec.
> >
> >
> > static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 6f7e5010a673..79456706d0bd 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > /* Put entry in available array (but don't update avail->idx until they
> > * do sync). */
> > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > - vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > + u16 headwithflag = head | (q->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> > + vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
> > /* Descriptors and available array need to be set before we expose the
> > * new available array entries. */
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..bd8f7c763caa 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> > static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> > __virtio16 *head, int idx)
> > {
> > - return vhost_get_avail(vq, *head,
> > + unsigned i = idx;
> > + unsigned flag = i & ~(vq->num - 1);
> > + unsigned val = vhost_get_avail(vq, *head,
> > &vq->avail->ring[idx & (vq->num - 1)]);
> > + unsigned valflag = val & ~(vq->num - 1);
> > +
> > + WARN_ON(valflag != flag);
> > +
> > + return val & (vq->num - 1);
> > }
>
> Thanks, Michael. The code is already self-explanatory.

Apparently not. See below.

> Since vq->num is 256, I just
> squeezed the last_avail_idx to the high byte. Unfortunately, I'm unable to hit
> the WARN_ON(). Does it mean the low byte is stale (or corrupted) while the high
> byte is still correct and valid?


I would find this very surprising.

> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] =
> cpu_to_virtio16(_vq->vdev, head | (avail << 8));
>
>
> head = vhost16_to_cpu(vq, ring_head);
> WARN_ON((head >> 8) != (vq->last_avail_idx % vq->num));
> head = head & 0xff;

This code misses the point of the test.
The high value you store now is exactly the same each time you
go around the ring. E.g. at beginning of ring you now always
store 0 as high byte. So a stale value will not be detected/
The high value you store now is exactly the same each time you
go around the ring. E.g. at beginning of ring you now always
store 0 as high byte. So a stale value will not be detected.

The value you are interested in should change
each time you go around the ring a full circle.
Thus you want exactly the *high byte* of avail idx -
this is what my patch did - your patch instead
stored and compared the low byte.


The advantage of this debugging patch is that it will detect the issue immediately
not after guest detected the problem in the used ring.
For example, you can add code to re-read the value, or dump the whole
ring.

> One question: Does QEMU has any chance writing data to the available queue when
> vhost is enabled? My previous understanding is no, the queue is totally owned by
> vhost instead of QEMU.

It shouldn't do it normally.

> Before this patch was posted, I had debugging code to record last 16 transactions
> to the available and used queue from guest and host side. It did reveal the wrong
> head was fetched from the available queue.

Oh nice that's a very good hint. And is this still reproducible?

> [ 11.785745] ================ virtqueue_get_buf_ctx_split ================
> [ 11.786238] virtio_net virtio0: output.0:id 74 is not a head!
> [ 11.786655] head to be released: 036 077
> [ 11.786952]
> [ 11.786952] avail_idx:
> [ 11.787234] 000 63985 <--
> [ 11.787237] 001 63986
> [ 11.787444] 002 63987
> [ 11.787632] 003 63988
> [ 11.787821] 004 63989
> [ 11.788006] 005 63990
> [ 11.788194] 006 63991
> [ 11.788381] 007 63992
> [ 11.788567] 008 63993
> [ 11.788772] 009 63994
> [ 11.788957] 010 63995
> [ 11.789141] 011 63996
> [ 11.789327] 012 63997
> [ 11.789515] 013 63998
> [ 11.789701] 014 63999
> [ 11.789886] 015 64000
> [ 11.790068]
> [ 11.790068] avail_head:
> [ 11.790529] 000 075 <--
> [ 11.790718] 001 036
> [ 11.790890] 002 077
> [ 11.791061] 003 129
> [ 11.791231] 004 072
> [ 11.791400] 005 130
> [ 11.791574] 006 015
> [ 11.791748] 007 074
> [ 11.791918] 008 130
> [ 11.792094] 009 130
> [ 11.792263] 010 074
> [ 11.792437] 011 015
> [ 11.792617] 012 072
> [ 11.792788] 013 129
> [ 11.792961] 014 077 // The last two heads from guest to host: 077, 036
> [ 11.793134] 015 036

Maybe dump the avail ring from guest to make sure
it matches the expected contents?

> [root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost
>
> avail_idx
> 000 63998
> 001 64000
> 002 63954 <---
> 003 63955
> 004 63956
> 005 63974
> 006 63981
> 007 63984
> 008 63986
> 009 63987
> 010 63988
> 011 63989
> 012 63992
> 013 63993
> 014 63995
> 015 63997
>
> avail_head
> 000 074
> 001 015
> 002 072
> 003 129
> 004 074 // The last two heads seen by vhost is: 074, 036
> 005 036
> 006 075 <---


And is 074 the previous (stale) value in the ring?


> 007 036
> 008 077
> 009 129
> 010 072
> 011 130
> 012 015
> 013 074
> 014 130
> 015 130



> used_idx
> 000 64000
> 001 63882 <---
> 002 63889
> 003 63891
> 004 63898
> 005 63936
> 006 63942
> 007 63946
> 008 63949
> 009 63953
> 010 63957
> 011 63981
> 012 63990
> 013 63992
> 014 63993
> 015 63999
>
> used_head
> 000 072
> 001 129
> 002 074 // The last two heads published to guest is: 074, 036
> 003 036
> 004 075 <---
> 005 036
> 006 077
> 007 129
> 008 072
> 009 130
> 010 015
> 011 074
> 012 130
> 013 130
> 014 074
> 015 015
>
> Thanks,
> Gavin

I like this debugging patch, it might make sense to
polish it up and include in production. We'll want it in
debugfs naturally not /proc/vhost.

But all in good time.


>
>