Re: [PATCH 3/3] caif_virtio: fix the race between reset and netdev unregister

From: Michael S. Tsirkin
Date: Mon Jun 20 2022 - 06:19:20 EST


On Mon, Jun 20, 2022 at 05:18:29PM +0800, Jason Wang wrote:
> On Mon, Jun 20, 2022 at 5:09 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> >
> > On Mon, Jun 20, 2022 at 01:11:15PM +0800, Jason Wang wrote:
> > > We used to do the following steps during .remove():
> >
> > We currently do
> >
> >
> > > static void cfv_remove(struct virtio_device *vdev)
> > > {
> > > 	struct cfv_info *cfv = vdev->priv;
> > >
> > > 	rtnl_lock();
> > > 	dev_close(cfv->ndev);
> > > 	rtnl_unlock();
> > >
> > > 	tasklet_kill(&cfv->tx_release_tasklet);
> > > 	debugfs_remove_recursive(cfv->debugfs);
> > >
> > > 	vringh_kiov_cleanup(&cfv->ctx.riov);
> > > 	virtio_reset_device(vdev);
> > > 	vdev->vringh_config->del_vrhs(cfv->vdev);
> > > 	cfv->vr_rx = NULL;
> > > 	vdev->config->del_vqs(cfv->vdev);
> > > 	unregister_netdev(cfv->ndev);
> > > }
> > >
> > > This is racy since the device could be re-opened after dev_close() but
> > > before unregister_netdevice():
> > >
> > > 1) The RX vringh is cleaned up before the device is reset, so RX
> > > callbacks that are called after vringh_kiov_cleanup() will result in
> > > a UAF
> > > 2) The network stack can still try to use the TX virtqueue even after
> > > it has been deleted by del_vqs()
> > >
> > > Fix this by unregistering the network device first to make sure there
> > > is no device access from either the TX or the RX side.
> > >
> > > Fixes: 0d2e1a2926b18 ("caif_virtio: Introduce caif over virtio")
> > > Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx>
> > > ---
> > > drivers/net/caif/caif_virtio.c | 6 ++----
> > > 1 file changed, 2 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/caif/caif_virtio.c b/drivers/net/caif/caif_virtio.c
> > > index 66375bea2fcd..a29f9b2df5b1 100644
> > > --- a/drivers/net/caif/caif_virtio.c
> > > +++ b/drivers/net/caif/caif_virtio.c
> > > @@ -752,9 +752,8 @@ static void cfv_remove(struct virtio_device *vdev)
> > > {
> > > struct cfv_info *cfv = vdev->priv;
> > >
> > > - rtnl_lock();
> > > - dev_close(cfv->ndev);
> > > - rtnl_unlock();
> > > + /* Make sure NAPI/TX won't try to access the device */
> > > + unregister_netdev(cfv->ndev);
> > >
> > > tasklet_kill(&cfv->tx_release_tasklet);
> > > debugfs_remove_recursive(cfv->debugfs);
> > > @@ -764,7 +763,6 @@ static void cfv_remove(struct virtio_device *vdev)
> > > vdev->vringh_config->del_vrhs(cfv->vdev);
> > > cfv->vr_rx = NULL;
> > > vdev->config->del_vqs(cfv->vdev);
> > > - unregister_netdev(cfv->ndev);
> > > }
> >
> >
> > This gives me pause: callbacks can now trigger after the device
> > has been unregistered. Are we sure this is safe?
>
> It looks safe: for RX, NAPI is disabled. For TX, the tasklet is disabled
> after tasklet_kill(). I can add a comment to explain this.

tasklet_kill() waits for outstanding tasklets, but does it really
prevent future ones from being scheduled?
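
To make the concern concrete, here is a toy userspace model. It is not
kernel code and not the caif_virtio driver; every toy_* name is made up
for illustration. It assumes that a kill operation only drains work that
is already scheduled and sets no barrier against later scheduling, so a
completion callback that still fires afterwards can queue the tasklet
again.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a tasklet: a "scheduled" flag and a run counter. */
struct toy_tasklet {
	bool scheduled;
	int runs;
};

/* Toy stand-in for the driver state (hypothetical, for illustration). */
struct toy_dev {
	struct toy_tasklet tx_release;
	bool callbacks_live;	/* can the device still raise callbacks? */
};

static void toy_schedule(struct toy_tasklet *t)
{
	t->scheduled = true;
}

/* Run the handler once if it is pending. */
static void toy_run_pending(struct toy_tasklet *t)
{
	if (t->scheduled) {
		t->scheduled = false;
		t->runs++;
	}
}

/* Assumed "kill" semantics: drain pending work, block nothing later. */
static void toy_kill(struct toy_tasklet *t)
{
	toy_run_pending(t);
	assert(!t->scheduled);	/* nothing outstanding at this point */
}

/* Models a TX completion callback that schedules the tasklet. */
static void toy_tx_callback(struct toy_dev *d)
{
	if (d->callbacks_live)
		toy_schedule(&d->tx_release);
}

/* Returns how many times the handler ran after the kill. */
int toy_runs_after_kill(bool quiesce_first)
{
	struct toy_dev d = { .callbacks_live = true };
	int runs_at_kill;

	toy_tx_callback(&d);		/* outstanding work before kill */
	if (quiesce_first)
		d.callbacks_live = false;
	toy_kill(&d.tx_release);
	runs_at_kill = d.tx_release.runs;

	toy_tx_callback(&d);		/* late callback after kill */
	toy_run_pending(&d.tx_release);

	return d.tx_release.runs - runs_at_kill;
}
```

In this model toy_runs_after_kill(false) returns 1, i.e. the handler
runs again after the kill, while toy_runs_after_kill(true) returns 0
because the callback source was silenced first. Whether the real driver
still has such a late callback path at that point is the open question.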

> > Won't it be safer to just keep the rtnl_lock around
> > the whole process?
>
> It looks to me that rtnl_lock can't help in synchronizing with the
> callbacks, anything I missed?
>
> Thanks

Good point.


> >
> > > static struct virtio_device_id id_table[] = {
> > > --
> > > 2.25.1
> >