Re: [PATCH 1/1] virtio/s390: fix vritio-ccw device teardown

From: Halil Pasic
Date: Mon Sep 20 2021 - 23:30:44 EST


On Mon, 20 Sep 2021 12:07:23 +0200
Cornelia Huck <cohuck@xxxxxxxxxx> wrote:

> On Mon, Sep 20 2021, Vineeth Vijayan <vneethv@xxxxxxxxxxxxx> wrote:
>
> > On Mon, 2021-09-20 at 00:39 +0200, Halil Pasic wrote:
> >> On Fri, 17 Sep 2021 10:40:20 +0200
> >> Cornelia Huck <cohuck@xxxxxxxxxx> wrote:
> >>
> > ...snip...
> >> > >
> >> > > Thanks, if I find time for it, I will try to understand this
> >> > > better and
> >> > > come back with my findings.
> >> > >
> >> > > > > * Can virtio_ccw_remove() get called while !cdev->online and
> >> > > > > virtio_ccw_online() is running on a different cpu? If yes,
> >> > > > > what would
> >> > > > > happen then?
> >> > > >
> >> > > > All of the remove/online/... etc. callbacks are invoked via the
> >> > > > ccw bus
> >> > > > code. We have to trust that it gets it correct :) (Or have the
> >> > > > common
> >> > > > I/O layer maintainers double-check it.)
> >> > > >
> >> > >
> >> > > Vineeth, what is your take on this? Are the struct ccw_driver
> >> > > virtio_ccw_remove and the virtio_ccw_online callbacks mutually
> >> > > exclusive. Please notice that we may initiate the onlining by
> >> > > calling ccw_device_set_online() from a workqueue.
> >> > >
> >> > > @Conny: I'm not sure what is your definition of 'it gets it
> >> > > correct'...
> >> > > I doubt CIO can make things 100% foolproof in this area.
> >> >
> >> > Not 100% foolproof, but "don't online a device that is in the
> >> > progress
> >> > of going away" seems pretty basic to me.
> >> >
> >>
> >> I hope Vineeth will chime in on this.
> > Considering the online/offline processing,
> > The ccw_device_set_offline function or the online/offline is handled
> > inside device_lock. Also, the online_store function takes care of
> > avoiding multiple online/offline processing.
> >
> > Now, when we consider the unconditional remove of the device,
> > I am not familiar with the virtio_ccw driver. My assumptions are based
> > on how CIO/dasd drivers works. If i understand correctly, the dasd
> > driver sets different flags to make sure that a device_open is getting
> > prevented while the the device is in progress of offline-ing.
>
> Hm, if we are invoking the online/offline callbacks under the device
> lock already,

I believe we have a misunderstanding here. I believe that Vineeth is
trying to tell us, that online_store_handle_offline() and
online_store_handle_offline() are called under the a device lock of
the ccw device. Right, Vineeth?

Conny, I believe, by online/offline callbacks, you mean
virtio_ccw_online() and virtio_ccw_offline(), right?

But the thing is that virtio_ccw_online() may get called (and is
typically called, AFAICT) with no locks held via:
virtio_ccw_probe() --> async_schedule(virtio_ccw_auto_online, cdev)
-*-> virtio_ccw_auto_online(cdev) --> ccw_device_set_online(cdev) -->
virtio_ccw_online()

Furthermore after a closer look, I believe because we don't take
a reference to the cdev in probe, we may get virtio_ccw_auto_online()
called with an invalid pointer (the pointer is guaranteed to be valid
in probe, but because of async we have no guarantee that it will be
called in the context of probe).

Shouldn't we take a reference to the cdev in probe? BTW what is the
reason for the async?


> how would that affect the remove callback?

I believe dev->bus->remove(dev) is called by
bus_remove_device() with the device lock held. I.e. I believe that means
that virtio_ccw_remove() is called with the ccw devices device lock
held. Vineeth can you confirm that?


The thing is, both virtio_ccw_remove() and virtio_ccw_offline() are
very similar, with the notable exception that offline assumes we are
online() at the moment, while remove() does the same only if it
decides based on vcdev && cdev->online that we are online.


> Shouldn't they
> be serialized under the device lock already? I think we are fine.

AFAICT virtio_ccw_remove() and virtio_ccw_offline() are serialized
against each other under the device lock. And also against
virtio_ccw_online() iff it was initiated via the sysfs, and not via
the auto-online mechanism.

Thus I don't think we are fine at the moment.

>
> For dasd, I think they also need to deal with the block device
> lifetimes. For virtio-ccw, we are basically a transport that does not
> know about devices further down the chain (that are associated with the
> virtio device, whose lifetime is tied to online/offline processing.) I'd
> presume that the serialization above would be enough.
>

I don't know about dasd that much. For the reasons stated above, I don't
think the serialization we have right now is entirely sufficient.

Regards,
Halil