Re: [PATCH v2] virtio_pmem: Check device status before requesting flush
From: Philip Chen
Date: Wed Aug 21 2024 - 17:31:24 EST
Hi
On Wed, Aug 21, 2024 at 1:37 PM Ira Weiny <ira.weiny@xxxxxxxxx> wrote:
>
> Philip Chen wrote:
> > Hi,
> >
> > On Tue, Aug 20, 2024 at 1:01 PM Dave Jiang <dave.jiang@xxxxxxxxx> wrote:
> > >
> > >
> > >
> > > On 8/20/24 10:22 AM, Philip Chen wrote:
> > > > If a pmem device is in a bad status, the driver side could wait for
> > > > host ack forever in virtio_pmem_flush(), causing the system to hang.
> > > >
> > > > So add a status check in the beginning of virtio_pmem_flush() to return
> > > > early if the device is not activated.
> > > >
> > > > Signed-off-by: Philip Chen <philipchen@xxxxxxxxxxxx>
> > > > ---
> > > >
> > > > v2:
> > > > - Remove change id from the patch description
> > > > - Add more details to the patch description
> > > >
> > > > drivers/nvdimm/nd_virtio.c | 9 +++++++++
> > > > 1 file changed, 9 insertions(+)
> > > >
> > > > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > > > index 35c8fbbba10e..97addba06539 100644
> > > > --- a/drivers/nvdimm/nd_virtio.c
> > > > +++ b/drivers/nvdimm/nd_virtio.c
> > > > @@ -44,6 +44,15 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
> > > > unsigned long flags;
> > > > int err, err1;
> > > >
> > > > + /*
> > > > + * Don't bother to submit the request to the device if the device is
> > > > + * not acticated.
> > >
> > > s/acticated/activated/
> >
> > Thanks for the review.
> > I'll fix this typo in v3.
> >
> > In addition to this typo, does anyone have any other concerns?
>
> I'm not super familiar with the virtio-pmem workings and the needs reset
> flag is barely used.
>
> Did you actually experience this hang? How was this found? What is the
> user visible issue and how critical is it?
Yes, I experienced the problem when trying to enable hibernation for a VM.
In the typical hibernation flow, the kernel would try to:
(1) freeze the processes
(2) freeze the devices
(3) take a snapshot of the memory
(4) thaw the devices
(5) write the snapshot to the storage
(6) power off the system (or perform platform-specific action)
In my case, I see VMM fail to re-activate the virtio_pmem device in (4).
(And therefore the virtio_pmem device side sets VIRTIO_CONFIG_S_NEEDS_RESET.)
As a result, when the kernel tries to power off the virtio_pmem device
in (6), the system would hang in virtio_pmem_flush() if this patch is
not added.
To fix the root cause of this issue, I sent another patch to add
freeze/restore PM callbacks to the virtio_pmem driver:
https://lore.kernel.org/lkml/20240815004617.2325269-1-philipchen@xxxxxxxxxxxx/
(Please also help review that patch.)
However, I think this patch is still helpful since the system
shouldn't hang in virtio_pmem_flush() regardless of the device state.
>
> Thanks,
> Ira
>
> >
> > >
> > > > + */
> > > > + if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_NEEDS_RESET) {
> > > > + dev_info(&vdev->dev, "virtio pmem device needs a reset\n");
> > > > + return -EIO;
> > > > + }
> > > > +
> > > > might_sleep();
> > > > req_data = kmalloc(sizeof(*req_data), GFP_KERNEL);
> > > > if (!req_data)
>
>