Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

From: Roger Pau MonnÃ
Date: Thu May 28 2020 - 09:10:08 EST


On Tue, May 19, 2020 at 11:27:50PM +0000, Anchal Agarwal wrote:
> From: Munehisa Kamata <kamatam@xxxxxxxxxx>
>
> S4 power transition states are much different than xen
> suspend/resume. Former is visible to the guest and frontend drivers should
> be aware of the state transitions and should be able to take appropriate
> actions when needed. In transition to S4 we need to make sure that at least
> all the in-flight blkif requests get completed, since they probably contain
> bits of the guest's memory image and that's not going to get saved any
> other way. Hence, re-issuing of in-flight requests as in case of xen resume
> will not work here. This is in contrast to xen-suspend where we need to
> freeze with as little processing as possible to avoid dirtying RAM late in
> the migration cycle and we know that in-flight data can wait.
>
> Add freeze, thaw and restore callbacks for PM suspend and hibernation
> support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
> events, need to implement these xenbus_driver callbacks. The freeze handler
> stops block-layer queue and disconnect the frontend from the backend while
> freeing ring_info and associated resources. Before disconnecting from the
> backend, we need to prevent any new IO from being queued and wait for existing
> IO to complete. Freeze/unfreeze of the queues will guarantee that there are no
> requests in use on the shared ring. However, for sanity we should check
> state of the ring before disconnecting to make sure that there are no
> outstanding requests to be processed on the ring. The restore handler
> re-allocates ring_info, unquiesces and unfreezes the queue and re-connect to
> the backend, so that rest of the kernel can continue to use the block device
> transparently.
>
> Note:For older backends,if a backend doesn't have commit'12ea729645ace'
> xen/blkback: unmap all persistent grants when frontend gets disconnected,
> the frontend may see massive amount of grant table warning when freeing
> resources.
> [ 36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff)
> [ 36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!
>
> In this case, persistent grants would need to be disabled.
>
> [Anchal Changelog: Removed timeout/request during blkfront freeze.
> Reworked the whole patch to work with blk-mq and incorporate upstream's
> comments]

Please tag versions using vX and it would be helpful if you could list
the specific changes that you performed between versions. There where
3 RFC versions IIRC, and there's no log of the changes between them.

>
> Signed-off-by: Anchal Agarwal <anchalag@xxxxxxxxxx>
> Signed-off-by: Munehisa Kamata <kamatam@xxxxxxxxxx>
> ---
> drivers/block/xen-blkfront.c | 122 +++++++++++++++++++++++++++++++++--
> 1 file changed, 115 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 3b889ea950c2..464863ed7093 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -48,6 +48,8 @@
> #include <linux/list.h>
> #include <linux/workqueue.h>
> #include <linux/sched/mm.h>
> +#include <linux/completion.h>
> +#include <linux/delay.h>
>
> #include <xen/xen.h>
> #include <xen/xenbus.h>
> @@ -80,6 +82,8 @@ enum blkif_state {
> BLKIF_STATE_DISCONNECTED,
> BLKIF_STATE_CONNECTED,
> BLKIF_STATE_SUSPENDED,
> + BLKIF_STATE_FREEZING,
> + BLKIF_STATE_FROZEN

Nit: adding a terminating ',' would prevent further additions from
having to modify this line.

> };
>
> struct grant {
> @@ -219,6 +223,7 @@ struct blkfront_info
> struct list_head requests;
> struct bio_list bio_list;
> struct list_head info_list;
> + struct completion wait_backend_disconnected;
> };
>
> static unsigned int nr_minors;
> @@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
> info->sector_size = sector_size;
> info->physical_sector_size = physical_sector_size;
> blkif_set_queue_limits(info);
> + init_completion(&info->wait_backend_disconnected);
>
> return 0;
> }
> @@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
> case XEN_SCSI_DISK5_MAJOR:
> case XEN_SCSI_DISK6_MAJOR:
> case XEN_SCSI_DISK7_MAJOR:
> - *offset = (*minor / PARTS_PER_DISK) +
> + *offset = (*minor / PARTS_PER_DISK) +
> ((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
> EMULATED_SD_DISK_NAME_OFFSET;
> *minor = *minor +
> @@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
> case XEN_SCSI_DISK13_MAJOR:
> case XEN_SCSI_DISK14_MAJOR:
> case XEN_SCSI_DISK15_MAJOR:
> - *offset = (*minor / PARTS_PER_DISK) +
> + *offset = (*minor / PARTS_PER_DISK) +

Unrelated changes, please split to a pre-patch.

> ((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
> EMULATED_SD_DISK_NAME_OFFSET;
> *minor = *minor +
> @@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> unsigned int i;
> struct blkfront_ring_info *rinfo;
>
> + if (info->connected == BLKIF_STATE_FREEZING)
> + goto free_rings;
> /* Prevent new requests being issued until we fix things up. */
> info->connected = suspend ?
> BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
> @@ -1360,6 +1368,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> if (info->rq)
> blk_mq_stop_hw_queues(info->rq);
>
> +free_rings:
> for_each_rinfo(info, rinfo, i)
> blkif_free_ring(rinfo);
>
> @@ -1563,8 +1572,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
> struct blkfront_info *info = rinfo->dev_info;
>
> - if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> - return IRQ_HANDLED;
> + if (unlikely(info->connected != BLKIF_STATE_CONNECTED
> + && info->connected != BLKIF_STATE_FREEZING)){

Extra tab and missing space between '){'. Also my preference would be
for the && to go at the end of the previous line, like it's done
elsewhere in the file.

> + return IRQ_HANDLED;
> + }
>
> spin_lock_irqsave(&rinfo->ring_lock, flags);
> again:
> @@ -2027,6 +2038,7 @@ static int blkif_recover(struct blkfront_info *info)
> unsigned int segs;
> struct blkfront_ring_info *rinfo;
>
> + bool frozen = info->connected == BLKIF_STATE_FROZEN;

Please put this together with the rest of the variable definitions,
and leave the empty line as a split between variable definitions and
code. I've already requested this on RFC v3 but you seem to have
dropped some of the requests I've made there.

> blkfront_gather_backend_features(info);
> /* Reset limits changed by blk_mq_update_nr_hw_queues(). */
> blkif_set_queue_limits(info);
> @@ -2048,6 +2060,9 @@ static int blkif_recover(struct blkfront_info *info)
> kick_pending_request_queues(rinfo);
> }
>
> + if (frozen)
> + return 0;
> +
> list_for_each_entry_safe(req, n, &info->requests, queuelist) {
> /* Requeue pending requests (flush or discard) */
> list_del_init(&req->queuelist);
> @@ -2364,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info)
>
> return;
> case BLKIF_STATE_SUSPENDED:
> + case BLKIF_STATE_FROZEN:
> /*
> * If we are recovering from suspension, we need to wait
> * for the backend to announce it's features before
> @@ -2481,12 +2497,36 @@ static void blkback_changed(struct xenbus_device *dev,
> break;
>
> case XenbusStateClosed:
> - if (dev->state == XenbusStateClosed)
> + if (dev->state == XenbusStateClosed) {
> + if (info->connected == BLKIF_STATE_FREEZING) {
> + blkif_free(info, 0);
> + info->connected = BLKIF_STATE_FROZEN;
> + complete(&info->wait_backend_disconnected);
> + break;

There's no need for the break here, you can rely on the break below.

> + }
> +
> break;
> + }
> +
> + /*
> + * We may somehow receive backend's Closed again while thawing
> + * or restoring and it causes thawing or restoring to fail.
> + * Ignore such unexpected state regardless of the backend state.
> + */
> + if (info->connected == BLKIF_STATE_FROZEN) {

I think you can join this with the previous dev->state == XenbusStateClosed?

Also, won't the device be in the Closed state already if it's in state
frozen?

> + dev_dbg(&dev->dev,
> + "ignore the backend's Closed state: %s",
> + dev->nodename);
> + break;
> + }
> /* fall through */
> case XenbusStateClosing:
> - if (info)
> - blkfront_closing(info);
> + if (info) {
> + if (info->connected == BLKIF_STATE_FREEZING)
> + xenbus_frontend_closed(dev);
> + else
> + blkfront_closing(info);
> + }
> break;
> }
> }
> @@ -2630,6 +2670,71 @@ static void blkif_release(struct gendisk *disk, fmode_t mode)
> mutex_unlock(&blkfront_mutex);
> }
>
> +static int blkfront_freeze(struct xenbus_device *dev)
> +{
> + unsigned int i;
> + struct blkfront_info *info = dev_get_drvdata(&dev->dev);
> + struct blkfront_ring_info *rinfo;
> + /* This would be reasonable timeout as used in xenbus_dev_shutdown() */
> + unsigned int timeout = 5 * HZ;
> + unsigned long flags;
> + int err = 0;
> +
> + info->connected = BLKIF_STATE_FREEZING;
> +
> + blk_mq_freeze_queue(info->rq);
> + blk_mq_quiesce_queue(info->rq);
> +
> + for_each_rinfo(info, rinfo, i) {
> + /* No more gnttab callback work. */
> + gnttab_cancel_free_callback(&rinfo->callback);
> + /* Flush gnttab callback work. Must be done with no locks held. */
> + flush_work(&rinfo->work);
> + }
> +
> + for_each_rinfo(info, rinfo, i) {
> + spin_lock_irqsave(&rinfo->ring_lock, flags);
> + if (RING_FULL(&rinfo->ring)
> + || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {

'||' should go at the end of the previous line.

> + xenbus_dev_error(dev, err, "Hibernation Failed.
> + The ring is still busy");
> + info->connected = BLKIF_STATE_CONNECTED;
> + spin_unlock_irqrestore(&rinfo->ring_lock, flags);

You need to unfreeze the queues here, or else the device will be in a
blocked state AFAICT.

> + return -EBUSY;
> + }
> + spin_unlock_irqrestore(&rinfo->ring_lock, flags);
> + }

This block has indentation all messed up.

> + /* Kick the backend to disconnect */
> + xenbus_switch_state(dev, XenbusStateClosing);
> +
> + /*
> + * We don't want to move forward before the frontend is diconnected
> + * from the backend cleanly.
> + */
> + timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
> + timeout);
> + if (!timeout) {
> + err = -EBUSY;

Note err is only used here, and I think could just be dropped.

> + xenbus_dev_error(dev, err, "Freezing timed out;"
> + "the device may become inconsistent state");

Leaving the device in this state is quite bad, as it's in a closed
state and with the queues frozen. You should make an attempt to
restore things to a working state.

> + }
> +
> + return err;
> +}
> +
> +static int blkfront_restore(struct xenbus_device *dev)
> +{
> + struct blkfront_info *info = dev_get_drvdata(&dev->dev);
> + int err = 0;
> +
> + err = talk_to_blkback(dev, info);
> + blk_mq_unquiesce_queue(info->rq);
> + blk_mq_unfreeze_queue(info->rq);
> + if (!err)
> + blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);

Bad indentation. Also shouldn't you first update the queues and then
unfreeze them?

Thanks, Roger.