RE: [Patch v3 2/3] Drivers: hv: add Azure Blob driver

From: Michael Kelley
Date: Fri Jul 16 2021 - 17:07:08 EST


From: Long Li <longli@xxxxxxxxxxxxx> Sent: Friday, July 16, 2021 12:27 PM

[snip]

> > > +
> > > +static int az_blob_connect_to_vsp(struct hv_device *device, u32
> > > +ring_size) {
> > > + int ret;
> > > +
> > > + spin_lock_init(&az_blob_dev.file_lock);
> > > + INIT_LIST_HEAD(&az_blob_dev.file_list);
> > > + init_waitqueue_head(&az_blob_dev.file_wait);
> > > +
> > > + az_blob_dev.device = device;
> > > + device->channel->rqstor_size = AZ_BLOB_QUEUE_DEPTH;
> > > +
> > > + ret = vmbus_open(device->channel, ring_size, ring_size, NULL, 0,
> > > + az_blob_on_channel_callback, device->channel);
> > > +
> > > + if (ret)
> > > + return ret;
> > > +
> > > + hv_set_drvdata(device, &az_blob_dev);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > +static void az_blob_remove_vmbus(struct hv_device *device) {
> > > + /* At this point, no VSC/VSP traffic is possible over vmbus */
> > > + hv_set_drvdata(device, NULL);
> > > + vmbus_close(device->channel);
> > > +}
> > > +
> > > +static int az_blob_probe(struct hv_device *device,
> > > + const struct hv_vmbus_device_id *dev_id) {
> > > + int ret;
> > > +
> > > + ret = az_blob_connect_to_vsp(device, AZ_BLOB_RING_SIZE);
> > > + if (ret) {
> > > + az_blob_err("error connecting to VSP ret=%d\n", ret);
> > > + return ret;
> > > + }
> > > +
> > > + // create user-mode client library facing device
> > > + ret = az_blob_create_device(&az_blob_dev);
> > > + if (ret) {
> > > + az_blob_err("failed to create device ret=%d\n", ret);
> > > + az_blob_remove_vmbus(device);
> > > + return ret;
> > > + }
> > > +
> > > + az_blob_dev.removing = false;
> >
> > This line seems misplaced. As soon as az_blob_create_device() returns,
> > some other thread could find the device and open it. You don't want the
> > az_blob_fop_open() function to find the "removing"
> > flag set to true. So I think this line should go *before* the call to
> > az_blob_create_device().
> >
> > > + az_blob_info("successfully probed device\n");
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int az_blob_remove(struct hv_device *dev) {
> > > + az_blob_dev.removing = true;
> > > + /*
> > > + * RCU lock guarantees that any future calls to az_blob_fop_open()
> > > + * can not use device resources while the inode reference of
> > > + * /dev/azure_blob is being held for that open, and device file is
> > > + * being removed from /dev.
> > > + */
> > > + synchronize_rcu();
> >
> > I don't think this works as you have described. While it will ensure that any
> > in-progress RCU read-side critical sections have completed (i.e., in
> > az_blob_fop_open() ), it does not prevent new read-side critical sections
> > from being started. So as soon as synchronize_rcu() is finished, another
> > thread could find and open the device, and be executing in
> > az_blob_fop_open().
> >
> > But it's not clear to me that this (and the rcu_read_locks in
> > az_blob_fop_open) are really needed anyway. If az_blob_remove_device()
> > is called while one or more threads have it open, I think that's OK. Or is there
> > a scenario that I'm missing?
>
> I was trying to address your comment earlier. Here were your comments (1 - 7):
>
> 1) The driver is unbound, so az_blob_remove() is called.
> 2) az_blob_remove() sets the "removing" flag to true, and calls az_blob_remove_device().
> 3) az_blob_remove_device() waits for the file_list to become empty.
> 4) After the file_list becomes empty, but before misc_deregister() is called, a separate thread opens the device again.
> 5) In the separate thread, az_blob_fop_open() obtains the file_lock spin lock.
> 6) Before az_blob_fop_open() releases the spin lock, az
> block_remove_device() completes, along with az_blob_remove().
> 7) Then the device gets rebound, and az_blob_connect_to_vsp() gets called, all while az_blob_fop_open() still holds the
> spin lock. So the spin lock get re-initialized while it is held.
>
> Between step 4 and step 5, I don't see any guarantee that az_blob_fop_open() can't run concurrently on another CPU
> after misc_deregister() finishes. misc_deregister() calls devtmpfs_delete_node() to remove the device file from /dev/*,
> but it doesn't check the return value, so the inode reference number can be non-zero after it returns, somebody may still
> try to open it.

I'm no expert here, but once misc_deregister() finishes, I would expect that
any attempt to open the device with the assigned major:minor number will
fail. However, existing opens may still be active. So another thread could still
have it open, but no new opens can be done. As coded in this version of your
patch, az_blob_remove() waits until all those existing opens have been closed,
so everything is clean once the wait is complete. Of course, while waiting
here, the threads with the open fd could try to start another operation
against the VSP, but the "removing" flag will prevent that. So the only
access these threads will have is to the singleton az_blob_dev instance
and the list of open files.

But as noted previously, waiting here for the opens to be closed is
problematic because the wait could be arbitrarily long. And there's
messiness if the device gets re-added later, because there's no
distinction between the old instantiation that a thread might still have
open vs. the new instantiation that you want to initialize fresh. That's
where creating a new dynamic instance will solve any problems.

> This check guarantees that the code can't reference any driver's internal data structures.
> az_blob_dev.removing is set so this code can't be entered. Resetting it after az_blob_create_device() is also for this
> purpose.
>
> >
> > > +
> > > + az_blob_info("removing device\n");
> > > + az_blob_remove_device();
> > > +
> > > + /*
> > > + * At this point, it's not possible to open more files.
> > > + * Wait for all the opened files to be released.
> > > + */
> > > + wait_event(az_blob_dev.file_wait,
> > > +list_empty(&az_blob_dev.file_list));
> >
> > As mentioned in my most recent comments on the previous version of the
> > code, I'm concerned about waiting for all open files to be released in the case
> > of a VMbus rescind. We definitely have to wait for all VSP operations to
> > complete, but that's different from waiting for the files to be closed. The
> > former depends on Hyper-V being well-behaved and will presumably
> > happen quickly in the case of a rescind. But the latter depends entirely on
> > user space code that is out of the kernel's control. The user space process
> > could hang around for hours or days with the file still open, which would
> > block this function from completing, and hence block the global work queue.
> >
> > Thinking about this, the core issue may be that having a single static instance
> > of az_blob_device is problematic if we allow the device to be removed
> > (either explicitly with an unbind, or implicitly with a VMbus
> > rescind) and then re-added. Perhaps what needs to happen is that the
> > removed device is allowed to continue to exist as long as user space
> > processes have an open file handle, but they can't perform any operations
> > because the "removing" flag is set (and stays set).
> > If the device is re-added, then a new instance of az_blob_device should be
> > created, and whether or not the old instance is still hanging around is
> > irrelevant.
>
> I agree dynamic device objects is the way to go. Will implement this.
>
> >
> > > +
> > > + az_blob_remove_vmbus(dev);
> > > + return 0;
> > > +}
> > > +
> > > +static struct hv_driver az_blob_drv = {
> > > + .name = KBUILD_MODNAME,
> > > + .id_table = id_table,
> > > + .probe = az_blob_probe,
> > > + .remove = az_blob_remove,
> > > + .driver = {
> > > + .probe_type = PROBE_PREFER_ASYNCHRONOUS,
> > > + },
> > > +};
> > > +
> > > +static int __init az_blob_drv_init(void) {
> > > + return vmbus_driver_register(&az_blob_drv);
> > > +}
> > > +
> > > +static void __exit az_blob_drv_exit(void) {
> > > + vmbus_driver_unregister(&az_blob_drv);
> > > +}
> > > +
> > > +MODULE_LICENSE("Dual BSD/GPL");
> > > +MODULE_DESCRIPTION("Microsoft Azure Blob driver");
> > > +module_init(az_blob_drv_init); module_exit(az_blob_drv_exit);
> > > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> > > index 705e95d..3095611 100644
> > > --- a/drivers/hv/channel_mgmt.c
> > > +++ b/drivers/hv/channel_mgmt.c
> > > @@ -154,6 +154,13 @@
> > > .allowed_in_isolated = false,
> > > },
> > >
> > > + /* Azure Blob */
> > > + { .dev_type = HV_AZURE_BLOB,
> > > + HV_AZURE_BLOB_GUID,
> > > + .perf_device = false,
> > > + .allowed_in_isolated = false,
> > > + },
> > > +
> > > /* Unknown GUID */
> > > { .dev_type = HV_UNKNOWN,
> > > .perf_device = false,
> > > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index
> > > d1e59db..ac31362 100644
> > > --- a/include/linux/hyperv.h
> > > +++ b/include/linux/hyperv.h
> > > @@ -772,6 +772,7 @@ enum vmbus_device_type {
> > > HV_FCOPY,
> > > HV_BACKUP,
> > > HV_DM,
> > > + HV_AZURE_BLOB,
> > > HV_UNKNOWN,
> > > };
> > >
> > > @@ -1350,6 +1351,14 @@ int vmbus_allocate_mmio(struct resource
> > **new, struct hv_device *device_obj,
> > > 0x72, 0xe2, 0xff, 0xb1, 0xdc, 0x7f)
> > >
> > > /*
> > > + * Azure Blob GUID
> > > + * {0590b792-db64-45cc-81db-b8d70c577c9e}
> > > + */
> > > +#define HV_AZURE_BLOB_GUID \
> > > + .guid = GUID_INIT(0x0590b792, 0xdb64, 0x45cc, 0x81, 0xdb, \
> > > + 0xb8, 0xd7, 0x0c, 0x57, 0x7c, 0x9e)
> > > +
> > > +/*
> > > * Shutdown GUID
> > > * {0e0b6031-5213-4934-818b-38d90ced39db}
> > > */
> > > diff --git a/include/uapi/misc/azure_blob.h
> > > b/include/uapi/misc/azure_blob.h new file mode 100644 index
> > > 0000000..f4168679
> > > --- /dev/null
> > > +++ b/include/uapi/misc/azure_blob.h
> > > @@ -0,0 +1,34 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0-only WITH
> > > +Linux-syscall-note */
> > > +/* Copyright (c) Microsoft Corporation. */
> > > +
> > > +#ifndef _AZ_BLOB_H
> > > +#define _AZ_BLOB_H
> > > +
> > > +#include <linux/kernel.h>
> > > +#include <linux/uuid.h>
> > > +
> > > +/* user-mode sync request sent through ioctl */ struct
> > > +az_blob_request_sync_response {
> > > + __u32 status;
> > > + __u32 response_len;
> > > +};
> > > +
> > > +struct az_blob_request_sync {
> > > + guid_t guid;
> > > + __u32 timeout;
> > > + __u32 request_len;
> > > + __u32 response_len;
> > > + __u32 data_len;
> > > + __u32 data_valid;
> > > + __aligned_u64 request_buffer;
> >
> > Is there an implied 32-bit padding field before "request_buffer"?
> > It seems like "yes", since there are five 32-bit field preceding.
> > Would it be better to explicitly list that padding field?
> >
> > > + __aligned_u64 response_buffer;
> > > + __aligned_u64 data_buffer;
> > > + struct az_blob_request_sync_response response; };
> > > +
> > > +#define AZ_BLOB_MAGIC_NUMBER 'R'
> > > +#define IOCTL_AZ_BLOB_DRIVER_USER_REQUEST \
> > > + _IOWR(AZ_BLOB_MAGIC_NUMBER, 0xf0, \
> > > + struct az_blob_request_sync)
> > > +
> > > +#endif /* define _AZ_BLOB_H */
> > > --
> > > 1.8.3.1