RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
From: Parav Pandit
Date: Tue Mar 26 2019 - 23:20:17 EST
Hi Alex,
> -----Original Message-----
> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Tuesday, March 26, 2019 10:27 AM
> To: Kirti Wankhede <kwankhede@xxxxxxxxxx>
> Cc: Parav Pandit <parav@xxxxxxxxxxxx>; kvm@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; Neo Jia <cjia@xxxxxxxxxx>
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
>
> On Tue, 26 Mar 2019 12:36:22 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>
> > On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > > There are five problems with current code structure.
> > > 1. mdev device is placed on the mdev bus before it is created in the
> > > vendor driver. Once a device is placed on the mdev bus without
> > > creating its supporting underlying vendor device, an open() can get
> > > triggered by userspace on partially initialized device.
> > > Below ladder diagram highlight it.
> > >
> > > cpu-0 cpu-1
> > > ----- -----
> > > create_store()
> > > mdev_create_device()
> > > device_register()
> > > ...
> > > vfio_mdev_probe()
> > > ...creates char device
> > > vfio_mdev_open()
> > > parent->ops->open(mdev)
> > > vfio_ap_mdev_open()
> > > matrix_mdev = NULL
> > > [...]
> > > parent->ops->create()
> > > vfio_ap_mdev_create()
> > > mdev_set_drvdata(mdev, matrix_mdev);
> > > /* Valid pointer set above */
> > >
> >
> > VFIO interface uses sysfs path of device or PCI device's BDF where it
> > checks sysfs file for that device exist.
> > In case of VFIO mdev device, above situation will never happen as open
> > will only get called if sysfs entry for that device exist.
> >
> > If you don't use VFIO interface then this situation can arise. In that
> > case probe() can be used for very basic initialization then create
> > actual char device from create().
> >
> >
> > > 2. Current creation sequence is,
> > > parent->ops_create()
> > > groups_register()
> > >
> > > Remove sequence is,
> > > parent->ops->remove()
> > > groups_unregister()
> > > However, remove sequence should be exact mirror of creation sequence.
> > > Once this is achieved, all users of the mdev will be terminated
> > > first before removing underlying vendor device.
> > > (Follow standard linux driver model).
> > > At that point vendor's remove() ops shouldn't failed because device
> > > is taken off the bus that should terminate the users.
> > >
> >
> > If VMM or user space application is using mdev device,
> > parent->ops->remove() can return failure. In that case sysfs files
> > shouldn't be removed. Hence above sequence is followed for remove.
> >
> > Standard linux driver model doesn't allow remove() to fail, but in of
> > mdev framework, interface is defined to handle such error case.
> >
> >
> > > 3. Additionally any new mdev driver that wants to work on mdev
> > > device during probe() routine registered using
> > > mdev_register_driver() needs to get stable mdev structure.
> > >
> >
> > Things that you are trying to handle with mdev structure from probe(),
> > couldn't that be moved to create()?
> >
> >
> > > 4. In following sequence, child devices created while removing mdev
> > > parent device can be left out, or it may lead to race of removing
> > > half initialized child mdev devices.
> > >
> > > issue-1:
> > > --------
> > > cpu-0 cpu-1
> > > ----- -----
> > > mdev_unregister_device()
> > > device_for_each_child()
> > > mdev_device_remove_cb()
> > > mdev_device_remove()
> > > create_store()
> > > mdev_device_create() [...]
> > > device_register()
> > > parent_remove_sysfs_files()
> > > /* BUG: device added by cpu-0
> > > * whose parent is getting removed.
> > > */
> > >
> > > issue-2:
> > > --------
> > > cpu-0 cpu-1
> > > ----- -----
> > > create_store()
> > > mdev_device_create() [...]
> > > device_register()
> > >
> > > [...] mdev_unregister_device()
> > > device_for_each_child()
> > > mdev_device_remove_cb()
> > > mdev_device_remove()
> > >
> > > mdev_create_sysfs_files()
> > > /* BUG: create is adding
> > > * sysfs files for a device
> > > * which is undergoing removal.
> > > */
> > > parent_remove_sysfs_files()
> > >
> > > 5. Below crash is observed when user initiated remove is in progress
> > > and mdev_unregister_driver() completes parent unregistration.
> > >
> > > cpu-0 cpu-1
> > > ----- -----
> > > remove_store()
> > > mdev_device_remove()
> > > active = false;
> > > mdev_unregister_device()
> > > remove type
> > > [...]
> > > mdev_remove_ops() crashes.
> > >
> > > This is similar race like create() racing with mdev_unregister_device().
> > >
> > > mtty mtty: MDEV: Registered
> > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group
> > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id =
> > > 57 mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> > > mtty_dev: Unloaded!
> > > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > Oops: 0000 [#1] SMP PTI
> > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > > mdev_device_remove+0xef/0x130 [mdev]
> > > remove_store+0x77/0xa0 [mdev]
> > > kernfs_fop_write+0x113/0x1a0
> > > __vfs_write+0x33/0x1b0
> > > ? rcu_read_lock_sched_held+0x64/0x70
> > > ? rcu_sync_lockdep_assert+0x2a/0x50 ? __sb_start_write+0x121/0x1b0
> > > ? vfs_write+0x17c/0x1b0
> > > vfs_write+0xad/0x1b0
> > > ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > ksys_write+0x55/0xc0
> > > do_syscall_64+0x5a/0x210
> > >
> > > Therefore, mdev core is improved in following ways to overcome above
> > > issues.
> > >
> > > 1. Before placing mdev devices on the bus, perform vendor drivers
> > > creation which supports the mdev creation.
> > > This ensures that mdev specific all necessary fields are initialized
> > > before a given mdev can be accessed by bus driver.
> > >
> > > 2. During remove flow, first remove the device from the bus. This
> > > ensures that any bus specific devices and data is cleared.
> > > Once device is taken of the mdev bus, perform remove() of mdev from
> > > the vendor driver.
> > >
> >
> > If user space application is using the device and someone underneath
> > remove the device from bus, how would use space application know that
> > device is being removed?
> > If DMA is setup, user space application is accessing that memory and
> > device is removed from bus - how will you restrict to not to remove
> > that device? If remove() is not restricted then host might crash.
> > I know Linux kernel device core model doesn't allow remove() to fail,
> > but we had tackled that problem for mdev devices in this framework. I
> > prefer not to change this behavior. This will regress existing working
> > drivers.
>
>
> We have exactly this issue with vfio-pci, or really any vfio driver, where the
> solution is that a remove request is blocked until the device becomes
> unused by the user. In fact there's a notification that userspace can connect
> to so that we don't need to silently wait for userspace to be done. We could
> also potentially kill the userspace application using the device, or if we ever
> implemented revoke support for mmaps, we could unmap the device and
> the use could handle the SIGBUS. With Parav's suggestion to fix the ordering
> such that the device is first removed from the bus, where the blocking
> opportunity comes into play, it might be time to let go of this one-off
> force/not-force behavior. Thanks,
>
Yes. I think we should do it.
For now (for next few days), I am dropping this particular order fixing patch from the series.
>From my last 8th patch, I am keeping only the fix for create/remove race with parent removal along with other fixes and cleanup.
Posting the v1 in sometime to make progress on already reviewed parts and part of the 8th patch.
I cannot split the remove_common() helper function to a different patch, because remove_cb() will bypass mdev->active check without srcu().
So as individual patch, its not correct behavior.
Hence, that small refactor is part of srcu fix.