Re: [PATCH v3] vfio/mdev: Check globally for duplicate devices

From: Kirti Wankhede
Date: Fri May 18 2018 - 14:07:32 EST




On 5/18/2018 11:00 PM, Alex Williamson wrote:
> On Fri, 18 May 2018 12:34:03 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>
>> On 5/18/2018 3:07 AM, Alex Williamson wrote:
>>> On Fri, 18 May 2018 01:56:50 +0530
>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>>>
>>>> On 5/17/2018 9:50 PM, Alex Williamson wrote:
>>>>> On Thu, 17 May 2018 21:25:22 +0530
>>>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On 5/17/2018 1:39 PM, Cornelia Huck wrote:
>>>>>>> On Wed, 16 May 2018 21:30:19 -0600
>>>>>>> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> When we create an mdev device, we check for duplicates against the
>>>>>>>> parent device and return -EEXIST if found, but the mdev device
>>>>>>>> namespace is global since we'll link all devices from the bus. We do
>>>>>>>> catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
>>>>>>>> with it comes a kernel warning and stack trace for trying to create
>>>>>>>> duplicate sysfs links, which makes it an undesirable response.
>>>>>>>>
>>>>>>>> Therefore we should really be looking for duplicates across all mdev
>>>>>>>> parent devices, or as implemented here, against our mdev device list.
>>>>>>>> Using mdev_list to prevent duplicates means that we can remove
>>>>>>>> mdev_parent.lock, but in order not to serialize mdev device creation
>>>>>>>> and removal globally, we add mdev_device.active which allows UUIDs to
>>>>>>>> be reserved such that we can drop the mdev_list_lock before the mdev
>>>>>>>> device is fully in place.
>>>>>>>>
>>>>>>>> NB. there was never intended to be any serialization guarantee
>>>>>>>> provided by the mdev core with respect to creation and removal of mdev
>>>>>>>> devices, mdev_parent.lock provided this only as a side-effect of the
>>>>>>>> implementation for locking the namespace per parent. That
>>>>>>>> serialization is now removed.
>>>>>>>
>>>>>>
>>>>>> mdev_parent.lock is to serialize create and remove of that mdev device,
>>>>>> that handles race condition that Cornelia mentioned below.
>>>>>
>>>>> Previously it was stated:
>>>>>
>>>>> On Thu, 17 May 2018 01:01:40 +0530
>>>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>>>>>> Here lock is not for create/remove routines of vendor driver, its about
>>>>>> mdev device creation and device registration, which is a common code
>>>>>> path, and so is part of mdev core module.
>>>>>
>>>>> So the race condition was handled previously, but as a side-effect of
>>>>> protecting the namespace, aiui. I'm trying to state above that the
>>>>> serialization of create/remove was never intended as a guarantee
>>>>> provided to mdev vendor drivers. I don't see that there's a need to
>>>>> protect "mdev device creation and device registration" beyond conflicts
>>>>> in the UUID namespace, which is done here. Are there others?
>>>>>
>>>>
>>>> Sorry not being elaborative in my earlier response to
>>>>
>>>>> If we can
>>>>> show that vendor drivers handle the create/remove paths themselves,
>>>>> perhaps we can refine the locking granularity.
>>>>
>>>> mdev_device_create() function does :
>>>> - create mdev device
>>>> - register device
>>>> - call vendor driver->create
>>>> - create sysfs files.
>>>>
>>>> mdev_device_remove() removes sysfs files, unregister device and delete
>>>> device.
>>>>
>>>> There is common code in mdev_device_create() and mdev_device_remove()
>>>> independent of what vendor driver does in its create()/remove()
>>>> callback. Moving this code to each vendor driver to handle create/remove
>>>> themselves doesn't make sense to me.
>>>
>>> I don't see where anyone is suggesting that, I'm not.
>>>
>>>> mdev_parent.lock here does take care of race conditions that could occur
>>>> during mdev device creation and deletion in this common code path.
>>>
>>> Exactly what races in the common code path is mdev_parent.lock
>>> preventing? mdev_device_create() calls:
>>>
>>> device_register()
>>> mdev_device_create_ops()
>>> parent->ops->create()
>>> sysfs_create_groups()
>>> mdev_create_sysfs_files()
>>> sysfs_create_files()
>>> sysfs_create_link()
>>> sysfs_create_link()
>>>
>>> mdev_parent.lock is certainly not serializing all calls across the
>>> entire kernel to device_register and sysfs_create_{groups,files,link}
>>> so what is it protecting other than serializing parent->ops->create()?
>>> Locks protect data, not code. The data we're protecting is the shared
>>> mdev_list, there is no shared data once mdev_device_create() has its
>>> mdev_device with uuid reservation placed into that list.
>>>
>
> Thank you for enumerating these points below.
>
>> This lock prevents race condition that could occur due to sysfs entries
>> 1. between write on 'create' and 'remove' sysfs file of mdev device
>> As per current code without lock, mdev_create_sysfs_files() creates
>> 'remove' sysfs, but before adding this mdev device to mdev_list, if
>> 'remove' is called, that would return -ENODEV even if the device is seen
>> in sysfs
>
> mdev_parent.lock doesn't play a factor here. As it exists today, the
> sysfs remove attribute is added during mdev_device_create() while
> holding mdev_parent.lock, but only after releasing that lock is the
> mdev_device added to mdev_list. mdev_device_remove() only uses the
> mdev_list, so there exists a gap where the remove attribute is visible
> to userspace, but a write to it will return -ENODEV.

Ah, that's right.

> Let's not fool
> ourselves that the current code is bulletproof here, we're debating
> whether it's more reasonable to return -ENODEV or -EAGAIN in the gap
> between the sysfs remove attribute being created and the completion of
> mdev_device_create(). Personally I think -EAGAIN makes more sense,
> which is why I chose to separate it from the -ENODEV return.
>
> Additionally, we can consider the write on 'remove' and write on
> 'remove' case. A second writer to the remove attribute would today see
> -ENODEV, which is incorrect as the device again clearly does exist.
> Furthermore if mdev_device_remove_ops() fails, the mdev_device can be
> re-added to the mdev_list, so a subsequent remove could work.
> Effectively the device disappears and comes back according to the
> remove attribute while in my proposal the user would see a much more
> consistent device status, -EAGAIN if the user is racing another
> remove, allowing the device to return to "found" should
> mdev_device_remove_ops() fail. Again, I this makes more sense to me
> than the existing behavior.
>

Ok. This makes sense to me.

>> 2. between write on 'remove' and 'create' sysfs file
>> If 'remove' of a device is in progress (device is removed from
>> mdev_list but sysfs entries are not yet removed) and again 'create' of
>> same device with same parent is called, will hit duplicate entries
>> error for sysfs.
>
> This is a good point, but the fix is trivial, move the list_del to
> mdev_device_release().
>

Sounds good.

>> 3. between multiple writes on 'create' with same uuid
>> current code doesn't handle the case you are fixing here, if same uuid
>> is used to create mdev device on different parents.
>>
>> Your change handles #1 and #3, but still there is a small gap for #2.
>> Even its a very small gap, but if such conditions are it in production
>> environment, it becomes difficult to debug.
>
> Fixed trivially and arguably better than existing code as the namespace
> is protected through the very end of the lifecycle of the mdev_device.
>
>>>> What is the urge to remove mdev_parent.lock if that handles all race
>>>> conditions without bothering user to handle -EAGAIN?
>>>
>>> Can you say why -EAGAIN is undesirable? Note that the user is only
>>> going to see this error if they attempt to remove the device in the
>>> minuscule gap between the sysfs remove file being created and the
>>> completion of the write to the create sysfs file. It seems like you're
>>> asking that I decrease the locking granularity, but not too much
>>> because mdev_parent.lock protects "things". If the -EAGAIN is really
>>> so terrible, we can avoid it by spinning until the mdev_device is
>>> either not found in the list or becomes active, we don't need
>>> mdev_parent.lock to solve that, but I don't think that's the best
>>> solution and there's no concrete statement to back -EAGAIN being a
>>> problem.
>>
>> Does libvirt handles -EAGAIN error case? I don't know, may be someone
>> from libvirt can comment.
>
> Is this even a valid question given the revelation above? Current code
> has a gap where the user can access remove and see -ENODEV. The
> proposed code has a gap where the user can access remove and see
> -EAGAIN. Either case requires that the user is calling remove prior to
> the device creation being completed, which suggests that userspace
> already has multiple processing stepping on each other. I don't think
> libvirt does this, nor do I think we need to be particularly amenable
> such user code, though I do think the -EAGAIN error is more consistent
> with the actual device state. Thanks,
>

Ok. Make sense.

Thanks,
Kirti