Re: [PATCH v2] vfio/pci: fix memory leak during D3hot to D0 transition

From: Abhishek Sahu
Date: Mon Feb 07 2022 - 14:24:18 EST


On 2/2/2022 5:01 AM, Alex Williamson wrote:
> On Tue, 1 Feb 2022 17:06:43 +0530
> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>
>> On 2/1/2022 1:41 AM, Alex Williamson wrote:
>>> On Mon, 31 Jan 2022 16:54:50 +0530
>>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>>>
>>>> If needs_pm_restore is set (PCI device does not have support for no
>>>> soft reset), then the current PCI state will be saved during D0->D3hot
>>>> transition and same will be restored back during D3hot->D0 transition.
>>>> For saving the PCI state locally, pci_store_saved_state() is being
>>>> used and the pci_load_and_free_saved_state() will free the allocated
>>>> memory.
>>>>
>>>> But for reset related IOCTLs, vfio driver calls PCI reset related
>>>> API's which will internally change the PCI power state back to D0. So,
>>>> when the guest resumes, then it will get the current state as D0 and it
>>>> will skip the call to vfio_pci_set_power_state() for changing the
>>>> power state to D0 explicitly. In this case, the memory pointed by
>>>> pm_save will never be freed. In a malicious sequence, the state changing
>>>> to D3hot followed by VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be
>>>> run in a loop and it can cause an OOM situation.
>>>>
>>>> Also, pci_pm_reset() returns -EINVAL if we try to reset a device that
>>>> isn't currently in D0. Therefore any path where we're triggering a
>>>> function reset that could use a PM reset and we don't know if the device
>>>> is in D0, should wake up the device before we try that reset.
>>>>
>>>> This patch changes the device power state to D0 by invoking
>>>> vfio_pci_set_power_state() before calling reset related API's.
>>>> It will help in fixing the mentioned memory leak and making sure
>>>> that the device is in D0 during reset. Also, to prevent any similar
>>>> memory leak for future development, this patch frees memory first
>>>> before overwriting 'pm_save'.
>>>>
>>>> Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition")
>>>> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx>
>>>> ---
>>>>
>>>> * Changes in v2
>>>>
>>>> - Add the Fixes tag and sent this patch independently.
>>>> - Invoke vfio_pci_set_power_state() before invoking reset related API's.
>>>> - Removed saving of power state locally.
>>>> - Removed warning before 'kfree(vdev->pm_save)'.
>>>> - Updated comments and commit message according to updated changes.
>>>>
>>>> * v1 of this patch was sent in
>>>> https://lore.kernel.org/lkml/20220124181726.19174-4-abhsahu@xxxxxxxxxx/
>>>>
>>>> drivers/vfio/pci/vfio_pci_core.c | 27 +++++++++++++++++++++++++++
>>>> 1 file changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index f948e6cd2993..d6dd4f7c4b2c 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -228,6 +228,13 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>>> if (!ret) {
>>>> /* D3 might be unsupported via quirk, skip unless in D3 */
>>>> if (needs_save && pdev->current_state >= PCI_D3hot) {
>>>> + /*
>>>> + * If somehow, the vfio driver was not able to free the
>>>> + * memory allocated in pm_save, then free the earlier
>>>> + * memory first before overwriting pm_save to prevent
>>>> + * memory leak.
>>>> + */
>>>> + kfree(vdev->pm_save);
>>>> vdev->pm_save = pci_store_saved_state(pdev);
>>>> } else if (needs_restore) {
>>>> pci_load_and_free_saved_state(pdev, &vdev->pm_save);
>>>> @@ -322,6 +329,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>>> /* For needs_reset */
>>>> lockdep_assert_held(&vdev->vdev.dev_set->lock);
>>>>
>>>> + /*
>>>> + * This function can be invoked while the power state is non-D0,
>>>> + * Change the device power state to D0 first.
>>>
>>> I think we need to describe more why we're doing this than what we're
>>> doing. We need to make sure the device is in D0 in case we have a
>>> reset method that depends on that directly, ex. pci_pm_reset(), or
>>> possibly device specific resets that may access device BAR resources.
>>> I think it's placed here in the function so that the config space
>>> changes below aren't overwritten by restoring the saved state and maybe
>>> also because the set_irqs_ioctl() call might access device MMIO space.
>>>
>>
>> Thanks Alex.
>> I will add more details here in the comment.
>>
>>>> + */
>>>> + vfio_pci_set_power_state(vdev, PCI_D0);
>>>> +
>>>> /* Stop the device from further DMA */
>>>> pci_clear_master(pdev);
>>>>
>>>> @@ -921,6 +934,13 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>>> return -EINVAL;
>>>>
>>>> vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>> +
>>>> + /*
>>>> + * This function can be invoked while the power state is non-D0,
>>>> + * Change the device power state to D0 before doing reset.
>>>> + */
>>>
>>> See below, reconsidering this...
>>>
>>>> + vfio_pci_set_power_state(vdev, PCI_D0);
>>>> +
>>>> ret = pci_try_reset_function(vdev->pdev);
>>>> up_write(&vdev->memory_lock);
>>>>
>>>> @@ -2055,6 +2075,13 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>>> }
>>>> cur_mem = NULL;
>>>>
>>>> + /*
>>>> + * This function can be invoked while the power state is non-D0.
>>>> + * Change power state of all devices to D0 before doing reset.
>>>> + */
>>>
>>> Here I have trouble convincing myself exactly what we're doing. As you
>>> note in patch 1/ of the RFC series, pci_reset_bus(), or more precisely
>>> pci_dev_save_and_disable(), wakes the device to D0 before reset, so we
>>> can't be doing this only to get the device into D0. The function level
>>> resets do the same.
>>>
>>> Actually, now I'm remembering and debugging where I got myself confused
>>> previously with pci_pm_reset(). The scenario was a Windows guest with
>>> an assigned Intel 82574L NIC. When doing a shutdown from the guest the
>>> device is placed in D3hot and we enter vfio_pci_core_disable() in that
>>> state. That function however uses __pci_reset_function_locked(), which
>>> skips the pci_dev_save_and_disable() since much of it is redundant for
>>> that call path (I think I generalized this to all flavors of
>>> pci_reset_function() in my head).
>>
>> Thanks for providing the background related with the original issue.
>>
>>>
>>> The standard call to pci_try_reset_function(), as in the previous
>>> chunk, will make use of pci_dev_save_and_disable(), so for either of
>>> these latter cases the concern cannot be simply having the device in D0,
>>> we need a reason that we want the previously saved state restored on the
>>> device before the reset, and thus restored to the device after the
>>> reset as the rationale for the change.
>>>
>>
>> I will add this as a comment.
>>
>>>> + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
>>>> + vfio_pci_set_power_state(cur, PCI_D0);
>>>> +
>>>> ret = pci_reset_bus(pdev);
>>>>
>>>> err_undo:
>>>
>>>
>>> We also call pci_reset_bus() in vfio_pci_dev_set_try_reset(). In that
>>> case, none of the other devices can be in use by the user, but they can
>>> certainly be in D3hot with previous device state saved off into our
>>> pm_save cache. If we don't have a good reason to restore in that case,
>>> I'm wondering if we really have a good reason to restore in the above
>>> two cases.
>>>
>>> Perhaps we just need the first chunk above to resolve the memory leak,
>>
>> First chunk means only the changes done in vfio_pci_set_power_state()
>> which is calling kfree() before calling pci_store_saved_state().
>> Or I need to include more things in the first patch ?
>
> Correct, first chunk as is the first change in the patch. Patch chunks
> are delineated by the @@ offset lines.
>

Thanks for confirming this.

>>
>> With the kfree(), the original memory leak issue should be solved.
>>
>>> and the second chunk as a separate patch to resolve the issue with
>>> devices entering vfio_pci_core_disable() in non-D0 state. Sorry if I
>>
>> And this second patch will contain rest of the things where
>> we will call vfio_pci_set_power_state() explicitly for moving to
>> D0 state ?
>
> At least the first one in vfio_pci_core_disable(), the others need
> justification.
>

Yes. First one is needed.

>> Also, We need to explore if setting to D0 state is really required at
>> all these places and If it is not required, then we don't need second
>> patch ?
>
> We need a second patch, I'm convinced that we don't otherwise wake the
> device to D0 before we potentially get to pci_pm_reset() in
> vfio_pci_core_disable(). It's the remaining cases of setting D0 that
> I'm less clear on. If it's the case that we need to restore config
> space any time a NoSoftRst- device is woken from D3hot and the state
> saved and restored around the reset is meaningless otherwise, that's a
> valid justification, but is it accurate? If so, we should recheck the
> other case of calling pci_reset_bus() too. Thanks,
>
> Alex
>

I was analyzing this part in detail and added some debug prints and
made user space program to understand it better. Also, I have gone
through the patch 51ef3a004b1e (“vfio/pci: Restore device state on
PM transition”).

We have 2 cases here:

1. The devices which has NoSoftRst+ (needs_pm_restore is false).
This case should work fine for all the cases (Apart from vfio_pci_core_disable())
without waking-up the device explicitly.

2. The devices which has NoSoftRst- (needs_pm_restore is true).

For case 2, let’s consider following example:

a. The device is in D3hot.
b. User made VFIO_DEVICE_RESET ioctl.
c. pci_try_reset_function() will be called which internally
invokes pci_dev_save_and_disable().
d. pci_set_power_state(dev, PCI_D0) will be called first.
e. pci_save_state() will happen then.

Now, for the devices which has NoSoftRst-,
the pci_set_power_state() should trigger soft reset and
we may lose the original state at step (d) and this state
cannot be restored.

For example, lets assume the case, where SBIOS or host
linux kernel (In the aspm.c) enables PCIe LTR setting for the
PCIe device. When this soft reset will be triggered, then this
LTR setting may be reset, and the device state saved at step (e)
will also have this setting cleared so it cannot be restored.
Same thing can be happened for other PCIe capabilities. Since the
vfio driver only exposes limited enhanced capabilities to its user
So, the vfio-driver user also won’t have option to save and
restore these capabilities state and these original
settings will be permanently lost.

So, it seems we need to always move the device explicitly to
D0 state by calling vfio_pci_set_power_state() before
any reset for the reset triggered by IOCTLs. This is mainly to
preserve the state around soft reset.

For vfio_pci_dev_set_try_reset() also, we can have the above
mentioned situation. The other functions/devices can be in D3hot
state and the D0 transition can cause soft reset there also.
For example, in my case, NVIDIA GPU has VGA (func 0) and audio
(func 1) function. I added debug print to dump the current state
before and after pci_reset_bus(). Before pci_reset_bus() the func 1
state was D3hot and after pci_reset_bus() the func 1 state got
changed to D0. This pci_reset_bus() was called during closing of
func 0 device so there are chances of soft reset for other
function/devices.

Thanks,
Abhishek