RE: [Patch v2 2/2] PCI: hv: Remove unused refcount and supporting functions for handling bus device removal

From: Dexuan Cui
Date: Sat Apr 24 2021 - 22:57:41 EST


> From: Long Li <longli@xxxxxxxxxxxxx>
> Sent: Friday, April 23, 2021 11:40 AM
> To: Dexuan Cui <decui@xxxxxxxxxxxxx>; longli@xxxxxxxxxxxxxxxxx; KY
>
> > Subject: RE: [Patch v2 2/2] PCI: hv: Remove unused refcount and supporting
> > functions for handling bus device removal
> >
> > > From: longli@xxxxxxxxxxxxxxxxx <longli@xxxxxxxxxxxxxxxxx>
> > > Sent: Wednesday, April 21, 2021 10:46 PM
> > >
> > > With the new method of flushing/stopping the workqueue before doing
> > > bus removal, the old mechanisum of using refcount and wait for
> > > completion
> >
> > mechanisum -> mechanism
> >
> > > is no longer needed. Remove those dead code.
> > >
> > > Signed-off-by: Long Li <longli@xxxxxxxxxxxxx>
> >
> > The patch looks good to me. BTW, can we also remove get_pcichild() and
> > put_pcichild() in an extra patch? I suspect we don't really need those either.
>
> Those two functions are for protecting accessing to the devices on the hbus.
> There are interactions from PCI layer that need guarantee from hbus that the
> device is present at the time of access.
>
> Why do you think we don't' need those?

IMO there is proper locking and synchronization logic in the PCI layer, so
I don't think the 'hpdev' struct can vanish when it's being accessed from the
PCI layer.

I think the 'hpdev' struct can only be freed in two scenarios:
1) the PCI device is removed: the VM receives the PCI_EJECT message,
hv_eject_device_work() calls pci_stop_and_remove_bus_device() to deregister
the pci_dev from the PCI layer, and then does other cleanup, and finally
call kfree(hpdev) in the third put_pcichild() in hv_eject_device_work().

2) the pci-hyperv driver is unloaded: in this case, hv_pci_remove() calls
pci_remove_root_bus() to deregister the pci_dev, and then calls
hv_pci_bus_exit() -> hv_pci_start_relations_work(), and eventually
pci_devices_present_work() decreases the ref counter to zero and free
the 'hpdev'.

In both the case, when the hpdev or the pdev is still being used by the
PCI layer, I think the pci_stop_and_remove_bus_device() and
pci_remove_root_bus() should be blocked, i.e. the hpdev can't be freed
even if we don't have the ref counter mechanism.

For example. we know the 'lspci' program can read the PCI device's config
space directly via the sysfile /sys/bus/pci/devices/XXXX/config; when
'lspci' is reading the config space, the code path can be:

do_syscall_64
ksys_pread64
vfs_read
new_sync_read
kernfs_fop_read_iter
kernfs_file_read_iter
kernfs_get_active -> atomic_inc_unless_negative(&kn->active).
pci_read_config
pci_user_read_config_dword
hv_pcifront_read_config
_hv_pcifront_read_config

At this moment, if the host tries to remove the PCI device, the PCI_EJECT
code path will be blocked due to the kn->active:

hv_eject_device_work()
pci_stop_and_remove_bus_device()
pci_stop_bus_device
pci_remove_sysfs_dev_files
kernfs_remove_by_name_ns
__kernfs_remove
kernfs_drain
wait_event(root->deactivate_waitq,
atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);

I don't check all the scenarios and code paths, but generally speaking
I suppose the PCI subsystem should already have the proper locking and
synchronization logic we need here.

PS, I'm not sure if the host can remove the device by only sending
a PCI_BUS_RELATIONS message with bus_rel->device_count == 0 (i.e. no
PCI_EJECT message): in this case, if we don't have the ref counter for
hpdev, pci_devices_present_work() frees the hpdev before calling
pci_scan_child_bus() and this can be a potential race condition. But IMO
this can be easily fixed by moving "free the hpdev" to a later place, afer
pci_scan_child_bus() is called.

Thanks,
-- Dexuan