RE: [PATCH] PCI: hv: Do not wait forever on a device that has disappeared

From: Michael Kelley (EOSG)
Date: Mon May 28 2018 - 20:20:05 EST


>
> Before the guest finishes the device initialization, the device can be
> removed anytime by the host, and after that the host won't respond to
> the guest's request, so the guest should be prepared to handle this
> case.
>
> Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>
> Cc: Stephen Hemminger <sthemmin@xxxxxxxxxxxxx>
> Cc: K. Y. Srinivasan <kys@xxxxxxxxxxxxx>
> ---
> drivers/pci/host/pci-hyperv.c | 46 ++++++++++++++++++++++++++++++++-----------
> 1 file changed, 34 insertions(+), 12 deletions(-)
>

While this patch solves the immediate problem of getting hung waiting
for a response from Hyper-V that will never come, there's another scenario
to look at that I think introduces a race. Suppose the guest VM issues a
vmbus_sendpacket() request in one of the cases covered by this patch,
and suppose that Hyper-V queues a response to the request, and then
immediately follows with a rescind request. Processing the response will
get queued to a tasklet associated with the channel, while processing the
rescind will get queued to a tasklet associated with the top-level vmbus
connection. From what I can see, the code doesn't impose any ordering
on processing the two. If the rescind is processed first, the new
wait_for_response() function may wake up, notice the rescind flag, and
return an error. Its caller will then return an error too, and in doing so
release the stack frame that holds the completion packet. When the response
is processed later, it will try to signal completion via a completion packet
that no longer exists, and memory corruption will likely occur.

Am I missing anything that would prevent this scenario from happening?
It is admittedly low probability, and a solution seems non-trivial. I haven't
looked specifically, but a similar scenario is probably possible with the
drivers for other VMbus devices. We should work on a generic solution.
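
To make the lifetime problem concrete, here is a minimal user-space sketch
of one possible direction: allocate the completion packet on the heap and
give it two references, one for the waiter and one for the in-flight
response, so that a response processed after the rescind still touches
valid memory. This is not the pci-hyperv code. Every name in it (comp_pkt,
comp_pkt_alloc, comp_pkt_put, on_response, wait_model) is hypothetical, the
model is single-threaded, and it uses a plain int where kernel code would
need an atomic reference count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a driver's completion packet. */
struct comp_pkt {
	int refcount;   /* 1 for the waiter + 1 for the in-flight response */
	bool completed; /* set by the response handler */
	int status;     /* host status carried by the response */
};

static int live_pkts; /* packets currently allocated (leak check) */

static struct comp_pkt *comp_pkt_alloc(void)
{
	struct comp_pkt *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	p->refcount = 2;
	live_pkts++;
	return p;
}

/* Drop one reference; the last holder frees the packet. */
static void comp_pkt_put(struct comp_pkt *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0) {
		free(p);
		live_pkts--;
	}
}

/* Models the channel tasklet delivering the host's response. */
static void on_response(struct comp_pkt *p, int status)
{
	p->status = status;
	p->completed = true;
	comp_pkt_put(p);
}

/*
 * Models the waiter: returns the response status if it has arrived,
 * -1 if the device was rescinded first, -2 if neither has happened.
 */
static int wait_model(struct comp_pkt *p, bool rescinded)
{
	int ret = p->completed ? p->status : (rescinded ? -1 : -2);

	comp_pkt_put(p);
	return ret;
}

/* The problematic ordering: rescind is processed before the response. */
static int rescind_then_response(void)
{
	struct comp_pkt *p = comp_pkt_alloc();
	int ret;

	if (!p)
		return -3;
	ret = wait_model(p, true); /* waiter gives up, drops its reference */
	on_response(p, 0);         /* late response: memory is still valid */
	return ret;
}

/* The benign ordering: the response is processed before the rescind. */
static int response_then_rescind(void)
{
	struct comp_pkt *p = comp_pkt_alloc();

	if (!p)
		return -3;
	on_response(p, 0);
	return wait_model(p, true);
}
```

In both orderings the packet is freed exactly once, by whichever side drops
the last reference. The sketch leaves the hard part open: if the host never
sends a response at all, the response's reference is never dropped and the
packet leaks, so a real solution would also need a way to reclaim it when
the channel is torn down.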

Michael