RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer failure

From: Dexuan Cui
Date: Fri Nov 28 2014 - 03:36:32 EST


> -----Original Message-----
> From: Jason Wang [mailto:jasowang@xxxxxxxxxx]
> Sent: Friday, November 28, 2014 14:47 PM
> To: Dexuan Cui
> Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; driverdev-
> devel@xxxxxxxxxxxxxxxxxxxxxx; olaf@xxxxxxxxx; apw@xxxxxxxxxxxxx; KY
> Srinivasan; vkuznets@xxxxxxxxxx; Haiyang Zhang
> Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer
> failure
> On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui <decui@xxxxxxxxxxxxx> wrote:
> > In the case the user-space daemon crashes, hangs or is killed, we
> > need to down the semaphore, otherwise, after the daemon starts next
> > time, the obsolete data in fcopy_transaction.message or
> > fcopy_transaction.fcopy_msg will be used immediately.
> >
> > Cc: Jason Wang <jasowang@xxxxxxxxxx>
> > Cc: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> > Cc: K. Y. Srinivasan <kys@xxxxxxxxxxxxx>
> > Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>
> > ---
> >
> > v2: I removed the "FCP" prefix as Greg asked.
> >
> > I also updated the output message a little:
> > "FCP: failed to acquire the semaphore" -->
> > "can not acquire the semaphore: it is benign"
> >
> > v3: I added the code in fcopy_release() as Jason Wang suggested.
> > I removed the pr_debug (it isn't so meaningful)and added a
> > comment instead.
> >
> > drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++
> > 1 file changed, 19 insertions(+)
> >
> > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c
> > index 23b2ce2..faa6ba6 100644
> > --- a/drivers/hv/hv_fcopy.c
> > +++ b/drivers/hv/hv_fcopy.c
> > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct
> > *dummy)
> > * process the pending transaction.
> > */
> > fcopy_respond_to_host(HV_E_FAIL);
> > +
> > + /* In the case the user-space daemon crashes, hangs or is killed, we
> > + * need to down the semaphore, otherwise, after the daemon starts
> > next
> > + * time, the obsolete data in fcopy_transaction.message or
> > + * fcopy_transaction.fcopy_msg will be used immediately.
> > + *
> > + * NOTE: fcopy_read() happens to get the semaphore (very rare)?
> > We're
> > + * still OK, because we've reported the failure to the host.
> > + */
> > + if (down_trylock(&fcopy_transaction.read_sema))
> > + ;
>
> Sorry, I'm not quite understand how if () ; can help here.
>
> Btw, a question not relate to this patch.
>
> What happens if a daemon is resume from SIGSTOP and expires the check
> here?
Hi Jason,
My idea is: here we need down_trylock(), but in case we can't get the
semaphore, it's OK anyway:

Scenario 1):
1.1: when the daemon is blocked on the pread(), the daemon receives SIGSTOP;
1.2: the host user runs the PowerShell Copy-VMFile command;
1.3.1: the driver reports the failure to the host user in 5s and
1.3.2: the driver down()-es the semaphore;
1.4: the daemon receives SIGCONT and it will be still blocked on the pread().
Without the down_trylock(), in 1.4, the daemon can receive an obsolete message.
NOTE: in this scenario, the daemon is not killed.

Scenario 2):
In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2 and
do down() in fcopy_read(), it will receive the message but: the driver has
reported the failure to the host user and the driver's 1.3.2 can't get the
semaphore -- IMO this is acceptably OK, though in the VM, an incomplete
file will be left there.
BTW, I think in the daemon's hv_start_fcopy() we should add a
close(target_fd) before open()-ing a new one.

> >
> > +
> > }
> >
> > static int fcopy_handle_handshake(u32 version)
> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode,
> > struct file *f)
> > */
> > in_hand_shake = true;
> > opened = false;
> > +
> > + if (cancel_delayed_work_sync(&fcopy_work)) {
> > + /* We haven't up()-ed the semaphore(very rare)? */
> > + if (down_trylock(&fcopy_transaction.read_sema))
> > + ;
>
> And this.

Scenario 3):
When the daemon exits(e.g., SIGKILL received), if there is a fcopy_work
pending (scheduled but not start to run yet), we should cancel the
work (as you suggested) and down() the semaphore, otherwise, the
obsolete message will be received by the next instance of the daemon.

Scenario 4): in the driver's hv_fcopy_onchannelcallback():
schedule_delayed_work(&fcopy_work, 5*HZ);
----> if fcopy_release() is running on another vcpu, just before the next line?
fcopy_send_data();

In this case, fcopy_release() can cancel fcopy_work, but
can't get the semaphore since it hasn't been up()-ed.
Hmm, in this case, fcopy_send_data() will do up() later, and we'll
buffer an obsolete message in the driver, and the message will be
fetched by the next instance of the daemon...

Looks we need a spinlock here?

Please share your idea.

> >
> > + fcopy_respond_to_host(HV_E_FAIL);
> > + }
> > return 0;
> > }
> >
> > --
> > 1.9.1
> >

-- Dexuan