RE: [PATCH 1/3] hv_utils: Add the support of hibernation
From: Dexuan Cui
Date: Thu Sep 19 2019 - 02:34:46 EST
> From: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> Sent: Monday, September 16, 2019 1:46 AM
> Dexuan Cui <decui@xxxxxxxxxxxxx> writes:
>
> >> From: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> >> Sent: Thursday, September 12, 2019 9:37 AM
> >
> >> > +static int util_suspend(struct hv_device *dev)
> >> > +{
> >> > + struct hv_util_service *srv = hv_get_drvdata(dev);
> >> > +
> >> > + if (srv->util_cancel_work)
> >> > + srv->util_cancel_work();
> >> > +
> >> > + vmbus_close(dev->channel);
> >>
> >> And what happens if you receive a real reply from userspace after you
> >> close the channel? You've only cancelled timeout work so the driver
> >> will not try to reply by itself but this doesn't mean we won't try to
> >> write to the channel on legitimate replies from daemons.
> >>
> >> I think you need to block all userspace requests (hang in kernel until
> >> util_resume()).
> >>
> >> While I'm not sure we can't get away without it but I'd suggest we add a
> >> new HVUTIL_SUSPENDED state to the hvutil state machine.
> >> Vitaly
> >
> > When we reach util_suspend(), all the userspace processes have been
> > frozen: see kernel/power/hibernate.c: hibernate() -> freeze_processes(),
> > so here we can not receive a reply from the userspace daemon.
> >
>
> Let's add a WARN() or something then as if this ever happens this is
> going to be realy tricky to catch.
Looking at the path hibernate() -> freeze_processes() ->
try_to_freeze_tasks(true) -> freeze_task() -> fake_signal_wake_up(), I'm
sure when try_to_freeze_tasks(true) returns 0, all the user-space processes
must be frozen in do_signal() -> get_signal() -> try_to_freeze() -> ... ->
__refrigerator().
hibernate () -> hibernation_snapshot () -> dpm_suspend() -> ... ->
util_suspend() only runs after hibernate() -> freeze_processes(), so I'm
pretty sure we're safe here.
> > However, it looks there is indeed some tricky corner cases we need to deal
> > with: in util_resume(), before we call vmbus_open(), all the userspace
> > processes are still frozen, and the userspace daemon (e.g. hv_kvp_daemon)
> > can be in any of these states:
> >
> > 1. the driver has not buffered any message for the daemon. This is good.
> >
> > 2. the driver has buffered a message for the daemon, and
> > kvp_transaction.state is HVUTIL_USERSPACE_REQ. Later, the kvp daemon
> > writes the response to the driver, and in kvp_on_msg()
> > kvp_transaction.state is moved to HVUTIL_USERSPACE_RECV, but since
> > cancel_delayed_work_sync(&kvp_timeout_work) is false (the work has
> > been cancelled by util_suspend()), the driver reports nothing to the host,
> > which is good as the VM has undergone a hibernation event and IMO the
> > response may be stale and I believe the host is not expecting this
> > response from the VM at all (the host side application that requested the
> > KVP must have failed or timed out), but the bad thing is: the "state"
> > remains in HVUTIL_USERSPACE_RECV, preventing
> > hv_kvp_onchannelcallback() from reading from the channel ringbuffer.
> >
> > I suggest we work around this race condition by the below patch:
> >
> > --- a/drivers/hv/hv_kvp.c
> > +++ b/drivers/hv/hv_kvp.c
> > @@ -250,8 +250,8 @@ static int kvp_on_msg(void *msg, int len)
> > */
> > if (cancel_delayed_work_sync(&kvp_timeout_work)) {
> > kvp_respond_to_host(message, error);
> > - hv_poll_channel(kvp_transaction.recv_channel,
> kvp_poll_wrapper);
> > }
> > + hv_poll_channel(kvp_transaction.recv_channel, kvp_poll_wrapper);
> >
> > return 0;
> > }
> >
> > How do you like this?
> >
>
> Is it safe to call hv_poll_channel() simultaneously on several CPUs? It
> seems it is as we're doing
>
> smp_call_function_single(channel->target_cpu, cb, channel, true);
Looks safe to me. The code has been there for years. :-)
> (I'm asking because if it's not, than doing what you suggest will open
> the following window: timeout function kvp_timeout_func() is already
> running but the daemon is trying to reply at the same moment).
>
> > I can imagine there is still a small chance that the state machine can run
> > out of order, and the kvp daemon may exit due to the return values of
> > read()/write() being -EINVAL, but the chance should be small enough in
> > practice, and IMO the issue even exists in the current code when
> > hibernation is not involved, e.g. kvp_timeout_func() and kvp_on_msg()
> > can run concurrently; if kvp_on_msg() runs slowly due to some reason
> > (e.g. the kvp daemon is stopped as I'm gdb'ing it), kvp_timeout_func()
> > fires and moves the state to HVUTIL_READY; later kvp_on_msg() starts
> > to run and returns -EINVAL, and the kvp daemon will exit().
> >
> > IMHO here it looks extremely difficult to make things flawless (if that's
> > even possible), so it's necessary to ask the daemons to auto-restart once
> > they exit() unexpectedly. This can be achieved by configuring systemd
> > properly for the kvp/vss/fcopy services.
>
> I think we can also teach daemons to ignore -EINVAL or switch to
> something like -EAGAIN in non-fatal cases.
Maybe the driver should return 0 rather than -EINVAL for the kvp daemon
in some scenarios, since kvp is never 100% reliable.
> > I'm not sure introducing a HVUTIL_SUSPENDED state would solve all
> > of the corner cases, but I'm sure that would further complicate the
> > current code, at least to me. :-)
> >
>
> Maybe, if we don't need it than we don't. Basically, I see the only
> advantage in having such state: it makes our tricks to support
> hibernation more visible in the code.
> Vitaly
Let me think about this.
BTW, for vss, maybe the VM should not hibernate if there is a backup
ongoing? -- if the file system is frozen by hv_vss_daemon, and the VM
hibernates, then when the VM resumes back, it's almost always true that
the VM won't receive the host's VSS_OP_THAW request, and the VM will
end up in an unusable state.
Thanks,
-- Dexuan