Re: [PATCH v2] xen/xenbus: make xs_talkv() interruptible

From: Andrew Cooper
Date: Thu Dec 17 2020 - 13:27:06 EST


On 16/12/2020 08:21, Jürgen Groß wrote:
> On 15.12.20 21:59, Andrew Cooper wrote:
>> On 15/12/2020 11:10, Juergen Gross wrote:
>>> In case a process waits for any Xenstore action in the xenbus driver
>>> it should be interruptible by signals.
>>>
>>> Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
>>> ---
>>> V2:
>>> - don't special case SIGKILL as libxenstore is handling -EINTR fine
>>> ---
>>>   drivers/xen/xenbus/xenbus_xs.c | 9 ++++++++-
>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/xen/xenbus/xenbus_xs.c
>>> b/drivers/xen/xenbus/xenbus_xs.c
>>> index 3a06eb699f33..17c8f8a155fd 100644
>>> --- a/drivers/xen/xenbus/xenbus_xs.c
>>> +++ b/drivers/xen/xenbus/xenbus_xs.c
>>> @@ -205,8 +205,15 @@ static bool test_reply(struct xb_req_data *req)
>>>     static void *read_reply(struct xb_req_data *req)
>>>   {
>>> +    int ret;
>>> +
>>>       do {
>>> -        wait_event(req->wq, test_reply(req));
>>> +        ret = wait_event_interruptible(req->wq, test_reply(req));
>>> +
>>> +        if (ret == -ERESTARTSYS && signal_pending(current)) {
>>> +            req->msg.type = XS_ERROR;
>>> +            return ERR_PTR(-EINTR);
>>> +        }
>>
>> Now that I can talk fully about the situations which lead to this, I think
>> there is a bit more complexity.
>>
>> It turns out there are a number of issues related to running a Xen
>> system with no xenstored.
>>
>> 1) If a xenstore-write occurs during startup before init-xenstore-domain
>> runs, the former blocks on /dev/xen/xenbus waiting for xenstored to
>> reply, while the latter blocks on /dev/xen/xenbus_backend when trying to
>> tell the dom0 kernel that xenstored is in dom1.  This effectively
>> deadlocks the system.
>
> This should be easy to solve: any request to /dev/xen/xenbus should
> block upfront in case xenstored isn't up yet (could e.g. wait
> interruptible until xenstored_ready is non-zero).
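
If I've read that right, the suggestion is roughly the following, where
xs_ready_wq is a hypothetical waitqueue woken at the point
xenstored_ready gets set:

    /* At the top of xs_talkv(), before queueing anything for the ring. */
    if (wait_event_interruptible(xs_ready_wq, xenstored_ready))
        return ERR_PTR(-EINTR);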

I'm not sure that that would fix the problem.  The problem is that
setting the ring details via /dev/xen/xenbus_backend blocks, which
prevents us launching the xenstored stubdomain, which prevents the
earlier xenbus write being completed.

So long as /dev/xen/xenbus_backend doesn't block, there's no problem
with other /dev/xen/xenbus activity being pending briefly.


Looking at the current logic, I'm not completely convinced.  Even
finding a filled-in evtchn/gfn doesn't mean that xenstored is actually
ready.

There are 3 possible cases.

1) PV guest, and details in start_info
2) HVM guest, and details in HVM_PARAMs
3) No details (expected for dom0).  Something in userspace must provide
details at a later point.

So the setup phases go from nothing, to having ring details, to finding
the ring working.

I think it would be prudent to try reading a key between having details
and declaring xenstored_ready.  Any activity, even XS_ERROR,
indicates that the other end of the ring is listening.
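
Something along these lines, perhaps (purely illustrative - the helper
name is made up, and "domid" is just an arbitrary key to poke):

    /* Hypothetically called from xenbus_probe(), once the ring details
     * are known but before xenstored_ready is set.  Issue one read and
     * treat any completed reply - success or XS_ERROR - as evidence
     * that something on the far end is actually processing requests. */
    static void xenstored_probe_listening(void)
    {
        unsigned int len;
        void *val = xenbus_read(XBT_NIL, "domid", "", &len);

        if (!IS_ERR(val))
            kfree(val);
    }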

>
>> 2) If xenstore-watch is running when xenstored dies, it spins at 100%
>> cpu usage making no system calls at all.  This is caused by bad error
>> handling from xs_watch(), and attempting to debug found:
>
> Can you expand on "bad error handling from xs_watch()", please?

do_watch() has

    for ( ... ) { // defaults to an infinite loop
        vec = xs_read_watch();
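        /* NULL covers every failure, including xenstored having gone
         * away, so this loops straight back around - the 100% cpu spin. */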
        if (vec == NULL)
            continue;
        ...
    }


My next plan was to experiment with break instead of continue, which
I'll get to at some point.

>
>>
>> 3) (this issue).  If anyone starts xenstore-watch with no xenstored
>> running at all, it blocks in the D state in the kernel.
>
> Should be handled with solution for 1).
>
>>
>> The cause is the special handling for watch/unwatch commands which,
>> instead of just queuing up the data for xenstore, explicitly waits for
>> an OK for registering the watch.  This causes a write() system call to
>> block waiting for a non-existent entity to reply.
>>
>> So while this patch does resolve the major usability issue I found (I
>> can't even SIGINT and get my terminal back), I think there are issues.
>>
>> The reason why XS_WATCH/XS_UNWATCH are special cased is because they do
>> require special handling.  The main kernel thread for processing
>> incoming data from xenstored does need to know how to associate each
>> async XS_WATCH_EVENT to the caller who watched the path.
>>
>> Therefore, depending on when this cancellation hits, we might be in any
>> of the following states:
>>
>> 1) the watch is queued in the kernel, but not even sent to xenstored yet
>> 2) the watch is queued in the xenstored ring, but not acted upon
>> 3) the watch is queued in the xenstored ring, and the xenstored has seen
>> it but not replied yet
>> 4) the watch has been processed, but the XS_WATCH reply hasn't been
>> received yet
>> 5) the watch has been processed, and the XS_WATCH reply received
>>
>> State 5 (and, briefly, state 4) is the normal success path when xenstored has
>> acted upon the request, and the internal kernel infrastructure is set up
>> appropriately to handle XS_WATCH_EVENTs.
>>
>> States 1 and 2 can be very common if there is no xenstored (or at least,
>> it hasn't started up yet).  In reality, there is either no xenstored, or
>> it is up and running (and for a period of time during system startup,
>> these cases occur in sequence).
>
> Yes, this is the reason we can't just reject a user request if xenstored
> hasn't been detected yet: it could be just starting.

Right, and I'm not suggesting that we'd want to reject accesses while
xenstored is starting up.

>
>>
>> As soon as the XS_WATCH event has been written into the xenstored ring,
>> it is not safe to cancel.  You've committed to xenstored processing the
>> request (if it is up).
>
> I'm not sure this is true. Cancelling it might result in a stale watch
> in xenstored, but there shouldn't be a problem related to that. In case
> that watch fires, the event will normally be discarded by the kernel, as
> no matching watch is found in the kernel's data. In case a new watch
> has been set up with the same struct xenbus_watch address (which is used
> as the token), then this new watch might fire without the node of the
> new watch having changed, but spurious watch events are defined to be
> okay (OTOH the path in the event might look strange to the handler).

Watches are a quota'd resource in (at least some) xenstored
configurations.  Losing track of the registration is a resource leak,
even if the kernel can filter and discard the unexpected watch events.

>> If xenstored is actually up and running, it's fine and necessary to
>> block.  The request will be processed in due course (timing subject to
>> the client and server load).  If xenstored isn't up, blocking isn't ok.
>>
>> Therefore, I think we need to distinguish "not yet on the ring" from "on
>> the ring" as the test for whether cancelling is safe, and
>> ensure we don't queue anything on the ring before we're sure xenstored
>> has started up.
>>
>> Does this make sense?
>
> Basically, yes.

Great.  If I get any time, I'll try to look into some fixes along the
above lines.
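
Concretely, for the patch here, that distinction could be an (untested)
tweak to the new check in read_reply(), leaning on the existing
xb_req_state tracking and glossing over whatever locking is needed
against the comms thread which moves requests onto the ring:

        /* Only safe to abandon the request while it is still queued
         * inside the kernel; once it has hit the ring, xenstored (if
         * running) is committed to processing it. */
        if (ret == -ERESTARTSYS && signal_pending(current) &&
            req->state == xb_req_state_queued) {
            req->msg.type = XS_ERROR;
            return ERR_PTR(-EINTR);
        }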

~Andrew