Re: wait_for_completion_timeout() spurious failure under heavyload?

From: Peter Zijlstra
Date: Fri Jun 20 2008 - 07:31:07 EST


On Fri, 2008-06-20 at 13:20 +0200, Ingo Molnar wrote:
> * Jiri Slaby <jirislaby@xxxxxxxxx> wrote:
>
> >> do {
> >> timeout = schedule_timeout(timeout);
> >>
> >> if (!timeout)
> >> return timeout;
> >>
> >> } while (!x->done);
> >>
> >> return timeout;
> >> }
> >>
> >> so if the system is very busy and x->done is not set when
> >> do_wait_for_common() is entered, it is possible that the first call to
> >> schedule_timeout() returns 0 because the task doing wait_for_completion
> >
> > Sorry, but how can schedule_timeout return 0 before the timeout
> > expiration?
>
> the point would be that due to high load, the completion wakeup happens
> first, but then due to scheduling delays the timeout also occurs
> (later), before the wakeup related to x->done has managed to do its
> task.
>
> I.e. due to scheduling delays we report a spurious "timeout" failure,
> despite the completion occuring before the timeout. The timeout is
> really intended to be related to the delay of the completion event, not
> the delay of scheduling after that event happened.
>
> seems like a well-spotted race to me, i agree it's more robust to ignore
> the timeout if we can make progress on x->done, and return a 1 jiffy
> 'timeout remaining' value. Oleg, do you concur?
>
> but i'd do it not like this:
>
> > if (!timeout) {
> > __remove_wait_queue(&x->wait, &wait);
> > - return timeout;
> > + if (x->done) {
> > + x->done--;
> > + return 1;
> > + } else {
> > + return 0;
> > + }
>
> but like in the commit below. Agreed?
>
> Ingo
>
> -------------------->
> commit bb10ed0994927d433f6dbdf274fdb26cfcf516b7
> Author: Roland Dreier <rdreier@xxxxxxxxx>
> Date: Thu Jun 19 15:04:07 2008 -0700
>
> sched: fix wait_for_completion_timeout() spurious failure under heavy load
>
> It seems that the current implementaton of wait_for_completion_timeout()
> has a small problem under very high load for the common pattern:
>
> if (!wait_for_completion_timeout(&done, timeout))
> /* handle failure */
>
> because the implementation very roughly does (lots of code deleted to
> show the basic flow):
>
> static inline long __sched
> do_wait_for_common(struct completion *x, long timeout, int state)
> {
> if (x->done)
> return timeout;
>
> do {
> timeout = schedule_timeout(timeout);
>
> if (!timeout)
> return timeout;
>
> } while (!x->done);
>
> return timeout;
> }
>
> so if the system is very busy and x->done is not set when
> do_wait_for_common() is entered, it is possible that the first call to
> schedule_timeout() returns 0 because the task doing wait_for_completion
> doesn't get rescheduled for a long time, even if it is woken up early
> enough.
>
> In this case, wait_for_completion_timeout() returns 0 without even
> checking x->done again, and the code above falls into its failure case
> purely for scheduler reasons, even if the hardware event or whatever was
> being waited for happened early enough.
>
> It would make sense to add an extra test to do_wait_for() in the timeout
> case and return 1 if x->done is actually set.
>
> A quick audit (not exhaustive) of wait_for_completion_timeout() callers
> seems to indicate that no one actually cares about the return value in
> the success case -- they just test for 0 (timed out) versus non-zero
> (wait succeeded).
>
> Signed-off-by: Ingo Molnar <mingo@xxxxxxx>

Good catch,

Acked-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>

> diff --git a/kernel/sched.c b/kernel/sched.c
> index 4a3cb06..577f160 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4405,6 +4405,16 @@ do_wait_for_common(struct completion *x, long timeout, int state)
> spin_unlock_irq(&x->wait.lock);
> timeout = schedule_timeout(timeout);
> spin_lock_irq(&x->wait.lock);
> +
> + /*
> + * If the completion has arrived meanwhile
> + * then return 1 jiffy time left:
> + */
> + if (x->done && !timeout) {
> + timeout = 1;
> + break;
> + }
> +
> if (!timeout) {
> __remove_wait_queue(&x->wait, &wait);
> return timeout;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/