Re: Possible scheduler bug

From: Mike Galbraith
Date: Tue Aug 23 2011 - 23:43:42 EST


On Tue, 2011-08-23 at 20:58 -0500, seth bollinger wrote:
> Hello All,
>
> We recently ran into an interesting scheduler problem when testing one
> of our products. It manifested itself as a user space lockup. When I
> enabled/printed scheduler stats I noticed that the scheduler was
> always picking the same task to run, and no task stats were being
> updated (clock, sum_exec, sum_sleep, etc.). The scheduler would become
> stuck in this state permanently. This problem was ultimately resolved
> by the following patch to sched.c:
>
> @@ -564,7 +569,7 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
>  	 * A queue event has occurred, and we're going to schedule. In
>  	 * this case, we can save a useless back to back clock update.
>  	 */
> -	if (test_tsk_need_resched(p))
> +	if (rq->curr->se.on_rq && test_tsk_need_resched(rq->curr))
>  		rq->skip_clock_update = 1;
>  }

Yeah, that's correct, but see commit
f26f9aff6aaf67e9a430d16c266f91b13a5bff64. You'll want the other bits of
that fix as well (but not the WARN_ON()).

> I have two questions regarding this patch.
>
> 1. How was it possible to get the scheduler locked up like that (prior
> to patch application)?

If the clock isn't updated, vruntimes don't advance, so you could end up
selecting the same task repeatedly.
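
To make that concrete, here is a rough userspace sketch of the failure
mode. It is a toy model, not kernel code: every toy_* name is
hypothetical, the "pick the smallest vruntime" loop stands in for the
real rbtree, and it assumes the skip flag simply stays set forever,
which is a simplification of what the stale skip_clock_update did.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model only: all names below are hypothetical, not kernel APIs. */
struct toy_task {
	const char *name;
	uint64_t vruntime;		/* virtual runtime, in ns */
};

struct toy_rq {
	uint64_t clock;			/* runqueue's view of time */
	uint64_t last_update;		/* clock value at last accounting */
	int skip_clock_update;		/* when set, clock refresh is skipped */
};

/* Refresh the runqueue clock unless the skip flag suppresses it. */
static void toy_update_rq_clock(struct toy_rq *rq, uint64_t now)
{
	if (!rq->skip_clock_update)
		rq->clock = now;
}

/* Charge the running task for the clock time that has elapsed. */
static void toy_update_curr(struct toy_rq *rq, struct toy_task *curr)
{
	uint64_t delta = rq->clock - rq->last_update;

	curr->vruntime += delta;	/* delta is 0 if the clock is frozen */
	rq->last_update = rq->clock;
}

/* Pick the task with the smallest vruntime (first one wins a tie). */
static size_t toy_pick_next(const struct toy_task *tasks, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++)
		if (tasks[i].vruntime < tasks[best].vruntime)
			best = i;
	return best;
}

/*
 * Make 'rounds' scheduling decisions, with 'slice_ns' of real time
 * passing between them. Returns how often tasks[0] was picked.
 */
static unsigned toy_run(struct toy_task *tasks, size_t n, int stuck_flag,
			unsigned rounds, uint64_t slice_ns)
{
	struct toy_rq rq = { 0, 0, stuck_flag };
	uint64_t now = 0;
	unsigned picks_of_first = 0;

	for (unsigned r = 0; r < rounds; r++) {
		size_t cur = toy_pick_next(tasks, n);

		if (cur == 0)
			picks_of_first++;
		now += slice_ns;	/* real time always moves on */
		toy_update_rq_clock(&rq, now);
		toy_update_curr(&rq, &tasks[cur]);
	}
	return picks_of_first;
}
```

With two tasks starting at equal vruntime, calling toy_run() with
stuck_flag = 0 alternates between them, whereas stuck_flag = 1 leaves
every vruntime untouched and picks tasks[0] on every round.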

> 2. After patch, is it possible that the scheduler could spin in this
> loop until a sched_clock() tick (our clock resolution is unfortunately
> 10ms)?

If you take the rest of the fix, that shouldn't happen.

-Mike
