Re: [PATCH] sched/deadline: Fix bad accounting of nr_running

From: Juri Lelli
Date: Mon Feb 17 2014 - 10:47:57 EST

Next message: Juri Lelli: "Re: [PATCH] sched/deadline: Test for CPU's presence explicitly"
Previous message: Thomas Glanzmann: "Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]"
Next in thread: Steven Rostedt: "[PATCH v3] sched/deadline: Fix bad accounting of nr_running"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

On 02/15/2014 05:59 AM, Steven Rostedt wrote:
> My test suite was locking up hard when enabling mmiotracer. This was due
> to the mmiotracer placing all but one CPU offline. I found this out
> when I was able to reproduce the bug with just my stress-cpu-hotplug
> test. This bug baffled me because it would not always trigger, and
> would only trigger on the first run after boot up. The
> stress-cpu-hotplug test would crash hard the first run, or never crash
> at all. But a new reboot may cause it to crash on the first run again.
>
> I spent all week bisecting this, as I couldn't find a consistent
> reproducer. I finally narrowed it down to the sched deadline patches,
> and even more peculiar, to the commit that added the sched
> deadline boot up self test to the latency tracer. Then it dawned on me
> to what the bug was.
>
> All it took was to run a task under sched deadline to screw up the CPU
> hot plugging. This explained why it would lock up only on the first run
> of the stress-cpu-hotplug test. The bug happened when the boot up self
> test of the schedule latency tracer would test a deadline task. The
> deadline task would corrupt something that would cause CPU hotplug to
> fail. If it didn't corrupt it, the stress test would always work
> (there's no other sched deadline tasks that would run to cause
> problems). If it did corrupt on boot up, the first test would lockup
> hard.
>
> I proved this theory by running my deadline test program on another box,
> and then run the stress-cpu-hotplug test, and it would now consistently
> lock up. I could run stress-cpu-hotplug over and over with no problem,
> but once I ran the deadline test, the next run of the
> stress-cpu-hotplug would lock hard.
>
> After adding lots of tracing to the code, I found the cause. The
> function tracer showed that migrate_tasks() was stuck in an infinite
> loop, where rq->nr_running never equaled 1 to break out of it. When I
> added a trace_printk() to see what that number was, it was 335 and
> never decrementing!
>
> Looking at the deadline code I found:
>
> static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int
> flags) {
> dequeue_dl_entity(&p->dl);
> dequeue_pushable_dl_task(rq, p);
> }
>
> static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int
> flags) {
> update_curr_dl(rq);
> __dequeue_task_dl(rq, p, flags);
>
> dec_nr_running(rq);
> }
>
> And this:
>
> if (dl_runtime_exceeded(rq, dl_se)) {
> __dequeue_task_dl(rq, curr, 0);
> if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
> dl_se->dl_throttled = 1;
> else
> enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);
>
> if (!is_leftmost(curr, &rq->dl))
> resched_task(curr);
> }
>
> Notice how we call __dequeue_task_dl() and in the else case we
> call enqueue_task_dl()? Also notice that dequeue_task_dl() has
> underscores where enqueue_task_dl() does not. The enqueue_task_dl()
> calls inc_nr_running(rq), but __dequeue_task_dl() does not. This is
> where we get nr_running out of sync.
>

Right. I'd add another place that could cause this misalignment:

static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
{
[snip]
dl_se->dl_throttled = 0;
if (p->on_rq) {
enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
if (task_has_dl_policy(rq->curr))
check_preempt_curr_dl(rq, p, 0);
else
resched_task(rq->curr);
[snip]
}

This is called when the replenishment timer for a throttled task fires,
and we have to queue it back on the dl_rq with recharged parameters.
Best test for this bug is to run a while(1) task with SCHED_DEADLINE
(using for example https://github.com/jlelli/schedtool-dl). This causes
a lot of throttle/replenish events and causes nr_running to explode.
All is ok with this fix.

> By moving the dec_nr_running() from dequeue_task_dl() to
> __dequeue_task_dl(), everything works again. That is, I can run the
> deadline test program and then run the stress-cpu-hotplug() and all
> would be fine.
>

Rationale for this odd behavior is that, when a task is throttled, it
is removed only from the dl_rq, but we keep it on_rq (as this is not
a "full dequeue", that is the task is not actually sleeping). But, it
is also true that, while throttled a task behaves like it is sleeping
(e.g., its timer will fire on a new CPU if the old one is dead). So,
Steven's fix sounds also semantically correct.

Thanks!

Best,

- Juri

> For reference on my test programs:
>
> http://rostedt.homelinux.com/private/stress-cpu-hotplug
> http://rostedt.homelinux.com/private/deadline.c
>
> Signed-off-by: Steven Rostedt <rostedt@xxxxxxxxxxx>
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0dd5e09..84c2454 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -844,14 +844,14 @@ static void __dequeue_task_dl(struct rq *rq,
> struct task_struct *p, int flags) {
> dequeue_dl_entity(&p->dl);
> dequeue_pushable_dl_task(rq, p);
> +
> + dec_nr_running(rq);
> }
>
> static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int
> flags) {
> update_curr_dl(rq);
> __dequeue_task_dl(rq, p, flags);
> -
> - dec_nr_running(rq);
> }
>
> /*
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Juri Lelli: "Re: [PATCH] sched/deadline: Test for CPU's presence explicitly"
Previous message: Thomas Glanzmann: "Re: REGRESSION f54b311142a92ea2e42598e347b84e1655caf8e3 tcp auto corking slows down iSCSI file system creation by factor of 70 [WAS: 4 TB VMFS creation takes 15 minutes vs 26 seconds]"
Next in thread: Steven Rostedt: "[PATCH v3] sched/deadline: Fix bad accounting of nr_running"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]