kernel BUG at kernel/sched_rt.c:493!
From: Shawn Bohrer
Date: Sat Jan 05 2013 - 12:46:30 EST
We recently managed to crash 10 of our test machines at the same time.
Half of the machines were running a 3.1.9 kernel and half were running
3.4.9. I realize that these are both fairly old kernels but I've
skimmed the list of fixes in the 3.4.* stable series and didn't see
anything that appeared to be relevant to this issue.
All we managed to get was some screenshots of the stacks from the
consoles. On one of the 3.1.9 machines you can see we hit the
BUG_ON(want) statement in __disable_runtime() at
kernel/sched_rt.c:493, and all of the machines had essentially the
same stack.
Here is one of the screenshots of the 3.1.9 machines:
And here is one from a 3.4.9 machine:
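Since the screenshots are hard to read, here is a simplified userspace
model of the accounting I believe __disable_runtime() performs (this is
my reading of the code, not the kernel source itself, and the numbers
are invented). Each CPU's rt_rq normally holds sched_rt_runtime_us of
runtime (950000us by default), a throttled runqueue can borrow runtime
from its siblings, and when the domain is torn down every borrowed
microsecond has to be moved back; BUG_ON(want) fires if any runtime
leaked out of the system:

```shell
#!/bin/sh
# Simplified model of __disable_runtime() reclaiming lent runtime.
# Only the want > 0 direction is shown; the kernel also handles the
# case where the runqueue being torn down is the borrower.

default=950000                 # sched_rt_runtime_us default

# rq0 lent out 150000us; rq1 and rq2 currently hold the borrowed time
rq0=800000 rq1=1050000 rq2=1000000 rq3=950000

want=$((default - rq0))        # how much rq0 still needs to reclaim

for name in rq1 rq2 rq3; do
    eval cur=\$$name
    excess=$((cur - default))  # what this sibling borrowed
    if [ "$want" -gt 0 ] && [ "$excess" -gt 0 ]; then
        take=$excess
        if [ "$take" -gt "$want" ]; then take=$want; fi
        eval $name=$((cur - take))
        want=$((want - take))
    fi
done

if [ "$want" -ne 0 ]; then
    echo "BUG_ON(want): $want"   # the crash we are seeing
else
    echo "runtime fully reclaimed"
fi
```

In the healthy case the loop drives want to zero; the crash implies the
kernel's books did not balance when the domains were rebuilt.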
Three of the five 3.4.9 machines also managed to print
"[sched_delayed] sched: RT throttling activated" ~7 minutes before the
machines locked up.
I've tried reproducing the issue, but so far I've been unsuccessful.
I believe that is because my RT tasks aren't using enough CPU to
cause borrowing from the other runqueues. Normally our RT tasks use
very little CPU, so I'm not entirely sure what conditions caused them
to run into throttling on the day that this happened.
The details that I do know about the workload that caused this are as
follows:
1) These are all dual-socket, quad-core X5460 systems with no
hyperthreading, so there are 8 cores total in the system.
2) We use the cpuset cgroup to apply CPU affinity to various types of
processes. Initially everything starts out in a single cpuset, and the
top level cpuset has cpuset.sched_load_balance=1, so there is only a
single scheduling domain.
3) In this case tasks were then placed into four non-overlapping
cpusets: one containing a single core and a single SCHED_FIFO task,
two each containing two cores and multiple SCHED_FIFO tasks, and one
containing three cores and everything else on the system running as
SCHED_OTHER.
4) In the case of cpusets that contain SCHED_FIFO tasks, the tasks
start out as SCHED_OTHER, are placed into the cpuset, and then change
their policy to SCHED_FIFO.
5) Once all tasks are placed into the non-overlapping cpusets, the top
level cpuset.sched_load_balance is set to 0 to split the system into
four scheduling domains.
6) The system ran like this for some unknown amount of time.
7) All the processes are then sent a signal to exit, and at the same
time the top level cpuset.sched_load_balance is set back to 1. This
is when the systems locked up.
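For concreteness, the sequence in steps 2-7 would look roughly like
the following shell commands. This is an illustrative sketch only: the
mount point, cpuset names, RT priority, and the $PID placeholder are
my assumptions and not taken from the affected machines.

```shell
# Assumes the cgroup v1 cpuset controller is mounted here (needs root)
cs=/sys/fs/cgroup/cpuset

# step 3: four non-overlapping cpusets on the 8-core box
mkdir $cs/rt_a $cs/rt_b $cs/rt_c $cs/other
echo 0   > $cs/rt_a/cpuset.cpus
echo 1-2 > $cs/rt_b/cpuset.cpus
echo 3-4 > $cs/rt_c/cpuset.cpus
echo 5-7 > $cs/other/cpuset.cpus
for d in rt_a rt_b rt_c other; do echo 0 > $cs/$d/cpuset.mems; done

# step 4: move a task in while it is still SCHED_OTHER, then flip it
echo $PID > $cs/rt_a/tasks
chrt -f -p 50 $PID              # now SCHED_FIFO

# step 5: split the system into four scheduling domains
echo 0 > $cs/cpuset.sched_load_balance

# ... workload runs for some time (step 6) ...

# step 7: tell everything to exit and rebuild a single domain;
# this is the point at which the machines hit the BUG_ON
kill $PID
echo 1 > $cs/cpuset.sched_load_balance
```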
Hopefully that is enough information to give someone more familiar
with the scheduler code an idea of where the bug is here. I will
point out that in step #5 above there is a small window where the RT
tasks could encounter runtime limits but are still in a single big
scheduling domain. I don't know if that is what happened or if it is
simply sufficient to hit the runtime limits while the system is split
into four domains. For the curious, we are using the default RT
bandwidth limits:

# grep . /proc/sys/kernel/sched_rt_*
/proc/sys/kernel/sched_rt_period_us:1000000
/proc/sys/kernel/sched_rt_runtime_us:950000
Let me know if anyone needs any more information about this issue.