Re: [patch] Real-Time Preemption, -RT-2.6.12-rc6-V0.7.48-00

From: Ingo Molnar
Date: Tue Jun 28 2005 - 03:44:58 EST



* William Weston <weston@xxxxxxxxx> wrote:

> OK. Running on -50-25 now. The burnP6 starvation doesn't seem to
> affect the whole system, but comes close enough to require the reset
> button every time. I usually, but not always, lose network, X, the
> keyboard, mouse, and serial console. I'm still unable to get any sort
> of a trace from these lockups, since it's looking more like a bunch of
> processes starving than a kernel crash or a full lockup.
>
> Once, with VLC (viewing a 5mbit/s mcast/udp stream) and two burnP6
> instances running, I was able to fire up top on the serial console and
> found out that the IRQ thread for the ns83820 nic was using 99% of one
> CPU.

aha! that's an important clue. It seems you've got a screaming interrupt
or some other loop in ns83820 irq handling. Could you do this:

chrt -o 0 -p `pidof 'IRQ 18'`

(assuming your ns83820 device is still on IRQ18) To check the command
was effective, enter the following command:

chrt -o -p `pidof 'IRQ 18'`

and you should see output like:

pid 748's current scheduling policy: SCHED_OTHER
pid 748's current scheduling priority: 0

i.e. the thread is not SCHED_FIFO anymore.

this will not fix the ns83820 problem for you, but will make it more
debuggable - you will still probably lose networking, but keyboard and
the local console should work fine. You should see a 99% CPU-time
looping ns83820 IRQ thread when the condition triggers. To debug the
condition further, could you do something like:

vmstat 1

what kind of interrupt rate ('in' field) does it report? If it's very
high then it's a screaming interrupt, if it's low then the IRQ thread is
looping for some other reason. (but both would be bugs of the -RT
kernel.)

also, could you try to get a trace of what the IRQ thread is doing. I've
attached trace-it.c, just run it as root (on a LATENCY_TRACING-enabled
-RT kernel) to get a finegrained trace of what's going on in the system.
Whenever the thread-spinning occurs, just run this utility:

trace-it > trace.txt

and you should get a couple of milliseconds worth of system activity.
The trace output should look like a really long latency-trace. (The
latency traces usually compress really well with bzip2 -9, so you can
attach it to public replies too, if compressed - that way others can
have a look too.)

> Once, with a normal desktop load and a yum update, this came across on
> the serial console:
>
> cat/2100[CPU#1]: BUG in update_out_trace at kernel/latency.c:587

on SMP this could occur if the TSCs of different CPUs are too apart from
each other. I'll probably put an automatic check for this into the
/proc/latency_trace code.

Ingo

/*
* Copyright (C) 1999, Ingo Molnar <mingo@xxxxxxxxxx>
*/

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>

int main (int argc, char **argv)
{
int ret;

if (getuid() != 0) {
fprintf(stderr, "needs to run as root.\n");
exit(1);
}
ret = system("cat /proc/sys/kernel/mcount_enabled >/dev/null 2>/dev/null");
if (ret) {
fprintf(stderr, "CONFIG_LATENCY_TRACING not enabled?\n");
exit(1);
}
system("echo 1 > /proc/sys/kernel/trace_all_cpus");
system("echo 1 > /proc/sys/kernel/trace_enabled");
system("echo 0 > /proc/sys/kernel/trace_freerunning");
system("echo 0 > /proc/sys/kernel/trace_print_at_crash");
system("echo 1 > /proc/sys/kernel/trace_user_triggered");
system("echo 0 > /proc/sys/kernel/trace_verbose");
system("echo 0 > /proc/sys/kernel/preempt_max_latency");
system("echo 0 > /proc/sys/kernel/preempt_thresh");
system("[ -e /proc/sys/kernel/wakeup_timing ] && echo 0 > /proc/sys/kernel/wakeup_timing");
system("echo 1 > /proc/sys/kernel/mcount_enabled");

gettimeofday(0, 1);
usleep(100000);
gettimeofday(0, 0);

system("cat /proc/latency_trace");

return 0;
}


Attachment: trace-it
Description: Binary data