Re: 2.6.14-rc4-rt7

From: Ingo Molnar
Date: Fri Oct 21 2005 - 22:58:51 EST



* Fernando Lopez-Lezcano <nando@xxxxxxxxxxxxxxxxxx> wrote:

> Here's one with rc5-rt3:
>
> Oct 21 15:01:46 cmn3 kernel: BUG: ktimer expired short without user
> signal! (hald-addon-stor:4309)

and no "BUG: foo:1234 waking up bar:4321, expiring ktimer short" message
prior to that? Very weird, this line:

> Oct 21 15:01:46 cmn3 kernel: .. expires: 1012/751245500
> Oct 21 15:01:46 cmn3 kernel: .. expired: 1012/750908115
> Oct 21 15:01:46 cmn3 kernel: .. at line: 942

suggests that the ktimer was expired by ktimer_try_to_cancel() /
ktimer_cancel(), in ktimer_schedule(). I.e. something must have woken
the task early. Probably this theory of mine is incorrect then. I'll try
extend the debug info a bit: it would be interesting to see a 'timer
inserted at' timestamp as well (was it shortly before the problem
happened?), and a 'which PID cancelled the timer' info.

a heavy-hitting but complex-to-set-up solution would be to add a serial
console, and to enable WAKEUP_TIMING+LATENCY_TRACING in the .config, and
to edit kernel/latency.c to initialize the default value of the
following variables:

int wakeup_timing = 0;
int trace_all_cpus = 1;
int trace_freerunning = 1;
int trace_print_at_crash = 1;
int trace_user_triggered = 1;

these variables are in the top portion of latency.c. Important: if you
try this then you should probably also enable IGNORE_PRINTK_LOGLEVEL,
which will improve mass-output to the serial console. Another important
thing is to add a stop_trace() call to kernel/ktimers.c's
check_ktimer_signal() function:

unlock_ktimer_base(timer, &flags);

stop_trace();
printk("BUG: ktimer expired short without user signal! (%s:%d)\n",
current->comm, current->pid);

(otherwise all the trace output you'd be getting would be boring printk
related trace entries.)

this will cause the dump_stack() to also output thousands of trace
entries - all the kernel activity (from all CPUs) that preceded the
ktimer problem. Hopefully this pinpoints the bug.

> In both cases the machine goes catatonic, I don't know if right after
> this or not. It responds to the SysRQ key but that's pretty much it, I
> should probably try to get a serial console going somehow.

would it be easy for you to try the UP kernel? One possibility is that
this is some sort of SMP/APIC-timer related problem.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/