Re: Regression: Failed boots bisected to 4cd13c21b207 "softirq: Let ksoftirqd do its job"

From: Brian Starkey
Date: Tue Nov 22 2016 - 10:27:46 EST


Hi Eric,

On Tue, Nov 22, 2016 at 06:29:33AM -0800, Eric Dumazet wrote:
On Tue, Nov 22, 2016 at 2:33 AM, Brian Starkey <brian.starkey@xxxxxxx> wrote:

Hi,

On Fri, Nov 18, 2016 at 01:40:43AM +0100, Thomas Gleixner wrote:

Brian,

On Thu, 17 Nov 2016, Brian Starkey wrote:

No joy with this patch :-(

I had to add an ioaddr argument because apparently that macro depends
on local context (yuck...), but it doesn't help my issue.

FWIW I don't see any timeouts, either with or without the patch.
(I don't know for sure, but I would guess that the model of the
network card doesn't model whatever stall that loop is checking for.
It probably just completes all MMU operations immediately)


Is there a chance that you enable trace points at the kernel command line?

trace_event=sched_wakeup,sched_switch,irq_handler_entry,irq_handler_exit,softirq_raise,softirq_entry,softirq_exit

should be enough for a start. All we need aside of that is a trigger to
stop the trace so we can actually see the events around the time where
things go stale.

I assume that the whole issue is visible throughout the slow progress of
init towards a working system, so for a start it would be sufficient to add
something like this into the startup sequence at some point:

mount -t debugfs debugfs /sys/kernel/debug
echo 0 >/sys/kernel/debug/tracing/tracing_on

The only interesting challange is to get the trace data out of the
system. The trace is accessible via:

cat /sys/kernel/tracing/trace

So if your ssh works at some point, that might be an option or you just try
to store it over NFS (which will be slow, but better than nothing). Maybe
you have a better idea :)


I finally managed to pry some traces out this morning. It seems like
the system struggles to even invoke echo when it's doing badly.

Trace before 4cd13c21b207: https://drive.google.com/open?id=0B8siaK6ZjvEwU21wNTdZS29kVXc
Trace after 4cd13c21b207: https://drive.google.com/open?id=0B8siaK6ZjvEwbXVzcnpieVkzWFU
(btw, if there's a preferred way to send the logs let me know. I
wasn't sure large or non-text attachments would be well received)

I'm not sure how much help the trace is, but it does look like the
system is spending far too much time in the ethernet device's IRQ
handler to be healthy.


Thanks a lot Brian

Can you confirm interrupt handler is smc911x_interrupt() ?

(ie : is SMC_USE_PXA_DMA / SMC_USE_DMA defined or not ?)


Looks like there's a few similarly named devices and drivers. Mine is
an SMSC LAN91C111 using the smc91x driver in
drivers/net/ethernet/smsc/smc91x.c, rather than smc911x.c. So the
interrupt handler is smc_interrupt()

CONFIG_ARCH_PXA is not set, nor is SMC_USE_PXA_DMA or SMC_USE_DMA.

-Brian



Thanks,
Brian


Thanks,

tglx