Re: Regression: Failed boots bisected to 4cd13c21b207 "softirq: Let ksoftirqd do its job"

From: Brian Starkey
Date: Wed Nov 16 2016 - 13:02:08 EST


Hi Eric,

On Wed, Nov 16, 2016 at 07:52:42AM -0800, Eric Dumazet wrote:
On Wed, Nov 16, 2016 at 5:55 AM, Brian Starkey <brian.starkey@xxxxxxx> wrote:
Hi,

I'm running an ARM FVP (virtual platform - simulated hardware), which
is failing to reach a login prompt due to extremely slow progress
during boot. systemd gives up waiting for the ttyAMA0 device to
appear, and never starts the getty.

I've bisected this to commit 4cd13c21b207 "softirq: Let ksoftirqd do
its job".

Without this commit, the system boots to a login prompt in 2 minutes.
With this commit, the system eventually manages to bring up sshd after
22 minutes, but as mentioned, the dev-ttyAMA0.device unit has timed
out and so I don't get a prompt on my console.

I only hit the issue when my rootfs is mounted over NFS, and with only
a single core enabled. The (simulated) network device is an SMC91C111.
With multiple cores enabled or a non-NFS filesystem, everything seems
to work OK.

I don't have an identical real hardware platform to try, but I
could not reproduce it on a real ARM Juno board, which is similar.

It looks from the logs that udev's workers are unable to make
progress, so the device nodes don't get created. Don't pay too much
attention to the timestamps in the logs below, they are "inside" the
virtual platform, and don't reflect wall-clock time.

Log before 4cd13c21b207:
https://drive.google.com/open?id=0B8siaK6ZjvEwMktoa0NUS2hJd1U
Log after 4cd13c21b207:
https://drive.google.com/open?id=0B8siaK6ZjvEwZXlfeFFSQl9xZTQ
Kernel config: arch/arm64/configs/defconfig

I'm not sure how to debug this further, so if you have any suggestions
I'd be glad to hear them.

Many thanks,
Brian


Hi Brian.

Thanks a lot for this report.

If the issue triggers only when using one core, it is possible one
driver has a dependency on softirqs being serviced during an
initialization loop.

If the thread is not yielding the cpu (holding something like a
spinlock, thus disabling preemption), then ksoftirqd might not be able
to run on the (same) cpu.


The smc91x driver does seem to have some trickiness around softirqs.
I'm not familiar with net drivers, but I'll see if I can figure
anything out there.

I sent a patch for busy polling yesterday, but I am almost certain
this would not fix your issue (assuming you have CONFIG_PREEMPT).

https://patchwork.ozlabs.org/patch/695185/

You're right in saying that this didn't help.

Thanks,
Brian