Re: [Query] Preemption (hogging) of the work handler
From: Viresh Kumar
Date: Mon Jul 11 2016 - 18:35:13 EST
Hi Sergey and Jan,
On 12-07-16, 00:44, Sergey Senozhatsky wrote:
> right. apart from cases when the existing console_unlock() behaviour can
> simply "block" a process to flush the log_buf to slow serial consoles
> (regardless the process execution context) and make the system less
> responsive, I have around ~10 absolutely different scenarios on my list that
> may cause soft/hard lockups, rcu stalls, oom-s, etc. and console_unlock() is
> the root cause there. the simplest ones involve heavy printk() usage, the
> trickier ones do not necessarily have anything that is abusing printk(): a
> moderate printk() pressure coming from other CPUs on the system and more or
> less active tty -> UART can do the trick, because uart interrupt service
> routine and call_console_drivers()->write() have to compete for the same
> uart port spin_lock. soft lockups are probably the most common problems,
> though, it's not all that easy to catch, because watchdog does not ring
> the bell straight after preempt_enable(), but from hrtimer interrupt, that
> happens approx every 4 seconds. by this time CPU can be somewhere far away
> from console_unlock(). I had an idea of doing watchdog soft lockup check
> from preempt_enable(), when it brings preempt_count down to zero, but not
> sure I can recall how well did it go.
Thanks for your feedback, guys. I have one more blocking issue
where I need your help/advice.
So, the excess printing in our case is done in parallel to system
suspend. And that can very much happen after all the non-boot CPUs are
offlined.
Sometimes, the platform doesn't come back after suspend. I have tried
enabling no-console-suspend and the last line it prints is:
    Disabling non-boot CPUs
And nothing after that at all. We have to forcefully reboot the phone
after that. Switching to synchronous printing (using
echo 1 > /sys/module/printk/parameters/synchronous) fixes that issue.
So, asynchronous printing has an issue that only we are hitting.
It looks like all the CPUs are gone except CPU0, and that CPU is
hogged by the printk thread while it also has to suspend the
system, and something eventually goes wrong.
I am only using the 3 patches from the V12 version of the series.
--
viresh