Re: [PATCH] printk: avoid livelock if another CPU printks continuously
From: Petr Mladek
Date: Wed Feb 10 2016 - 09:44:27 EST
Sent again with a correct mail header, hopefully.
On Mon 2016-02-08 21:35:03, Denys Vlasenko wrote:
> At the end of each printk(), kernel attempts to take console_sem.
> If this succeeds, it feeds buffered message data to console devices
> until there is nothing left, and releases console_sem:
>
> 	if (console_trylock_for_printk(this_cpu))
> 		console_unlock();
>
> The livelock exists because code in console_unlock() has no
> limit on the amount of buffered data it would process under
> console_sem. This is bad if printk() was called with IRQs disabled.
>
> This patch makes console_unlock() release console_sem after 5
> iterations, which usually amounts to 5 lines of printk messages,
> and give other printk'ing CPUs a chance to acquire console_sem.
>
> If some CPU grabs it, console_unlock() finishes.
> If no one takes the semaphore, console_unlock() re-acquires it
> and loops back for another cycle of console output.
>
> This seems to be a hard-to-trigger, but long-existing problem:
Yup, and there are more people trying to handle this. I have added
some of them to CC.
Sadly, the problem is much more complicated than it looks. Jan Kara
(jack) has already provided many possible solutions that were not
accepted. The last one can be seen at
http://thread.gmane.org/gmane.linux.kernel/2105183/focus=2113787
See below for some comments on your approach.
> Signed-off-by: Denys Vlasenko <dvlasenk@xxxxxxxxxx>
> CC: linux-kernel@xxxxxxxxxxxxxxx
> CC: srostedt@xxxxxxxxxx
> CC: Steven Rostedt <rostedt@xxxxxxxxxxx>
> CC: Tejun Heo <tj@xxxxxxxxxx>
> CC: Peter Hurley <peter@xxxxxxxxxxxxxxxxxx>
> ---
> kernel/printk/printk.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index c963ba5..ca4f9d55 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2235,6 +2235,7 @@ void console_unlock(void)
> 	unsigned long flags;
> 	bool wake_klogd = false;
> 	bool do_cond_resched, retry;
> +	unsigned cnt;
>
> 	if (console_suspended) {
> 		up_console_sem();
> @@ -2257,6 +2258,7 @@ void console_unlock(void)
> 	/* flush buffered message fragment immediately to console */
> 	console_cont_flush(text, sizeof(text));
> again:
> +	cnt = 5;
> 	for (;;) {
> 		struct printk_log *msg;
> 		size_t ext_len = 0;
> @@ -2284,6 +2286,9 @@ skip:
> 		if (console_seq == log_next_seq)
> 			break;
>
> +		if (--cnt == 0)
> +			break; /* Someone else printk's like crazy */
> +
> 		msg = log_from_idx(console_idx);
> 		if (msg->flags & LOG_NOCONS) {
> 			/*
> @@ -2350,6 +2355,26 @@ skip:
> 	if (retry && console_trylock())
> 		goto again;
>
> +	if (cnt == 0) {
> +		/*
> +		 * Other CPU(s) printk like crazy, filling log_buf[].
> +		 * Try to get rid of the "honor" of servicing their data:
> +		 * give _them_ time to grab console_sem and start working.
> +		 */
> +		cnt = 9999;
> +		while (--cnt != 0) {
> +			cpu_relax();
> +			if (console_seq == log_next_seq) {
This condition becomes true only when all available messages have been
printed to the console, which means that there is nothing left to do at
all, so the check triggers quite late. A much better solution would be
to store console_seq in a local variable and check whether it is being
modified by another CPU.
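Something like this completely untested sketch as a replacement for the
inner busy wait; "prev_seq" is just a new local variable used for
illustration:

	u64 prev_seq = console_seq;

	cnt = 9999;
	while (--cnt != 0) {
		cpu_relax();
		if (console_seq != prev_seq) {
			/* Another CPU took console_sem and is printing */
			goto out;
		}
	}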
> +				/* Good, other CPU entered "for(;;)" loop */
> +				goto out;
> +			}
> +		}
> +		/* No one seems to be willing to take it... */
> +		if (console_trylock())
> +			goto again; /* we took it */
> +		/* Nope, someone else holds console_sem! Good */
This cycle gives other CPUs a good chance to enter console_unlock(),
which means that more CPUs might end up in the busy wait above. It does
give a chance to move the printing to another CPU, and it likely slows
down the flood of messages because the producer ends up here as well.
So, it probably works, but the performance is far from optimal. Many
CPUs might end up doing nothing. I am afraid that this is not the right
way to go.
Best Regards,
Petr