Re: [PATCH 2/2] kdb: Call vkdb_printf() from vprintk_default() only when wanted

From: Petr Mladek
Date: Wed Nov 23 2016 - 11:29:39 EST


On Tue 2016-11-22 14:32:39, Daniel Thompson wrote:
> On 22/11/16 12:45, Petr Mladek wrote:
> >On Mon 2016-11-07 10:24:22, Daniel Thompson wrote:
> >>On 21/10/16 13:50, Petr Mladek wrote:
> >>>kdb_trap_printk allows to pass normal printk() messages to kdb via
> >>>vkdb_printk(). For example, it is used to get backtrace using
> >>>the classic show_stack(), see kdb_show_stack().
> >>>
> >>>vkdb_printf() tries to avoid a potential infinite loop by disabling
> >>>the trap. But this approach is racy, for example:
> >>>
> >>>CPU1 CPU2
> >>>
> >>>vkdb_printf()
> >>> // assume that kdb_trap_printk == 0
> >>> saved_trap_printk = kdb_trap_printk;
> >>> kdb_trap_printk = 0;
> >>>
> >>> kdb_show_stack()
> >>> kdb_trap_printk++;
> >>
> >>When kdb is running any of the commands that use kdb_trap_printk
> >>there is a single active CPU and the other CPUs should be in a
> >>holding pen inside kgdb_cpu_enter().
> >>
> >>The only time this is violated is when there is a timeout waiting
> >>for the other CPUs to report to the holding pen.
> >
> >It means that the race window is small but it is there. Do I get
> >it correctly, please?
>
> I don't think the size of the race window is particularly
> interesting; it does exist.
>
> It is more important to be clear that the circumstances when the
> race exists are when an attempt to stop-the-world has failed. So
> rather than close the race it might be more useful to make the race
> unreachable by bringing the CPU that has failed to report under
> control.

Yup. The question is how the stop-the-world could fail and if we could
prevent all failures now and in the future.

> vkdb_printf()'s rough job is to:
>
> 1. Issue the string to the kdb console (not the same as kernel
> console)
> 2. Store the string in the kernel logs.
>
> Of the above, #2 is achieved by temporarily setting the console log
> level to 0, calling printk() and restoring the log level. This is
> why vkdb_printf() needs the silly dance to avoid the recursion.
>
> Right now I'm toying with the idea of using the printk_func code for
> printk() interception.
>
> Firstly printk_func is per-cpu which removes a huge chain of complex
> reasoning (for a case that doesn't really exist).
>
> Secondly we could call the saved_printk_func directly making
> recursion impossible.

Note that there are plans to define per-CPU printk_context variable,
see https://lkml.kernel.org/r/20161027154933.1211-4-sergey.senozhatsky@xxxxxxxxx
It will be used to choose the right printk_func. It will allow
to have more contexts and enter some of them recursively.

Then it should be easier to add support for KDB context.


> Sadly, if we accept that a rouge CPU might still be calling printk()
> then the approach above just opens up a race on the console log
> level...

We are also very close to add async printk. It will allow to
delegate the console handling into separate kthread. Then printk()
should not longer call console directly. The only exception would
be a system panic, when we really want to get messages out ASAP.

This last patchset can be seen at
https://lkml.kernel.org/r/1461333180-2897-1-git-send-email-sergey.senozhatsky@xxxxxxxxx
But Sergey plans to get in the printk_context/printk_safe stuff first.

Anyway, then it should be easier to disable the console in kdb
context.

Best Regards,
Petr

PS: I am going to prepare v2 of this patchset. It might take
some time as I am repeatedly interrupted by some other urgent work.