Re: printk: selective deactivation of feature ignoring non panic cpu's messages

From: Petr Mladek
Date: Wed Feb 26 2025 - 08:59:23 EST


On Wed 2025-02-26 05:31:53, John Ogness wrote:
> Hi Donghyeok,
>
> On 2025-02-26, Donghyeok Choe <d7271.choe@xxxxxxxxxxx> wrote:
> > I would like to print out the message of non panic cpu as it is.
> > Can I use early_param to selectively disable that feature?
>
> I have no issues about allowing this type of feature for debugging
> purposes.

Yes. It makes sense. Another scenario might be when
panic_other_cpus_shutdown() is not able to stop some CPUs.
It might be useful to see messages from the problematic ones.

> I do not know if early_param is the best approach. I expect
> Petr will offer good insight here.

early_param() looks good to me. There are already similar early
parameters, for example, "ignore_loglevel".


> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index fb242739aec8..3f420e8bdb2c 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2368,6 +2368,17 @@ void printk_legacy_allow_panic_sync(void)
> > }
> > }
> >
> > +static bool __read_mostly keep_printk_all_cpu_in_panic;
> > +
> > +static int __init keep_printk_all_cpu_in_panic_setup(char *str)
> > +{
> > +	keep_printk_all_cpu_in_panic = true;
> > +	pr_info("printk: keep printk all cpu in panic.\n");
> > +
> > +	return 0;
> > +}
> > +early_param("keep_printk_all_cpu_in_panic", keep_printk_all_cpu_in_panic_setup);
>
> Quite a long argument. I am horrible at naming. I expect Petr would have
> a good suggestion (if early_param is the way to go).

Heh. It seems to be hard to find a good name ;-)

Anyway, I would use the "printk_" prefix to make it clear that
it is printk-related. The following come to mind:

+ printk_allow_non_panic_cpus
+ printk_keep_non_panic_cpus
+ printk_debug_non_panic_cpus

I prefer "printk_debug_non_panic_cpus", see below.
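
To illustrate how such a parameter would behave, here is a minimal
userspace sketch. It is not kernel code: parse_cmdline() is a toy
stand-in for the kernel's command-line parsing, and the flag and
handler names are only assumptions based on the name preferred above.
In the kernel, the handler would be registered with early_param().

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag, named after the preferred parameter. */
bool printk_debug_non_panic_cpus;

/* Stand-in for the handler that early_param() would register. */
int printk_debug_non_panic_cpus_setup(char *str)
{
	(void)str;	/* The parameter takes no value. */
	printk_debug_non_panic_cpus = true;
	return 0;
}

/*
 * Toy replacement for the kernel's command-line parser: call the
 * handler when the bare parameter name appears as a whole word.
 */
void parse_cmdline(const char *cmdline)
{
	static const char name[] = "printk_debug_non_panic_cpus";
	size_t len = sizeof(name) - 1;
	const char *p = cmdline;

	while ((p = strstr(p, name)) != NULL) {
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends) {
			printk_debug_non_panic_cpus_setup(NULL);
			return;
		}
		p += len;
	}
}
```

The flag stays false unless the exact name is passed, the same way
"ignore_loglevel" is only enabled when given on the command line.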


> >  asmlinkage int vprintk_emit(int facility, int level,
> >  			    const struct dev_printk_info *dev_info,
> >  			    const char *fmt, va_list args)
> > @@ -2379,13 +2390,15 @@ asmlinkage int vprintk_emit(int facility, int level,
> >  	if (unlikely(suppress_printk))
> >  		return 0;
> >
> > -	/*
> > -	 * The messages on the panic CPU are the most important. If
> > -	 * non-panic CPUs are generating any messages, they will be
> > -	 * silently dropped.
> > -	 */
> > -	if (other_cpu_in_panic() && !panic_triggering_all_cpu_backtrace)
> > -		return 0;
> > +	if (!keep_printk_all_cpu_in_panic) {
> > +		/*
> > +		 * The messages on the panic CPU are the most important. If
> > +		 * non-panic CPUs are generating any messages, they will be
> > +		 * silently dropped.
> > +		 */
> > +		if (other_cpu_in_panic() && !panic_triggering_all_cpu_backtrace)
> > +			return 0;
> > +	}
>
> I would not nest it. Just something like:
>
> 	/*
> 	 * The messages on the panic CPU are the most important. If
> 	 * non-panic CPUs are generating any messages, they may be
> 	 * silently dropped.
> 	 */
> 	if (!keep_printk_all_cpu_in_panic &&
> 	    !panic_triggering_all_cpu_backtrace &&
> 	    other_cpu_in_panic()) {
> 		return 0;
> 	}

I would prefer this form as well.
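
For reference, the flattened condition can be modeled in userspace
like this. It is only a sketch: the struct and function names are
made up for illustration, and the booleans stand in for the kernel
helpers and flags named in the patch.

```c
#include <stdbool.h>

/* Illustrative snapshot of the state checked in vprintk_emit(). */
struct panic_state {
	bool other_cpu_in_panic;			/* other_cpu_in_panic() */
	bool panic_triggering_all_cpu_backtrace;
	bool debug_non_panic_cpus;			/* the new option */
};

/* True when a message from this CPU would be silently dropped. */
bool should_drop_message(const struct panic_state *s)
{
	return !s->debug_non_panic_cpus &&
	       !s->panic_triggering_all_cpu_backtrace &&
	       s->other_cpu_in_panic;
}
```

The message is dropped only when all three conditions hold, so
setting the new option keeps the messages in every case.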

Thinking out loud:

I wonder if this is actually safe. I recall that we simplified the
design in some places because we expected that non-panic CPUs would not
add messages. I am not sure that I found all the locations. But
we might want to revisit:


1st problem: _prb_read_valid() skips non-finalized records on non-panic CPUs.

opinion: We should not do it in this case.
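
A toy model of what I mean. It is not the ringbuffer code; it assumes
that the skip happens when the reader runs on the panic CPU and more
records were reserved behind the non-finalized one. The function and
enum names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum read_action {
	READ_STOP,	/* wait: the record might still get finalized */
	READ_SKIP,	/* give up on this record and try the next one */
};

/* Decide what a reader does when record @seq is not finalized. */
enum read_action on_unfinalized_record(bool reader_is_panic_cpu,
				       bool debug_non_panic_cpus,
				       uint64_t seq,
				       uint64_t next_reserve_seq)
{
	/*
	 * Skipping assumes that the writer was stopped and will never
	 * finalize the record. That does not hold when non-panic CPUs
	 * are allowed to keep adding messages.
	 */
	if (reader_is_panic_cpu && !debug_non_panic_cpus &&
	    seq + 1 < next_reserve_seq)
		return READ_SKIP;

	return READ_STOP;
}
```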


2nd problem: Is _prb_read_valid() actually safe when
panic_triggering_all_cpu_backtrace is true?

opinion: It should be safe because the backtraces from different CPUs
are serialized via printk_cpu_sync_get_irqsave().


3rd problem: nbcon_get_default_prio() returns NBCON_PRIO_NORMAL on
non-panic CPUs. As a result, printk_get_console_flush_type()
would suggest flushing as if the system were working normally.

But the legacy-loop will bail out after flushing one
message on one console, see console_flush_all(). It is weird
behavior.

Another question is who would flush the messages when the panic()
CPU does not reach the explicit flush.

opinion: We should probably try to flush the messages on non-panic
CPUs in this mode when safe. This is why I prefer the name
"printk_debug_non_panic_cpus".

We should update console_flush_all() so that it does not bail out
when the new option is set.

We should call nbcon_atomic_flush_pending() on non-panic CPUs
when the new option is set. printk_get_console_flush_type()
should behave as it does with NBCON_PRIO_EMERGENCY.

Maybe nbcon_get_default_prio() should actually return
NBCON_PRIO_EMERGENCY on non-panic CPUs when this option is set.
It would allow the non-panic CPUs to take over the nbcon context
from the potentially frozen kthread.
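
A rough userspace sketch of the priority idea. The enum mirrors the
NBCON_PRIO_* names, but struct cpu_state and default_prio() are only
illustrative stand-ins, and the emergency-section check is my
assumption about how the existing function picks the priority.

```c
#include <stdbool.h>

enum nbcon_prio {
	NBCON_PRIO_NONE = 0,
	NBCON_PRIO_NORMAL,
	NBCON_PRIO_EMERGENCY,
	NBCON_PRIO_PANIC,
};

/* Illustrative per-CPU view of the system state. */
struct cpu_state {
	bool this_cpu_in_panic;
	bool other_cpu_in_panic;
	bool in_emergency_section;
	bool debug_non_panic_cpus;	/* the new option */
};

/* Model of the default priority with the proposed change. */
enum nbcon_prio default_prio(const struct cpu_state *s)
{
	if (s->this_cpu_in_panic)
		return NBCON_PRIO_PANIC;

	if (s->in_emergency_section)
		return NBCON_PRIO_EMERGENCY;

	/* Proposed: let non-panic CPUs take over the context. */
	if (s->debug_non_panic_cpus && s->other_cpu_in_panic)
		return NBCON_PRIO_EMERGENCY;

	return NBCON_PRIO_NORMAL;
}
```

Without the option, a non-panic CPU still gets NBCON_PRIO_NORMAL, so
the current behavior would not change by default.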


Best Regards,
Petr