Re: [PATCH printk v5 26/30] printk: nbcon: Implement emergency sections

From: Petr Mladek
Date: Tue May 21 2024 - 09:38:21 EST

In reply to: Petr Mladek: "Re: [PATCH printk v5 26/30] printk: nbcon: Implement emergency sections"

On Thu 2024-05-02 23:44:35, John Ogness wrote:
> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>
> In emergency situations (something has gone wrong but the
> system continues to operate), usually important information
> (such as a backtrace) is generated via printk(). Each
> individual printk record has little meaning. It is the
> collection of printk messages that is most often needed by
> developers and users.
>
> In order to help ensure that the collection of printk messages
> in an emergency situation are all stored to the ringbuffer as
> quickly as possible, disable console output for that CPU while
> it is in the emergency situation. The consoles need to be
> flushed when exiting the emergency situation.
>
> Add per-CPU emergency nesting tracking because an emergency
> can arise while in an emergency situation.
>
> Add functions to mark the beginning and end of emergency
> sections where the urgent messages are generated.
>
> Do not print if the current CPU is in an emergency state.
>
> When exiting all emergency nesting, flush nbcon consoles
> directly using their atomic callback. Legacy consoles are
> triggered for flushing via irq_work because it is not known
> if the context was safe for a trylock on the console lock.
>
> Note that the emergency state is not system-wide. While one CPU
> is in an emergency state, another CPU may continue to print
> console messages.
>
> Co-developed-by: John Ogness <john.ogness@xxxxxxxxxxxxx>
> Signed-off-by: John Ogness <john.ogness@xxxxxxxxxxxxx>
> Signed-off-by: Thomas Gleixner (Intel) <tglx@xxxxxxxxxxxxx>

> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -1199,6 +1228,93 @@ void nbcon_atomic_flush_unsafe(void)
> 	__nbcon_atomic_flush_pending(prb_next_reserve_seq(prb), true);
> }
>
> +/**
> + * nbcon_cpu_emergency_enter - Enter an emergency section where printk()
> + * messages for that CPU are only stored
> + *
> + * Upon exiting the emergency section, all stored messages are flushed.
> + *
> + * Context: Any context. Disables preemption.
> + *
> + * When within an emergency section, no printing occurs on that CPU. This
> + * is to allow all emergency messages to be dumped into the ringbuffer before
> + * flushing the ringbuffer. The actual printing occurs when exiting the
> + * outermost emergency section.
> + */
> +void nbcon_cpu_emergency_enter(void)
> +{
> +	unsigned int *cpu_emergency_nesting;
> +
> +	preempt_disable();
> +
> +	cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
> +	(*cpu_emergency_nesting)++;
> +}
> +
> +/**
> + * nbcon_cpu_emergency_exit - Exit an emergency section and flush the
> + * stored messages
> + *
> + * Flushing only occurs when exiting all nesting for the CPU.
> + *
> + * Context: Any context. Enables preemption.
> + */
> +void nbcon_cpu_emergency_exit(void)
> +{
> +	unsigned int *cpu_emergency_nesting;
> +	bool do_trigger_flush = false;
> +
> +	cpu_emergency_nesting = nbcon_get_cpu_emergency_nesting();
> +
> +	/*
> +	 * Flush the messages before enabling preemption to see them ASAP.
> +	 *
> +	 * Reduce the risk of potential softlockup by using the
> +	 * flush_pending() variant which ignores messages added later. It is
> +	 * called before decrementing the counter so that the printing context
> +	 * for the emergency messages is NBCON_PRIO_EMERGENCY.
> +	 */
> +	if (*cpu_emergency_nesting == 1) {
> +		nbcon_atomic_flush_pending();
> +		do_trigger_flush = true;
> +	}

The commit message says:

"Legacy consoles are triggered for flushing via irq_work because
it is not known if the context was safe for a trylock on the
console lock."

I do not feel completely comfortable with this. If printk() knows
when it is safe to call console_trylock() then we should know as
well.

IMHO, we could just call nbcon_cpu_emergency_flush(), which is implemented
below, at this point instead.
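
To make the nesting and the flush on the outermost exit concrete, here is
a rough user-space model in C. It is only a sketch, not kernel code: every
model_* name and counter is a hypothetical stand-in, it models a single
CPU, and it ignores preemption, console ownership, and legacy consoles.

```c
/* Hypothetical single-CPU model; none of these names exist in the kernel. */
static unsigned int emergency_nesting;	/* models the per-CPU nesting counter */
static unsigned int stored_records;	/* stored in the ringbuffer, not yet printed */
static unsigned int flushed_records;	/* already pushed out to the consoles */

static void model_emergency_enter(void)
{
	emergency_nesting++;
}

/* Models printk(): always store, print right away only outside emergency. */
static void model_printk(void)
{
	stored_records++;
	if (emergency_nesting == 0) {
		flushed_records += stored_records;
		stored_records = 0;
	}
}

/* Models the explicit flush: it does nothing outside an emergency section. */
static void model_emergency_flush(void)
{
	if (emergency_nesting == 0)
		return;

	flushed_records += stored_records;
	stored_records = 0;
}

static void model_emergency_exit(void)
{
	/* Flush while still counted as "in emergency", then drop one level. */
	if (emergency_nesting == 1)
		model_emergency_flush();

	if (emergency_nesting > 0)
		emergency_nesting--;
}
```

In this model, records stored inside nested sections stay unprinted until
the outermost model_emergency_exit() runs, which is the behavior I would
expect from calling the flush helper at this point.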

> +
> +	(*cpu_emergency_nesting)--;
> +
> +	if (WARN_ON_ONCE(*cpu_emergency_nesting < 0))
> +		*cpu_emergency_nesting = 0;
> +
> +	preempt_enable();
> +
> +	if (do_trigger_flush)
> +		printk_trigger_flush();
> +}
> +
> +/**
> + * nbcon_cpu_emergency_flush - Explicitly flush consoles while
> + * within emergency context
> + *
> + * Both nbcon and legacy consoles are flushed.
> + *
> + * It should be used only when there are too many messages printed
> + * in emergency context, for example, printing backtraces of all
> + * CPUs or processes. It is typically needed when the watchdogs
> + * need to be touched as well.
> + */
> +void nbcon_cpu_emergency_flush(void)
> +{
> +	/* The explicit flush is needed only in the emergency context. */
> +	if (*(nbcon_get_cpu_emergency_nesting()) == 0)
> +		return;
> +
> +	nbcon_atomic_flush_pending();
> +
> +	if (printing_via_unlock && !in_nmi()) {
> +		if (console_trylock())
> +			console_unlock();
> +	}

Hmm, we should also check whether we are in printk_safe context.
We could implement:

bool is_printk_deferred(void)
{
	/*
	 * The per-CPU variable @printk_context can be read safely in any
	 * context. CPU migration is always disabled when it is set.
	 */
	return this_cpu_read(printk_context) || in_nmi();
}

How does that sound?
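
As a rough illustration of how such a check could guard the direct legacy
flush, here is a user-space sketch in C. It is not kernel code: all
model_* names are hypothetical stand-ins, the console lock is modeled by
an atomic flag, and the deferred and NMI states are plain variables that
the caller sets by hand.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* All names below are hypothetical user-space stand-ins, not kernel APIs. */
static int model_printk_context;	/* non-zero models a printk_safe context */
static bool model_in_nmi;		/* models in_nmi() */
static atomic_flag model_console_lock = ATOMIC_FLAG_INIT;
static unsigned int model_legacy_flushes;

static bool model_is_printk_deferred(void)
{
	return model_printk_context != 0 || model_in_nmi;
}

static bool model_console_trylock(void)
{
	/* test_and_set returns the previous value: false means we got the lock. */
	return !atomic_flag_test_and_set(&model_console_lock);
}

static void model_console_unlock(void)
{
	/* In this model, releasing the lock is where legacy consoles get flushed. */
	model_legacy_flushes++;
	atomic_flag_clear(&model_console_lock);
}

/* Returns true when the legacy consoles were flushed directly. */
static bool model_flush_legacy(void)
{
	if (model_is_printk_deferred())
		return false;

	if (!model_console_trylock())
		return false;

	model_console_unlock();
	return true;
}
```

The point of the sketch is only the ordering: the deferred check comes
first, so the trylock is never attempted from a context where it would
not be safe.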

Best Regards,
Petr
