Re: [PATCH printk v3 07/19] printk: Add helpers for flush type logic

From: Petr Mladek
Date: Fri Jul 26 2024 - 11:51:32 EST


On Mon 2024-07-22 19:25:27, John Ogness wrote:
> There are many call sites where console flushing occur.
> Depending on the system state and types of consoles, the flush
> methods to use are different. A flush call site generally must
> consider:
>
> @have_boot_console
> @have_nbcon_console
> @have_legacy_console
> @legacy_allow_panic_sync
> is_printk_preferred()
>
> and take into account the current CPU state:
>
> NBCON_PRIO_NORMAL
> NBCON_PRIO_EMERGENCY
> NBCON_PRIO_PANIC
>
> in order to decide if it should:
>
> flush nbcon directly via atomic_write() callback
> flush legacy directly via console_unlock
> flush legacy via offload to irq_work
>
> All of these call sites use their own logic to make this
> decision, which is complicated and error prone. Especially
> later when two more flush methods will be introduced:
>
> flush nbcon via offload to kthread
> flush legacy via offload to kthread
>
> Introduce a new internal struct console_flush_type that
> specifies the flush method(s) that are available for a
> particular call site to use.
>
> Introduce helper functions to fill out console_flush_type to
> be used for emergency and non-emergency call sites.
>
> In many system states it is acceptable to flush legacy directly
> via console_unlock or via offload to irq_work. For this reason
> the non-emergency helper provides an argument @prefer_offload
> for the caller to specify which method it is interested in
> performing. (The helper functions will never allow both.)
>
> Replace the logic of all flushing call sites to use the new
> helpers. Note that this cleans up various corner cases where
> is_printk_preferred() and @have_boot_console were not being
> considerered before.
>
> Signed-off-by: John Ogness <john.ogness@xxxxxxxxxxxxx>

> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2330,9 +2331,13 @@ static bool legacy_allow_panic_sync;
> */
> void printk_legacy_allow_panic_sync(void)
> {
> + struct console_flush_type ft;
> +
> legacy_allow_panic_sync = true;
>
> - if (printing_via_unlock && !in_nmi()) {
> + printk_get_console_flush_type(&ft, false);
> +
> + if (ft.legacy_direct && !in_nmi()) {

in_nmi() check is not needed any longer. It is already done in
is_printk_deferred() in printk_get_console_flush_type().

> if (console_trylock())
> console_unlock();
> }
> @@ -2342,7 +2347,8 @@ asmlinkage int vprintk_emit(int facility, int level,
> const struct dev_printk_info *dev_info,
> const char *fmt, va_list args)
> {
> - bool do_trylock_unlock = printing_via_unlock;
> + struct console_flush_type ft;
> + bool defer_legacy = false;
> int printed_len;
>
> /* Suppress unimportant messages after panic happens */
> @@ -2360,41 +2366,19 @@ asmlinkage int vprintk_emit(int facility, int level,
> if (level == LOGLEVEL_SCHED) {
> level = LOGLEVEL_DEFAULT;
> /* If called from the scheduler, we can not call up(). */
> - do_trylock_unlock = false;
> + defer_legacy = true;
> }
>
> printk_delay(level);
>
> printed_len = vprintk_store(facility, level, dev_info, fmt, args);
>
> - if (have_nbcon_console && !have_boot_console) {
> - bool is_panic_context = this_cpu_in_panic();
> + printk_get_console_flush_type(&ft, false);

We could pass "defer_legacy" here. It won't be needed to check it
later then.

> - /*
> - * In panic, the legacy consoles are not allowed to print from
> - * the printk calling context unless explicitly allowed. This
> - * gives the safe nbcon consoles a chance to print out all the
> - * panic messages first. This restriction only applies if
> - * there are nbcon consoles registered.
> - */
> - if (is_panic_context)
> - do_trylock_unlock &= legacy_allow_panic_sync;
> + if (ft.nbcon_atomic)
> + nbcon_atomic_flush_pending();
>
> - /*
> - * There are situations where nbcon atomic printing should
> - * happen in the printk() caller context:
> - *
> - * - When this CPU is in panic.
> - *
> - * Note that if boot consoles are registered, the console
> - * lock/unlock dance must be relied upon instead because nbcon
> - * consoles cannot print simultaneously with boot consoles.
> - */
> - if (is_panic_context)
> - nbcon_atomic_flush_pending();

Just for record. If I get it correctly than this actually fixes a bug.
The original code called nbcon_atomic_flush_pending() only in panic().
The nbcon consoles were not flushed when there were only nbcon consoles
(printing_via_unlock == false).

I think that we did not notice it because it probably got fixed by
later patches adding the printk kthreads.

> - }
> -
> - if (do_trylock_unlock) {
> + if (!defer_legacy && ft.legacy_direct) {

@defer_legacy should not be needed if we passed it to
printk_get_console_flush_type().

> /*
> * The caller may be holding system-critical or
> * timing-sensitive locks. Disable preemption during
> @@ -2409,22 +2393,17 @@ asmlinkage int vprintk_emit(int facility, int level,
> * semaphore. The release will print out buffers. With the
> * spinning variant, this context tries to take over the
> * printing from another printing context.
> - *
> - * Skip it in EMERGENCY priority. The console will be
> - * explicitly flushed when exiting the emergency section.
> */
> - if (nbcon_get_default_prio() != NBCON_PRIO_EMERGENCY) {
> - if (console_trylock_spinning())
> - console_unlock();
> - }
> + if (console_trylock_spinning())
> + console_unlock();
>
> preempt_enable();
> }
>
> - if (do_trylock_unlock)
> - wake_up_klogd();
> - else
> + if ((defer_legacy && ft.legacy_direct) || ft.legacy_offload)

ditto.

> defer_console_output();
> + else
> + wake_up_klogd();
>
> return printed_len;
> }
> @@ -2728,10 +2707,15 @@ void resume_console(void)
> */
> static int console_cpu_notify(unsigned int cpu)
> {
> - if (!cpuhp_tasks_frozen && printing_via_unlock) {
> - /* If trylock fails, someone else is doing the printing */
> - if (console_trylock())
> - console_unlock();
> + struct console_flush_type ft;
> +
> + if (!cpuhp_tasks_frozen) {
> + printk_get_console_flush_type(&ft, false);
> +
> + if (ft.legacy_direct) {
> + if (console_trylock())
> + console_unlock();
> + }

One might be curious why we do not flush nbcon consoles here.
I guess that it will be less important after introducing
the kthreads.

Could it be called before the kthreads are started?

Anyway, this looks like a common pattern. Maybe, we could hide
it into some helper and be in the safe side:

/* Try to flush consoles directly when needed. */
void try_console_flush()
{
struct console_flush_type ft;

printk_get_console_flush_type(&ft, false);

if (ft.nbcon_atomic)
nbcon_atomic_flush_pending();

if (ft.legacy_direct)
console_flush_all(false, &next_seq, &handover);
}

> }
> return 0;
> }

Otherwise, it looks good.

Heh, I wondered several times if it was worth it. The struct
console_flush_type and printk_get_console_flush_type()
sometimes looked like an overkill. But I see many advantages:

+ As the commit message says, the decision how to flush the messages
depend on many variables. And it is nice to have it in one place.

+ The logic will get even more complicated after adding the
kthreads.

+ printk_get_console_flush_type() allows to change the behavior
in many situations in one place.

+ printk_get_console_flush_type() allows to easily find all
locations where we decide how to flush the messages. It helps
to check affected code paths.

So, I think that it is worth it in the end.

Note that I did not check the emergency code paths because
they are going to be reworked as discussed in the printk pull
request for 6.11, see
https://lore.kernel.org/r/ZqJKbcLgTeYRkDd6@xxxxxxxxxxxxxxx

Best Regards,
Petr