Re: [PATCH printk v4 17/27] printk: nbcon: Use nbcon consoles in console_flush_all()

From: John Ogness
Date: Thu Apr 18 2024 - 17:45:11 EST


On 2024-04-18, Petr Mladek <pmladek@xxxxxxxx> wrote:
>> > Solve this problem by introducing[*] nbcon_atomic_flush_all()
>> > which would flush even newly added messages and
>> > call this in nbcon_cpu_emergency_exit() when the printk
>> > kthread does not work. It should bail out when there
>> > is a panic in progress.
>> >
>> > Motivation: It does not matter which "atomic" context
>> > flushes NORMAL/EMERGENCY messages when
>> > the printk kthread is not available.
>>
>> I do not think that solves the problem. If the console is in an unsafe
>> section, nothing can be printed.
>
> IMHO, it solves the problem.
>
> The idea is simple:
>
> "The current nbcon console owner will be responsible for flushing
> all messages when the printk kthread does not exist."

Currently this is not valid. It assumes owners are printers. That is not
always true. The owner might be some task that is modifying the baud
rate and has nothing to do with printing.

> The proof is more complicated:
>
> 1. Let's put aside panic. We already do the best effort there.
>
> 2. Emergency mode currently violates the rule because
> nbcon_atomic_flush_pending() ignores the simple rule.
>
> => FIX: improve nbcon_cpu_emergency_exit() to flush
> all messages when kthreads are not ready.

Emergency mode cannot flush _anything_ if there is an owner in an unsafe
region. (And that owner may not be a printer.)
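
To illustrate the point, here is a minimal user-space sketch of that
restriction. It is not kernel code: every sk_* name is hypothetical, and
it assumes a simplified rule where a higher priority may take over only
when the current owner is outside an unsafe section. Waiting, handover
and any special panic handling are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy types; not the kernel's nbcon structures. */
enum sk_prio {
	SK_PRIO_NONE = 0,	/* console is unowned */
	SK_PRIO_NORMAL,
	SK_PRIO_EMERGENCY,
	SK_PRIO_PANIC,
};

struct sk_console {
	enum sk_prio owner;	/* priority of the current owner */
	bool unsafe;		/* owner is inside an unsafe section */
};

/*
 * Try to become the owner so that records can be printed.
 * Simplified assumption: no waiting and no handover request.
 */
bool sk_try_acquire(struct sk_console *con, enum sk_prio req)
{
	assert(req != SK_PRIO_NONE);

	if (con->owner == SK_PRIO_NONE) {
		con->owner = req;
		con->unsafe = false;
		return true;
	}

	/* Same or lower priority never displaces the owner. */
	if (req <= con->owner)
		return false;

	/* Higher priority, but the owner is mid-update: back off. */
	if (con->unsafe)
		return false;

	con->owner = req;
	return true;
}
```

With a PRIO_NORMAL owner that is in an unsafe section (for example, the
baud rate change), an emergency caller gets false back and has nothing
it can print.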

> 3. Normal mode flushes nbcon consoles via
> nbcon_legacy_emit_next_record() from console_unlock()
> before the kthreads are started.
>
> It is not reliable when nbcon_try_acquire() fails.
> But it would fail only when there is another user.
> The other owner might be:
>
> + panic: will handle everything
>
> + emergency: should flush everything [*]
>
> + normal: can't happen because of con->device() lock.

As the code is now, "normal" does not imply con->device() lock. When
using con->write_atomic(), we do not (and cannot) use the con->device()
lock. So normal _can_ fail in nbcon_legacy_emit_next_record() if another
CPU is adjusting the baud rate. This is why I said the problem with
"emergency" is basically the same problem as "normal" (WRT using
write_atomic()).

> => The only remaining problem is to fix nbcon_atomic_flush_pending()
> to flush everything when the kthreads are not started yet.

I think my proposed change handles it better. I have been doing various
tests and also adjusted it a bit.

The flushing fails because another context owns the console. So I think
it makes sense for that owner to handle flushing responsibility when
releasing ownership (even if that context was just changing the baud
rate).

[ Keep in mind we are only talking about printing via write_atomic()
when the kthread is not available. ]

If the current owner is a normal printing context, it will print to
completion anyway (via console_flush_all()).

If the current owner is an emergency printing context, it will only
print the emergency messages (as PRIO_EMERGENCY). However, when it
releases ownership, it could flush the remaining records (as
PRIO_NORMAL) in the same fashion as console_flush_all() does it.

If the current owner is a non-printing context, when it releases
ownership, it could flush the remaining records (as PRIO_NORMAL) in the
same fashion as console_flush_all() does it.

So what I am proposing is that we add two new normal-flushing sites that
are only used when the kthread is not available:

1. after exiting emergency mode: in nbcon_cpu_emergency_exit()

2. after releasing ownership for non-printing: in nbcon_driver_release()

I think this will close the gap and it does not require irq_work.
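
A rough, single-threaded user-space model of site 2 might look like the
following. All toy_* names are made up for illustration, the "printing"
is only a sequence counter, and the real code would also have to deal
with races around the release. Site 1 would follow the same pattern on
emergency exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy console; not the kernel's struct console. */
struct toy_console {
	bool owned;		/* some context holds ownership */
	uint64_t next_seq;	/* next record to print */
	uint64_t head_seq;	/* one past the newest stored record */
	bool kthread_available;
};

static bool toy_try_acquire(struct toy_console *con)
{
	if (con->owned)
		return false;
	con->owned = true;
	return true;
}

/* Flush the full backlog as a normal printer (scenario: no kthread). */
static bool toy_flush_normal(struct toy_console *con)
{
	if (!toy_try_acquire(con))
		return false;	/* current owner inherits the duty */
	while (con->next_seq < con->head_seq)
		con->next_seq++;	/* stands in for write_atomic() */
	con->owned = false;
	return true;
}

/* Store a record; without a kthread the caller tries to flush itself. */
static void toy_printk(struct toy_console *con)
{
	con->head_seq++;
	if (!con->kthread_available)
		toy_flush_normal(con);
}

/* Proposed site 2: a non-printing owner flushes after releasing. */
static void toy_driver_release(struct toy_console *con)
{
	assert(con->owned);
	con->owned = false;
	if (!con->kthread_available)
		toy_flush_normal(con);
}

/* Returns how many records are left unprinted after a baud change. */
uint64_t toy_demo_baud_change(bool kthread_available)
{
	struct toy_console con = { .kthread_available = kthread_available };

	toy_try_acquire(&con);		/* driver starts changing baud rate */
	toy_printk(&con);		/* record arrives; flush attempt fails */
	toy_driver_release(&con);	/* trailing flush closes the gap */

	return con.head_seq - con.next_seq;
}
```

In this model toy_demo_baud_change(false) leaves nothing pending, while
toy_demo_baud_change(true) leaves one record for the (unmodelled)
kthread to print.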

> Sigh, all this is so complicated. I wonder how to document
> this so that other people do not have to discover these
> dependencies again and again. Is it even possible?

In the end we will have the following 5 scenarios (assuming my
proposal):

1. PRIO_NORMAL non-printing activity, always under con->device() lock,
upon release flushes[*] full backlog

2. PRIO_NORMAL printing using write_thread(), always called from task
context and under con->device() lock, always flushes full backlog

3. PRIO_NORMAL printing using write_atomic(), called with hardware
interrupts disabled, always flushes full backlog (only used when the
kthread is not available)

4. PRIO_EMERGENCY printing using write_atomic(), called with hardware
interrupts disabled, flushes through emergency, upon exit flushes[*]
full backlog

5. PRIO_PANIC printing using write_atomic(), called with hardware
interrupts disabled, flushes full backlog

[*] Only when the kthread is not available. And in that case #3 is the
scenario used for the trailing full flushing by #1 and #4.
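
For documentation purposes, the list above could also be written down as
a small table. This is only a sketch: the tbl_* types and field names
are invented here and do not exist in the kernel, and the values are
transcribed from the five scenarios as described.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names, used only to tabulate the five scenarios. */
enum tbl_prio { TBL_PRIO_NORMAL, TBL_PRIO_EMERGENCY, TBL_PRIO_PANIC };
enum tbl_writer { TBL_NO_WRITE, TBL_WRITE_THREAD, TBL_WRITE_ATOMIC };

struct tbl_scenario {
	int id;
	enum tbl_prio prio;
	enum tbl_writer writer;
	bool device_lock;	/* runs under con->device() lock */
	bool trailing_flush;	/* [*] full flush on release/exit */
	bool fallback_only;	/* only used without the kthread */
};

static const struct tbl_scenario tbl_scenarios[] = {
	{ 1, TBL_PRIO_NORMAL,    TBL_NO_WRITE,     true,  true,  false },
	{ 2, TBL_PRIO_NORMAL,    TBL_WRITE_THREAD, true,  false, false },
	{ 3, TBL_PRIO_NORMAL,    TBL_WRITE_ATOMIC, false, false, true  },
	{ 4, TBL_PRIO_EMERGENCY, TBL_WRITE_ATOMIC, false, true,  false },
	{ 5, TBL_PRIO_PANIC,     TBL_WRITE_ATOMIC, false, false, false },
};

#define TBL_COUNT (sizeof(tbl_scenarios) / sizeof(tbl_scenarios[0]))

/* Number of scenarios that print via write_atomic(). */
size_t tbl_count_write_atomic(void)
{
	size_t i, n = 0;

	for (i = 0; i < TBL_COUNT; i++) {
		assert(tbl_scenarios[i].id == (int)i + 1);
		if (tbl_scenarios[i].writer == TBL_WRITE_ATOMIC)
			n++;
	}
	return n;
}

/* Number of scenarios that rely on the trailing [*] flush. */
size_t tbl_count_trailing_flush(void)
{
	size_t i, n = 0;

	for (i = 0; i < TBL_COUNT; i++) {
		if (tbl_scenarios[i].trailing_flush)
			n++;
	}
	return n;
}
```

The two counters are just there to make the table checkable: three
scenarios print via write_atomic() and two (#1 and #4) depend on the
trailing flush.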


Full flushing is attempted in all 5 scenarios. A failed attempt means
there is a new owner, but that owner is also one of the 5 scenarios.

Am I missing something?

IMHO #3 is the only bizarre scenario. It exists only to cover when the
kthread is not available.

John