Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections

From: John Ogness
Date: Sun Oct 08 2023 - 06:13:35 EST

Next message: Guo Ren: "Re: [PATCH v2 2/2] riscv: errata: thead: use pa based instructions for CMO"
Previous message: Like Xu: "Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event"
Next in thread: Petr Mladek: "Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Petr,

On 2023-10-06, Petr Mladek <pmladek@xxxxxxxx> wrote:
>> During the demo at LPC2022 we had the situation that there was a large
>> backlog when a WARN was hit. With current mainline the first line of the
>> WARN is put into the ringbuffer and then the entire backlog is flushed
>> before storing the rest of the WARN into the ringbuffer. At the time it
>> was obvious that we should finish storing the WARN message and then
>> start flushing the backlog.
>
> This talks about the "emergency" context (WARN/OOPS/watchdog).
> The system might be in big troubles but it would still try to continue.
>
> Do we really want to defer the flush also for panic() context?

We can start flushing right after the backtrace is in the
ringbuffer. But flushing the backlog _before_ putting the backtrace into
the ringbuffer was not desired because if there is a large backlog, the
machine may not survive to get the backtrace out. And in that case it
won't even be in the ringbuffer to be used by other debugging
tools.

> I ask because I was not on LPC 2022 in person and I do not remember
> all details.

The LPC2022 demo/talk was recorded:

https://www.youtube.com/watch?v=TVhNcKQvzxI

At 55:55 is where the situation occurred and triggered the conversation,
ultimately leading to this new feature.

You may also want to reread my summary:

https://lore.kernel.org/lkml/875yheqh6v.fsf@xxxxxxxxxxxxxxxxxxxxx

as well as Linus' follow-up message:

https://lore.kernel.org/lkml/CAHk-=wieXPMGEm7E=Sz2utzZdW1d=9hJBwGYAaAipxnMXr0Hvg@xxxxxxxxxxxxxx

> But it is tricky in panic(), see 8th patch at
> https://lore.kernel.org/r/20230919230856.661435-9-john.ogness@xxxxxxxxxxxxx
>
> + nbcon_atomic_exit() is called only in one code path.

Correct. When panic() is complete and the machine goes into its infinite
loop. This is also the point where it will attempt an unsafe flush, if
it could not get the messages out yet.

> + nbcon_atomic_flush_all() is used in other paths. It looks like
> a "Whack a mole" game to me.

Several different outputs occur during panic(). The flush is everywhere
where something significant has been put into the ringbuffer and now it
would be OK to flush it.

> + messages are never emitted by printk kthread either because
> CPUs are stopped or the kthread is not allowed to get the lock

Correct.

> I see only one positive of the explicit flush. The consoles would
> not delay crash_exec() and the crash dump might be closer to
> the point where panic() was called.

It's only about getting the critical messages into the ringbuffer before
flushing. And since various things can go wrong during the many actions
within panic(), it makes sense to flush in between those actions.

> Otherwise I see only negatives => IMHO, we want to flush atomic
> consoles synchronously from printk() in panic().
>
> Does anyone really want explicit flushes in panic()?

So far you are the only one speaking against it. I expect as time goes
on it will get even more complex as it becomes tunable (also something
we talked about during the demo).

John

Next message: Guo Ren: "Re: [PATCH v2 2/2] riscv: errata: thead: use pa based instructions for CMO"
Previous message: Like Xu: "Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event"
Next in thread: Petr Mladek: "Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]