Re: [PATCH printk v1 11/13] printk: reimplement console_lock for proper kthread support

From: Petr Mladek
Date: Fri Mar 11 2022 - 05:26:42 EST


On Thu 2022-03-10 17:14:18, John Ogness wrote:
> On 2022-03-10, Petr Mladek <pmladek@xxxxxxxx> wrote:
> > console_unlock()
> > {
> > [...]
> > if (may_schedule)
> > retry = console_trylock_sched();
> > else
> > retry = console_trylock();
> > }
>
> I believe the main confusion comes from the function name I chose and
> the poor function description. Using your above code idea and changing
> to a more fitting name, I would suggest:
>
> console_unlock()
> {
> [...]
> if (may_schedule)
> retry = console_lock_reacquire();
> else
> retry = console_trylock();
> }
>
> This console_lock_reacquire() acquires the console lock the same way
> that console_lock() does it. The only reason we don't just use
> console_lock() is because we want to perform a try on @console_sem. But
> if we are successful, in the end, we have re-taken the console lock
> exactly as console_lock() did before: @console_sem locked, kthreads
> blocked by mutex.
>
> You say this creates deadlock potential, but I do not see how that could
> be. We are in the same context and locking the same way we did before.

Yeah, it looks safe when using to re-acquire the lock in console_unlock().

I thought about it as a generic API. If it exists, it might be used
anywhere. And it has pretty special behavior that one has to keep
in mind.


> But my primary concern is not the naming or workarounds or confusing
> APIs. So we should not let ourselves be diverted by that aspect.
>
> My primary concern is the technical difference when a schedulable
> context reacquires via atomic counter (which fails if any kthread is
> active) vs. reacquiring via mutex (which never fails).
>
> The reason for the reacquire is because (during direct printing) we see
> that a new record appeared and we need to make sure it gets printed
> (because other direct printers may have aborted, expecting us to print
> it).

I see. I missed this POV.


> This scenario is only interesting if kthread printers exist because
> otherwise @console_sem is enough to handle the direct printing.
>
> So the questions are:
>
> 1. Is it OK to assume the new record will be printed if any kthread is
> active? If yes, then it is enough to use the atomic counter.
>
> 2. Or, since we are responsible for direct printing, do we want to be
> certain that the record is printed by printing it ourselves? If yes,
> then we must block all the kthreads and perform the printing directly to
> all the consoles. This requires the mutex approach.
>
> IMHO #1 will relies heavily on kthreads waking up and printing (even
> though the printk caller requested direct printing), whereas #2 will
> cause direct printers to more actively print (possibly printing more
> than was requested).

OK, it means that the main problem here _is not_ the scheduling context,
console_lock() vs. console_trylock(). The main problem _is_ the direct
printing vs. the offload to kthreads.

Of course, the context is important. It affects how we could re-take
the lock. But the main problem is the printing mode. We must make sure
that:

1. someone is printing pending messages when the direct mode is needed

2. kthreads are woken and can enter the printing mode when the direct
mode is disabled.

Will console_trylock_sched()/console_trylock_reacquire() really
help here?

The API theoretically helps in direct mode when the lock was taken
via console_lock(). But it does not help when the lock was taken
via console_trylock() from printk(). It might mean that
the forward progress might not be guaranteed in the direct mode
(early boot, panic, ...).

Hmm, the forward progress seems to be guaranteed in the direct
mode most of the time. console_trylock() can take over
the atomic counter because console kthreads are not allowed
to enter the printing mode in this case.

I used "most of the time" because there might be races when
the mode is switched. "printk_direct" is an atomic variable.
CON_DIRECT is set under con->mutex but console_trylock()
does not take the mutex...

There are also races when the offload to consoles kthreads
is allowed. For example, console_trylock() might block
console_kthread_printing_tryenter().


Sigh, I am afraid that we have non-trivial problems
to guarantee that all messages will be printed:

+ races when switching between direct mode
and offload to kthreads. It might cause
stall in both modes.

+ console_trylock() races with
console_kthread_printing_tryenter().
It might put kthread into a sleep even when
it is supposed to print the message.


IMHO, console_trylock_sched() does not help much here.
We need to solve console_trylock() path anyway.

I think that the solution might be:

+ make sure that the state of "printk_direct" atomic variable
is enough to distinguish about the mode.

+ always wakeup() console kthreads after console_trylock()
to handle the possible race with
console_kthread_printing_tryenter()

I have to think more about it.

Does it make any sense, please?

Best Regards,
Petr