Re: [PATCH v4 2+/8] printk: nbcon: Add acquire/release logic

From: John Ogness
Date: Fri Sep 15 2023 - 13:02:55 EST


On 2023-09-14, Petr Mladek <pmladek@xxxxxxxx> wrote:
> // This patch includes my proposed changes on top of
> // https://lore.kernel.org/all/20230908185008.468566-3-john.ogness@xxxxxxxxxxxxx/

The changes in the code are fine. I agree with them and they pass all my
runtime tests. However, there are a couple places where I disagree with
your comments (or lack thereof) and would like to add some
clarification.

Following is your patched version, with my comments inline:

> static int nbcon_context_try_acquire_requested(struct nbcon_context *ctxt,
> struct nbcon_state *cur)
> {
> unsigned int cpu = smp_processor_id();
> struct console *con = ctxt->console;
> struct nbcon_state new;
>
> /* Note that the caller must still remove the request! */
> if (other_cpu_in_panic())
> return -EPERM;

You have no comments here. The following waiter check is an obvious
check to make. But what I think is not obvious, is that it will also
catch the situation when an unsafe hostile takeover has occurred. (My v4
had an explicit test for that case, but actually it is enough to check
the waiter.) I am adding comments here to mention that.

/*
* Note that the waiter will also change if there was an unsafe
* hostile takeover.
*/

> if (!nbcon_waiter_matches(cur, ctxt->prio))
> return -EPERM;
>
> /* If still locked, caller should continue waiting. */
> if (cur->prio != NBCON_PRIO_NONE)
> return -EBUSY;
>
> /*
> * The previous owner should have never released ownership
> * in an unsafe region. And this function should not be
> * called when the 'unsafe_takeover' is set.
> */

The above comment says this function should not be called when
@unsafe_takeover is set. But that can legally occur. While spinning
there is a udelay(1) and then the state is re-read. During the udelay()
an unsafe hostile takeover is allowed to occur. This is another reason
why I wanted to explicitly mention that in the comment above. And I am
removing the 2nd sentence of this comment.

> WARN_ON_ONCE(cur->unsafe);

[...]

> static int nbcon_context_try_acquire_handover(struct nbcon_context *ctxt,
> struct nbcon_state *cur)
> {
> unsigned int cpu = smp_processor_id();
> struct console *con = ctxt->console;
> struct nbcon_state new;
> int timeout;
> int request_err = -EBUSY;
>
> /*
> * Check that the handover is called when the direct acquire failed
> * with -EBUSY.
> */
> WARN_ON_ONCE(ctxt->prio <= cur->prio || ctxt->prio <= cur->req_prio);
> WARN_ON_ONCE(!cur->unsafe);
>
> /* Handover is not possible on the same CPU. */
> if (cur->cpu == cpu)
> return -EBUSY;
>
> /*
> * Console stays unsafe after an unsafe takeover until re-initialized.
> * Waiting is not going to help in this case.
> */
> if (cur->unsafe_takeover)
> return -EBUSY;
>
> /* Is the caller willing to wait? */
> if (ctxt->spinwait_max_us == 0)
> return -EBUSY;
>
> /*
> * Setup a request for the handover. The caller should try to acquire
> * the conosle directly when the current state has been modified.
> */
> new.atom = cur->atom;
> new.req_prio = ctxt->prio;
> if (!nbcon_state_try_cmpxchg(con, cur, &new))
> return -EAGAIN;
>
> cur->atom = new.atom;
>
> /* Wait until there is no owner and then acquire the console. */
> for (timeout = ctxt->spinwait_max_us; timeout >= 0; timeout--) {
> /* On successful acquire, this request is cleared. */
> request_err = nbcon_context_try_acquire_requested(ctxt, cur);
> if (!request_err)
> return 0;
>
> /*
> * If the acquire should be aborted, it must be ensured
> * that the request is removed before returning to caller.
> */
> if (request_err == -EPERM)
> break;
>
> udelay(1);
>
> /* Re-read the state because some time has passed. */
> nbcon_state_read(con, cur);
> }
>
> /* Timed out or aborted. Carefully remove handover request. */
> do {
> /* No need to remove request if there is a new waiter. */

For me it was not obvious why -EPERM should be returned here. I am
extending this comment to provide some more details that may not be
immediately obvious.

/*
* No need to remove request if there is a new waiter. This
* can only happen if a higher priority context has taken over
* the console or the handover request.
*/

> if (!nbcon_waiter_matches(cur, ctxt->prio))
> return -EPERM;

[...]

> static int nbcon_context_try_acquire_hostile(struct nbcon_context *ctxt,
> struct nbcon_state *cur)
> {
> unsigned int cpu = smp_processor_id();
> struct console *con = ctxt->console;
> struct nbcon_state new;
>
> if (!ctxt->allow_unsafe_takeover)
> return -EPERM;
>
> /*
> * Check that the system is going to die and lower priorities will
> * always be ignored. Using an unsafe consoles could break the running
> * system. Also switching back to lower priorities would create a race
> * in nbcon_waiter_matches().
> */

Three things about this comment. First, it is not just a "check" but
actually an "enforcement". The if-statement prevents the hostile
takeover from happening.

Second, it mentions that "lower priorities will always be ignored". I
was confused by this because lower priorities are _always_ ignored. Only
higher priorities can ever take over ownership.

Third, it mentions switching back to lower priorities would create races
in nbcon_waiter_matches(). This seems like a strange place to talk about
that race. The comments within nbcon_waiter_matches() talk about why
that race does not exist. The unsafe hostile takeover is not a part of
that explanation.

It should also be clear that after all pending records have been flushed
in panic, the panic CPU _will_ release the console. What will prevent
other CPUs from taking ownership after that is _not_ a PANIC-prio owner
of the console, but rather because it is a panic situation and other
CPUs are not the panic CPU.

I am just simplifying the comment to:

/* Ensure caller is allowed to perform unsafe hostile takeovers. */

> if (WARN_ON_ONCE(ctxt->prio != NBCON_PRIO_PANIC))
> return -EPERM;

John Ogness