Re: [BUG RT] dump-capture kernel not executed for panic in interrupt context

From: Joerg Vehlow
Date: Tue Sep 08 2020 - 01:48:37 EST


Hi Peter

On 9/7/2020 6:23 PM, peterz@xxxxxxxxxxxxx wrote:
>> According to the original comment in __crash_kexec, the mutex was used to
>> prevent a sys_kexec_load, while crash_kexec is executed. Your proposed patch
>> does not lock the mutex in crash_kexec.
> Sure, but any mutex taker will (spin) wait for panic_cpu==CPU_INVALID.
> And if the mutex is already held, we'll not run __crash_kexec() just
> like the trylock() would do today.
Yes you are right, it should work.

>> This does not cover the original use
>> case anymore. The only thing that is protected now are two panicking cores at
>> the same time.
> I'm not following. AFAICT it does exactly what the old code did.
> Although maybe I didn't replace all kexec_mutex users, I now see that
> thing isn't static.
Same thing here.

>> Actually, this implementation feels even more hacky to me....
> It's more minimal ;-) It's simpler in that it only provides the required
> semantics (as I understand them) and does not attempt to implement a
> more general trylock() like primitive that isn't needed.
Here I cannot agree with you. There is a second trylock in kernel_kexec that cannot
be protected using panic_cpu, but it could still use mutex_trylock and then check
panic_cpu. This should work, I guess:

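int kexec_trylock(void)
{
    /* Somebody else (load, unload, reboot) already holds the mutex. */
    if (!mutex_trylock(&kexec_mutex))
        return 0;

    /* Order taking the mutex before reading panic_cpu. */
    smp_mb();

    /* panic_cpu is an atomic_t; a panicking CPU owns the crash path. */
    if (atomic_read(&panic_cpu) != PANIC_CPU_INVALID) {
        mutex_unlock(&kexec_mutex);
        return 0;
    }

    return 1;
}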

Or am I missing something now? None of the functions protected by mutex_lock can run after
kexec_trylock has returned 1. crash_kexec will execute up to mutex_is_locked and then roll back.
The only thing that can go wrong now is this: kexec_trylock executes up to smp_mb. At the same
time crash_kexec executes the mutex_is_locked check, sees the mutex held, and backs off. Then,
before panic_cpu is reset, kexec_trylock executes the panic_cpu check and returns 0. Now neither
function got the lock and nothing is executed.
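
To make that interleaving concrete, here is a rough userspace model in C11. It is not kernel code: the atomic_bool standing in for kexec_mutex, the model_* helpers, crash_kexec_model and replay_both_back_off are all made up for illustration, and the crash side (claim panic_cpu, check mutex_is_locked, roll back) is only an assumption about how the proposed patch behaves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PANIC_CPU_INVALID (-1)

/* Hypothetical stand-ins for kexec_mutex and the kernel's panic_cpu. */
static atomic_bool kexec_locked = false;
static atomic_int panic_cpu = PANIC_CPU_INVALID;

static bool model_mutex_trylock(void)
{
    bool expected = false;
    return atomic_compare_exchange_strong(&kexec_locked, &expected, true);
}

static bool model_mutex_is_locked(void)
{
    return atomic_load(&kexec_locked);
}

static void model_mutex_unlock(void)
{
    assert(atomic_load(&kexec_locked)); /* unlocking a free mutex is a bug */
    atomic_store(&kexec_locked, false);
}

/* The helper from the mail, rewritten against the model primitives. */
static int kexec_trylock(void)
{
    if (!model_mutex_trylock())
        return 0;
    atomic_thread_fence(memory_order_seq_cst); /* plays the role of smp_mb() */
    if (atomic_load(&panic_cpu) != PANIC_CPU_INVALID) {
        model_mutex_unlock();
        return 0;
    }
    return 1;
}

/* Assumed crash path: returns 1 if __crash_kexec() would have been run. */
static int crash_kexec_model(int this_cpu)
{
    int old = PANIC_CPU_INVALID;
    int ran = 0;

    if (!atomic_compare_exchange_strong(&panic_cpu, &old, this_cpu))
        return 0;                       /* another CPU is already panicking */
    if (!model_mutex_is_locked())
        ran = 1;                        /* would call __crash_kexec() here */
    atomic_store(&panic_cpu, PANIC_CPU_INVALID); /* roll back */
    return ran;
}

/* Replays the racy interleaving on one thread.
 * Returns how many of the two paths were allowed to proceed. */
static int replay_both_back_off(int cpu)
{
    int old = PANIC_CPU_INVALID;
    int winners = 0;

    bool got_mutex = model_mutex_trylock();  /* kexec side, up to smp_mb() */
    atomic_compare_exchange_strong(&panic_cpu, &old, cpu); /* crash side */
    if (!model_mutex_is_locked())
        winners++;                           /* crash side would proceed */
    if (got_mutex) {
        if (atomic_load(&panic_cpu) == PANIC_CPU_INVALID)
            winners++;                       /* kexec side would proceed */
        model_mutex_unlock();                /* end of replay: drop the mutex */
    }
    atomic_store(&panic_cpu, PANIC_CPU_INVALID); /* crash side rolls back */
    return winners;
}
```

Starting from a clean state, replay_both_back_off(1) returns 0 in this model, which matches the "nothing is executed" outcome described above, while the uncontended calls to kexec_trylock() and crash_kexec_model() each succeed on their own.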

Does that sound right to you? If you have no further objections, I will post it here.

Jörg