Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
From: Thomas Gleixner
Date: Tue May 28 2024 - 18:12:53 EST
On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
> On 5/27/24 23:38, Tony W Wang-oc wrote:
> ...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>> index c96ae8fee95e..ecadd0698d6a 100644
>> --- a/arch/x86/kernel/hpet.c
>> +++ b/arch/x86/kernel/hpet.c
>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>> if (in_nmi())
>> return (u64)hpet_readl(HPET_COUNTER);
>>
>> + /*
>> + * Read HPET directly if panic in progress.
>> + */
>> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>> + return (u64)hpet_readl(HPET_COUNTER);
>> +
>
> There is literally one other piece of the code in the kernel doing
> something similar: the printk() implementation. There's no other
> clocksource or timekeeping code that does this on any architecture.
>
> Why doesn't this problem apply to any other clock sources?
I principle it applies to any clocksource which needs a spinlock to
serialize access. HPET is not the only insanity here.
Think about i8253 :)
Most real clocksources, like TSC and the majority of the preferred clock
sources on other architectures are perfectly fine. They just read and be
done.
> Why should the problem be fixed in the clock sources themselves? Why
> doesn't printk() deadlock on systems using the HPET?
Because regular printk()s are deferred to irq work when in NMI and
similar contexts, but that obviously does not apply to panic
situations. Also NMI is treated special even in the HPET code. See
below.
> In other words, I think we should fix pstore to be more like printk
> rather than hacking around this in each clock source.
pstore is perfectly fine. It uses a NMI safe time accessor function
which is then tripping over the HPET lock. That's really a HPET specific
problem.
Though what I read out of the changelog is that the MCE hits the same
CPU 'x' which holds the lock. But that's fairy tale material as you can
see in the patch above:
if (in_nmi())
return (u64)hpet_readl(HPET_COUNTER);
For that particular case the dead lock, which would actually be a live
lock, cannot happen because in kernel MCEs are NMI class exceptions and
therefore in_nmi() evaluates to true and that new voodoo can't be
reached at all.
Now there are two other scenarios which really can make that happen:
1) A non-NMI class exception within the lock held region
CPU A
acquire(hpet_lock);
... <- #PF, #GP, #DE, ... -> panic()
If any of that happens within that lock held section then the live
lock on the hpet_lock is the least of your worries. Seriously, I
don't care about this at all.
2) The actual scenario is:
CPU A CPU B
lock(hpet_lock)
MCE hits user space
...
exc_machine_check_user()
irqentry_enter_from_user_mode(regs);
irqentry_enter_from_user_mode() obviously does not mark the
exception as NMI class, so in_nmi() evaluates to false. That would
actually dead lock if CPU A is not making progress and releases
hpet_lock.
Sounds unlikely to happen, right? But in reality it can because of
MCE broadcast. Assume that both CPUs go into MCE:
CPU A CPU B
lock(hpet_lock)
exc_machine_check_user()
irqentry_enter_from_user_mode();
exc_machine_check_kernel() do_machine_check()
irqentry_nmi_enter(); mce_panic()
do_machine_check() if (atomic_inc_return(&mce_panicked) > 1)
mce_panic() wait_for_panic(); <- Not taken
if (atomic_inc_return(&mce_panicked) > 1)
wait_for_panic(); <- Taken
....
hpet_read()
-> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
obviously can't release the lock.
So the proposed patch makes sense to some extent. But it only cures the
symptom. The real underlying questions are:
1) Should we provide a panic mode read callback for clocksources which
are affected by this?
2) Is it correct to claim that a MCE which hits user space and ends up in
mce_panic() is still just a regular exception or should we upgrade to
NMI class context when we enter mce_panic() or even go as far to
upgrade to NMI class context for any panic() invocation?
#1 Solves it at the clocksource level. It still needs HPET specific
changes.
#2 Solves a whole class of issues
... while potentially introducing new ones :)
To me upgrading any panic() invocation to NMI class context makes a
lot of sense because in that case all bets are off.
in_nmi() is used in quite some places to avoid such problems. IOW,
that would kill a whole class of issues instead of "curing" the HPET
problem locally for the price of an extra conditional. Not that the
extra conditional matters much if HPET is the clocksource as that's
awfully slow anyway and I really don't care about that.
But I very much care about avoiding to sprinkle panic_cpu checks all
over the place.
Thanks,
tglx