Re: [RFC PATCH] KVM: arm64: Workaround for Ampere AC03_CPU_36 (exception taken to an incorrect EL)

From: D Scott Phillips
Date: Tue Jan 09 2024 - 15:29:25 EST


Marc Zyngier <maz@xxxxxxxxxx> writes:

> [reviewing both patches in one go, as it is way easier]
>
> On Fri, 05 Jan 2024 23:53:10 +0000,
> Oliver Upton <oliver.upton@xxxxxxxxx> wrote:
>>
>> Hi Ilkka,
>>
>> On Fri, Jan 05, 2024 at 01:32:51PM -0800, Ilkka Koskinen wrote:
>> > Due to erratum AC03_CPU_36 on AmpereOne, if an Asynchronous Exception
>> > (interrupts or SErrors) occurs to EL2, while EL2 software is modifying
>> > system register bits that control EL2 exception behavior, the processor
>> > may take an exception to an incorrect Exception Level.
>
> What needs to be described (both in the commit message and as part of
> the code) is under what circumstances this mis-routing happens.
>
> Is it that just clearing TGE while being at EL2 always results in the
> asynchronous exception being routed to the wrong exception level? Or
> is it a more subtle issue related to synchronisation?
>
> Also worth describing is to which other exception level is the
> exception delivered? EL1? EL3?
>
>> >
>> > The affected system registers are HCR_EL2 and SCTLR_EL2, which contain
>> > control bits for routing and enabling of EL2 exceptions.
>
> How does SCTLR_EL2 affects interrupt delivery? Is this related to
> FEAT_NMI and SCTLR_EL2.{NMI,SPINTMASK}? Because this is the only part
> of this register that has anything to do with interrupts.
>
>> >
>> > The issue is triggered when HGE.TGE bit is cleared while having
>> > AMO/IMO/FMO bits cleared too. To avoid the exception getting taken
>> > at a wrong Exception Level, we set AMO/IMO/FMO.
>>
>> We toggle HCR_EL2 for other things besides TLB invalidations, and the
>> changelog does not describe why they're apparently unaffected.
>>
>> > Suggested-by: D Scott Phillips <scott@xxxxxxxxxxxxxxxxxxxxxx>
>> > Signed-off-by: Ilkka Koskinen <ilkka@xxxxxxxxxxxxxxxxxxxxxx>
>>
>> This isn't an acceptable way to go about errata mitigations. Besides
>> extremely unusual circumstances, the pattern is to use a cpucap &&
>> alternatives to only enable the workaround on affected designs. We then
>> document the errata in the expected places (Kconfig and kernel
>> documentation) such that the folks saddled with maintaining this stuff
>> know how to handle it years down the line.
>
> +1. This hack will have to live forever, while the lack of
> documentation makes it totally unmaintainable. The KVM code *will*
> change in ways that cannot be anticipated today, and without
> exhaustive documentation, we will not be able to do a good job at
> maintaining this system alive by correctly mitigating the erratum.
>
>>
>> > ---
>> > arch/arm64/kvm/hyp/vhe/tlb.c | 12 +++++++++---
>> > 1 file changed, 9 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
>> > index b32e2940df7d..c72fdd2e4549 100644
>> > --- a/arch/arm64/kvm/hyp/vhe/tlb.c
>> > +++ b/arch/arm64/kvm/hyp/vhe/tlb.c
>> > @@ -61,9 +61,15 @@ static void __tlb_switch_to_guest(struct kvm_s2_mmu *mmu,
>> > * has an ISB in order to deal with this.
>> > */
>> > __load_stage2(mmu, mmu->arch);
>> > - val = read_sysreg(hcr_el2);
>> > - val &= ~HCR_TGE;
>> > - write_sysreg(val, hcr_el2);
>> > +
>> > + /*
>> > + * With {E2H,TGE} == {1,0}, IMO == 1 is required so that IRQs are not
>> > + * all masked.
>>
>> Huh? HCR_EL2.IMO affects the *routing* of IRQs at exception levels
>> *lower than* EL2.
>
> Yup, and there is *zero* requirement for IMO to have any particular
> value while running at EL2. As long as you're at EL2, physical
> interrupts that are not targeting EL3 are taken at EL2, full stop.

What's meant here is that HCR_EL2.{E2H,TGE,IMO} == {1,0,0} is a "C"
entry for the IRQ target at EL2 in the big table of target ELs under
"Establishing the target Exception level of an asynchronous exception",
meaning that no IRQs are taken, regardless of PSTATE.I.
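
To make the table reading concrete, here is a rough userspace model of
the rows in question. It is only a sketch: it assumes the IRQ is not
routed to EL3 (SCR_EL3.IRQ == 0), it ignores E2H, and the names
route_phys_irq() and enum irq_outcome are made up for illustration and
do not exist in the kernel.

```c
#include <stdbool.h>

/* Hypothetical names; a simplified model, not kernel code. */
enum irq_outcome {
	IRQ_TAKEN_TO_EL1,
	IRQ_TAKEN_TO_EL2,
	IRQ_NOT_TAKEN,	/* the "C" entry: not taken, regardless of PSTATE.I */
};

/*
 * Outcome for a physical IRQ arriving while executing at cur_el (0..2),
 * assuming SCR_EL3.IRQ == 0. With TGE == 1 the EL1 case does not exist
 * architecturally; the model does not try to reject it.
 */
static enum irq_outcome route_phys_irq(int cur_el, bool tge, bool imo)
{
	if (tge || imo)
		return IRQ_TAKEN_TO_EL2;	/* still subject to PSTATE.I at EL2 */
	if (cur_el == 2)
		return IRQ_NOT_TAKEN;		/* target is EL1, we are above it */
	return IRQ_TAKEN_TO_EL1;
}
```

In this model, clearing TGE while IMO is clear at EL2 lands on the
IRQ_NOT_TAKEN case, while clearing TGE with IMO set keeps IRQs targeted
at EL2.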

It doesn't seem like the intent of the tlb flush code was to also mask
IRQs via HCR, which is why Ilkka proposed the change to also set IMO.

With that changed, no other places needed the IRQ masking workaround,
which is why Ilkka added a comment instead of a cpucap.

Agreed that this leaves open the possibility of future code running
afoul of the erratum. So I'd suggest that we take Ilkka's change to set
IMO here, which removes the incidental hard IRQ masking, but drop any
connection to the erratum. Then, for the erratum itself, writes to
hcr_el2 on AmpereOne would become:

	mrs	tmp, daif
	msr	daifset, #0xf
	msr	hcr_el2, value
	isb
	msr	daif, tmp