Re: Relax CPU features sanity checking on heterogeneous architectures

From: Jeremy Linton
Date: Thu Oct 17 2019 - 17:40:02 EST


Hi,

On 10/11/19 8:54 AM, Mark Rutland wrote:
On Fri, Oct 11, 2019 at 02:33:43PM +0100, Marc Zyngier wrote:
On Fri, 11 Oct 2019 11:50:11 +0100
Mark Rutland <mark.rutland@xxxxxxx> wrote:

Hi,

On Fri, Oct 11, 2019 at 11:19:00AM +0530, Sai Prakash Ranjan wrote:
On latest QCOM SoCs like SM8150 and SC7180 with big.LITTLE arch, below
warnings are observed during bootup of big cpu cores.

For reference, which CPUs are in those SoCs?

SM8150:

[ 0.271177] CPU features: SANITY CHECK: Unexpected variation in
SYS_ID_AA64PFR0_EL1. Boot CPU: 0x00000011112222, CPU4: 0x00000011111112

The differing fields are EL3, EL2, and EL1: the boot CPU supports
AArch64 and AArch32 at those exception levels, while the secondary only
supports AArch64.
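
To make the field comparison concrete, here is a minimal sketch (not kernel code) that decodes the four exception-level fields of ID_AA64PFR0_EL1 and reports which of them differ between two CPUs. The helper names id_field() and pfr0_el_diff() are made up for illustration. The sketch relies only on the architectural layout: ELn occupies bits [4n+3:4n], and a field value of 1 means AArch64 only while 2 means AArch64 and AArch32.

```c
#include <stdint.h>

/* Bit offsets of the exception-level fields in ID_AA64PFR0_EL1 (4 bits each). */
enum pfr0_shift { PFR0_EL0 = 0, PFR0_EL1 = 4, PFR0_EL2 = 8, PFR0_EL3 = 12 };

/* Extract one unsigned 4-bit ID register field starting at bit `shift`. */
static inline unsigned int id_field(uint64_t reg, unsigned int shift)
{
	return (unsigned int)((reg >> shift) & 0xf);
}

/*
 * Compare the EL0..EL3 fields of two ID_AA64PFR0_EL1 values.
 * Returns a bitmask in which bit n is set if the ELn field differs.
 */
static inline unsigned int pfr0_el_diff(uint64_t a, uint64_t b)
{
	unsigned int mask = 0;

	for (unsigned int el = 0; el < 4; el++) {
		if (id_field(a, 4 * el) != id_field(b, 4 * el))
			mask |= 1u << el;
	}
	return mask;
}
```

Fed the two values from the SM8150 log (0x11112222 for the boot CPU, 0x11111112 for CPU4), pfr0_el_diff() sets bits 1, 2 and 3: EL1, EL2 and EL3 differ, while EL0 reads 2 on both CPUs.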

Do we handle this variation in KVM?

We do, at least at vcpu creation time (see kvm_reset_vcpu). But if one
of the !AArch32 CPUs comes in late in the game (after we've started a
guest), all bets are off (we'll schedule the 32bit guest on that CPU,
enter the guest, immediately take an Illegal Exception Return, and
return to userspace with KVM_EXIT_FAIL_ENTRY).

Ouch. We certainly can't remove the warning until we deal with that
somehow, then.

Not sure we could do better, given the HW. My preference would be to
fail these CPUs if they aren't present at boot time.

I agree; I think we need logic to check the ID register fields against
their EXACT, {LOWER,HIGHER}_SAFE, etc rules regardless of whether we
have an associated cap. That can then abort a late onlining of a CPU
which violates those rules w.r.t. the finalised system value.
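
A rough sketch of what such a per-field check could look like, under stated assumptions: the enum and function names below are hypothetical and only modelled on the EXACT / {LOWER,HIGHER}_SAFE rule names mentioned above, fields are treated as unsigned, and the kernel's real feature tables carry more detail (signedness, strictness, explicit safe values) than this models. The idea is that a late CPU is acceptable for a field only if folding its value in would leave the finalised system value unchanged.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rule names, modelled on the rules named in the thread. */
enum ftr_rule { RULE_EXACT, RULE_LOWER_SAFE, RULE_HIGHER_SAFE };

/*
 * Return true if a late CPU whose field reads `cpu_val` can join a system
 * whose finalised value for that field is `sys_val`.
 *
 *  EXACT:       the values must match.
 *  LOWER_SAFE:  the system settled on the minimum, so the late CPU must
 *               not read lower than the system value.
 *  HIGHER_SAFE: the system settled on the maximum, so the late CPU must
 *               not read higher than the system value.
 */
static bool late_cpu_field_ok(enum ftr_rule rule, uint64_t cpu_val,
			      uint64_t sys_val)
{
	switch (rule) {
	case RULE_EXACT:
		return cpu_val == sys_val;
	case RULE_LOWER_SAFE:
		return cpu_val >= sys_val;
	case RULE_HIGHER_SAFE:
		return cpu_val <= sys_val;
	}
	return false; /* unknown rule: be conservative */
}
```

For example, if the EL1 field were treated as lower-safe and the system had finalised on 2 (AArch64 and AArch32), an AArch64-only CPU reading 1 would be rejected when onlined late.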

Except one of the cases is the user who doesn't care about AArch32 at EL2/EL1 and just wants to add another core to their 64-bit "clean" OS.

So my $.02 is that the online should only fail if someone has actually started a 32-bit guest on the machine.
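
As a purely illustrative sketch of that policy (the struct, its flag, and the function name are all hypothetical, not existing kernel interfaces):

```c
#include <stdbool.h>

/* Hypothetical system-wide state; not a real kernel structure. */
struct sys_state {
	/* Assumed to be set once any 32-bit guest has been started. */
	bool aarch32_guest_started;
};

/*
 * Decide whether a late CPU may come online.
 * A CPU that supports AArch32 at EL1 is always acceptable here; one that
 * lacks it is refused only if a 32-bit guest already exists.
 */
static bool allow_late_online(const struct sys_state *sys,
			      bool cpu_has_aarch32_el1)
{
	if (cpu_has_aarch32_el1)
		return true;
	return !sys->aarch32_guest_started;
}
```

With something like this, a 64-bit-only setup can keep adding cores freely, and the failure is confined to the case where 32-bit guests are really in use.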


I suspect that we may want to split the notion of
safe-for-{user,kernel-guest} in the feature tables, as if nothing else
it will force us to consider those cases separately when adding new
stuff.

As I'm sure everyone knows, this is all going to happen again with EL0 support. I wonder if some of this more "advanced" functionality should be buried behind EXPERT. At least on ACPI it's possible to tell at early boot if the machine is heterogeneous (not necessarily in which ways) and just automatically sanitize away 32-bit support and some of the stickier things when a heterogeneous machine is detected.