Re: v6.13-rc1: Internal error: Oops - Undefined instruction: 0000000002000000 [#1] SMP

From: Marc Zyngier
Date: Wed Dec 04 2024 - 14:17:42 EST


On Wed, 04 Dec 2024 18:34:53 +0000,
Vitaly Chikunov <vt@xxxxxxxxxxxx> wrote:
>
> Marc,
>
> On Wed, Dec 04, 2024 at 08:51:26AM +0000, Marc Zyngier wrote:
> > On Tue, 03 Dec 2024 22:14:53 +0000,
> > Vitaly Chikunov <vt@xxxxxxxxxxxx> wrote:
> > >
> > > Shameer, Marc, Oliver, Will,
> > >
> > > On Tue, Dec 03, 2024 at 10:03:11AM +0000, Shameerali Kolothum Thodi wrote:
> > > > > -----Original Message-----
> > > > > From: linux-arm-kernel <linux-arm-kernel-bounces@xxxxxxxxxxxxxxxxxxx> On
> > > > > Behalf Of Vitaly Chikunov
> > > > > Sent: Tuesday, December 3, 2024 9:27 AM
> > > > > To: Marc Zyngier <maz@xxxxxxxxxx>
> > > > > Cc: Will Deacon <will@xxxxxxxxxx>; james.morse@xxxxxxx; linux-arm-
> > > > > kernel@xxxxxxxxxxxxxxxxxxx; Catalin Marinas <catalin.marinas@xxxxxxx>;
> > > > > linux-kernel@xxxxxxxxxxxxxxx; oliver.upton@xxxxxxxxx;
> > > > > mark.rutland@xxxxxxx
> > > > > Subject: Re: v6.13-rc1: Internal error: Oops - Undefined instruction:
> > > > > 0000000002000000 [#1] SMP
> > > > >
> > > > > Marc,
> > > > >
> > > > > On Tue, Dec 03, 2024 at 01:31:19AM +0300, Vitaly Chikunov wrote:
> > > > > > On Mon, Dec 02, 2024 at 04:07:03PM +0000, Marc Zyngier wrote:
> > > > > > > On Mon, 02 Dec 2024 15:59:40 +0000,
> > > > > > > Vitaly Chikunov <vt@xxxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Marc,
> > > > > > > >
> > > > > > > > On Mon, Dec 02, 2024 at 03:53:59PM +0000, Marc Zyngier wrote:
> > > > > > > > >
> > > > > > > > > What the log doesn't say is what the host is. Is it 6.13-rc1 as well?
> > > > > > > >
> > > > > > > > No, host is 6.6.60.
> > > > > > >
> > > > > > > Right. I wouldn't be surprised if:
> > > > > > >
> > > > > > > - this v6.6 kernel doesn't hide the MPAM feature as it should (and
> > > > > > > > that's probably something we should backport)
> > > > > >
> > > > > > How to confirm this? Currently I cannot find any (case-insensitive)
> > > > > > "MPAM" files in /sys, nor mpam string in /proc/cpuinfo, nor MPAM
> > > > > > strings in `strace -v` (as it decodes some KVM ioctls) of qemu process.
> > > > > >
> > > > > > >
> > > > > > > - you get a nastygram in the host log telling you that the guest has
> > > > > > > executed something it shouldn't (you'll get the encoding of the
> > > > > > > instruction)
> > > > > >
> > > > > > > I asked the admins of the box for the dmesg output, since I don't have
> > > > > > > root access myself and nowadays dmesg is not accessible to regular users.
> > > > >
> > > > > This is what they reported:
> > > > >
> > > > > > kvm [2502822]: Unsupported guest sys_reg access at: ffff80008003e9f0 [000000c5]
> > > > > { Op0( 3), Op1( 0), CRn(10), CRm( 4), Op2( 4), func_read },
> > > > >
> > > >
> > > > As Will pointed out I think this is access to MPAMIDR_EL1 and is from this
> > > > code here,
> > > >
> > > > +++ b/arch/arm64/kernel/cpuinfo.c
> > > > @@ -478,6 +478,9 @@ static void __cpuinfo_store_cpu(struct cpuinfo_arm64 *info)
> > > >  	if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0))
> > > >  		__cpuinfo_store_cpu_32bit(&info->aarch32);
> > > >
> > > > +	if (id_aa64pfr0_mpam(info->reg_id_aa64pfr0))
> > > > +		info->reg_mpamidr = read_cpuid(MPAMIDR_EL1);
> > > > +
> > > >  	cpuinfo_detect_icache_policy(info);
> > > >  }
> > > >
> > > > I did manage to boot my setup in 6.6 and this is what happens,
> > > >
> > > > Host kernel 6.6
> > > > Guest Kernel 6.13-rc1
> > > >
> > > > [ 0.195392] smp: Brought up 1 node, 8 CPUs
> > > > [ 0.219000] SMP: Total of 8 processors activated.
> > > > [ 0.219629] CPU: All CPU(s) started at EL1
> > > > ...
> > > > [ 0.223212] CPU features: detected: RAS Extension Support
> > > > [ 0.223927] CPU features: detected: Memory Partitioning And Monitoring
> > > > [ 0.224796] CPU features: detected: Memory Partitioning And Monitoring Virtualisation
> > > > [ 0.225961] alternatives: applying system-wide alternatives
> > > > ...
> > > >
> > > > Guest detects MPAM and boots fine.
> > > >
> > > > Host kernel 6.13-rc1
> > > > Guest Kernel 6.13-rc1
> > > >
> > > > [ 0.196625] smp: Brought up 1 node, 8 CPUs
> > > > [ 0.222093] SMP: Total of 8 processors activated.
> > > > [ 0.222769] CPU: All CPU(s) started at EL1
> > > > ...
> > > > [ 0.226620] CPU features: detected: RAS Extension Support
> > > > [ 0.227453] alternatives: applying system-wide alternatives
> > > >
> > > > MPAM is not visible to Guest in this case.
> > > >
> > > > So as I pointed out earlier could it be a case where the ID register reports MPAM support
> > > > but the firmware has not enabled MPAM?
> > > >
> > > > James seems to be mentioning that case here,
> > > >
> > > > " (If you have a boot failure that bisects here its likely your CPUs
> > > > advertise MPAM in the id registers, but firmware failed to either enable
> > > > MPAM, or emulate the trap as if it were disabled)"
> > >
> > > I tried to verify that MPAM is advertised with qemu+gdb method, as
> > > suggested by Oliver, but ID_AA64PFR0_EL1 register is not there.
> > >
> > > (gdb) i r ID_AA64PFR0_EL1
> > > Invalid register `ID_AA64PFR0_EL1'
> >
> > Then there is a bug in either QEMU or the GDB stubs. This register
> > exists, or you wouldn't be here.
>
>
> In case this is useful:
>
> builder@aarch64:/.in$ qemu-system-aarch64 --version
> QEMU emulator version 9.1.1 (qemu-9.1.1-alt2)
> Copyright (c) 2003-2024 Fabrice Bellard and the QEMU Project developers
> builder@aarch64:/.in$ gdb --version
> GNU gdb (GDB) 14.1.0.56.d739d4fd457-alt1 (ALT Sisyphus)
> Copyright (C) 2023 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
>
> Is there a way to get the content of this register despite these possible
> gdb/qemu bugs?

I have no idea. And frankly, I don't think this matters.

> Perhaps, we can add some debugging print in guest kernel.
>
> > >
> > > Are there other suggestions?
> >
> > Mark has described what the problem is likely to be. 6.6-stable needs
> > to have 6685f5d572c22e10 backported, and it probably should have been
> > Cc: to stable. Can you please apply the following patch to your *host*
> > machine and retest?
>
> Unfortunately I cannot. But I can apply patches to the guest kernel. [I
> will try to convince the admins of the server to apply the patch, though
> this can take time, and they can refuse since this is a production build
> server and its update procedure is complicated.]

Then I really cannot help you. I'm not going to paper over a
hypervisor bug in the guest kernel, and if you/they are happy to run
with critical bugs in your production machine, that's about it then.

M.

--
Without deviation from the norm, progress is not possible.
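
For readers following the thread, here is a minimal C sketch of how the
numbers quoted above can be decoded by hand. It is an illustration only, not
kernel code: the struct and helper names are hypothetical. It assumes the
architectural layouts as I understand them: ESR_ELx.EC in bits [31:26],
ESR_ELx.IL in bit 25, and ID_AA64PFR0_EL1.MPAM in bits [43:40].

```c
#include <stdint.h>
#include <stdio.h>

/* Fields of an AArch64 system register encoding, in the order KVM prints
 * them: { Op0, Op1, CRn, CRm, Op2 }. Hypothetical type for illustration. */
struct sysreg_enc {
	unsigned int op0, op1, crn, crm, op2;
};

/* Format the generic S<op0>_<op1>_C<n>_C<m>_<op2> name of a register.
 * Returns what snprintf() returns. */
static int sysreg_generic_name(const struct sysreg_enc *e, char *buf, size_t len)
{
	return snprintf(buf, len, "S%u_%u_C%u_C%u_%u",
			e->op0, e->op1, e->crn, e->crm, e->op2);
}

/* Exception class: ESR_ELx bits [31:26]. 0 means "unknown reason". */
static unsigned int esr_ec(uint64_t esr)
{
	return (unsigned int)((esr >> 26) & 0x3f);
}

/* Instruction length bit: ESR_ELx bit 25 (1 for a 32-bit instruction). */
static unsigned int esr_il(uint64_t esr)
{
	return (unsigned int)((esr >> 25) & 0x1);
}

/* MPAM field of ID_AA64PFR0_EL1: bits [43:40]. Non-zero means the CPU
 * advertises MPAM to whoever reads the ID register. */
static unsigned int id_aa64pfr0_mpam_field(uint64_t pfr0)
{
	return (unsigned int)((pfr0 >> 40) & 0xf);
}
```

Feeding { 3, 0, 10, 4, 4 } from the host log to sysreg_generic_name() gives
S3_0_C10_C4_4, which is the encoding of MPAMIDR_EL1. For the value in the
subject line, esr_ec(0x02000000) is 0 and esr_il(0x02000000) is 1, which is
consistent with the guest reporting an undefined instruction.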