Re: supervisor write access in kernel mode in __pv_queued_spin_unlock_slowpath
From: Maxim Levitsky
Date: Sun Jan 01 2023 - 06:09:13 EST
On Sun, 2023-01-01 at 16:37 +0900, Hyeonggon Yoo wrote:
> On Sun, Jan 01, 2023 at 03:50:28PM +0900, Hyeonggon Yoo wrote:
> > On Sat, Dec 31, 2022 at 11:26:25PM +0800, kernel test robot wrote:
> > > Greeting,
> > >
> > > FYI, we noticed kernel_BUG_at_include/linux/mm.h due to commit (built with gcc-11):
> > >
> > > commit: 0af8489b0216fa1dd83e264bef8063f2632633d7 ("mm, slub: remove percpu slabs with CONFIG_SLUB_TINY")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > [test failed on linux-next/master c76083fac3bae1a87ae3d005b5cb1cbc761e31d5]
> > >
> > > in testcase: rcutorture
> > > version:
> > > with following parameters:
> > >
> > > runtime: 300s
> > > test: default
> > > torture_type: tasks-tracing
> > >
> > > test-description: rcutorture is rcutorture kernel module load/unload test.
> > > test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt
> > >
> > >
> > > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > >
> > > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> > >
> > >
> > > If you fix the issue, kindly add following tag
> > > > Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > > > Link: https://lore.kernel.org/oe-lkp/202212312021.bc1efe86-oliver.sang@xxxxxxxxx
> >
> > <snip>
> >
> > >
> > > To reproduce:
> > >
> > > # build kernel
> > > cd linux
> > > cp config-6.1.0-rc2-00014-g0af8489b0216 .config
> > > make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
> > > make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
> > > cd <mod-install-dir>
> > > find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
> > >
> > >
> > > git clone https://github.com/intel/lkp-tests.git
> > > cd lkp-tests
> > > bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
> > >
> > > # if come across any failure that blocks the test,
> > > # please remove ~/.lkp and /lkp dir to run from a clean state.
> >
> > I was unable to reproduce in the same way as described above
> > because some files referenced in job-script couldn't be downloaded from
> > download.01.org/0day :(
> >
> > So I just built rcutorture module as builtin
> > and I got weird spinlock bug on commit: 0af8489b0216
> > ("mm, slub: remove percpu slabs with CONFIG_SLUB_TINY")
>
> (+Cc KVM/Paravirt experts)
>
> > full dmesg added as attachment
> >
> > [ 1387.564837][ T57] BUG: unable to handle page fault for address: c108f5f4
> > [ 1387.566649][ T57] #PF: supervisor write access in kernel mode
> > [ 1387.567965][ T57] #PF: error_code(0x0003) - permissions violation
> > [ 1387.569439][ T57] *pde = 010001e1
> > [ 1387.570276][ T57] Oops: 0003 [#1] SMP
> > [ 1387.571149][ T57] CPU: 2 PID: 57 Comm: rcu_torture_rea Tainted: G S 6.1.0-rc2-00010-g0af8489b0216 #2130 63d19ac2b985fca570c354d8750f489755de37ed
> > [ 1387.574673][ T57] EIP: kvm_kick_cpu+0x54/0x90
> > [ 1387.575802][ T57] Code: 2f c5 01 8b 04 9d e0 d4 4e c4 83 15 14 7b 2f c5 00 83 05 08 6d 2f c5 01 0f b7 0c 30 b8 05 00 00 00 83 15 0c 6d 2f c5 00 31 db <0f> 01 c1 83 05 10 6d 2f c5 01 8b 5d f8 8b 75 fc 83 15 14 6d 2f c5
^^^^^^
Yes this is the unfamous hypercall patching bug....
> >
So what is happening is that Intel and AMD has a *slightly* different instruction reserved for hypercalls
(paravirt calls from guest to host hypervisor).
KVM developers made a mistake to be 'nice' to the guests and if the guest uses the wrong hypercall instruction
the KVM attempts to rewrite it with the right instruction.
That can fail, because to avoid security issues, KVM uses the exact same security context as the instruction itself
(it is as if the instruction was defined such as it overwrote itself)
This means that is the guest memory is marked read-only in the guest paging, then the write will fail and #PF
will happen on the wrong hypercall instruction.
Here we have the Intel's instruction (VMCALL, 0f 01 C1), and the host machine is likely AMD which uses VMMCALL instruction
which is (0F 01 D9)
Now any recent Linux guest is supposed to use a right instruction using the alternatives mechanism, but it can if
the hypervisor passes 'non native' vendor id, like GenunineIntel on AMD machine.
In my testing using named CPU models like you do '-cpu SandyBridge' still passes through host vendor ID (that is the guest
will see Intel's cpu but with vendor='AutheticAMD') but nobody confirmed me that this is a bug or a feature and I am not
sure if older qemu versions also did this.
Assuming that your host machine is AMD,
your best bet to check if my theory is right is to boot the guest without triggering the bug,
and check in /proc/cpuinfo if the vendor string is 'GenuineIntel'
Best regards,
Maxim Levitsky
[ 1387.580456][ T57] EAX: 00000005 EBX: 00000000 ECX: 00000003 EDX: c108f5a0
> > [ 1387.582071][ T57] ESI: c5153580 EDI: 00000046 EBP: c69cddf8 ESP: c69cddf0
> > [ 1387.583775][ T57] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010046
> > [ 1387.585643][ T57] CR0: 80050033 CR2: c108f5f4 CR3: 0776b000 CR4: 00350e90
> > [ 1387.587492][ T57] Call Trace:
> > [ 1387.588365][ T57] __pv_queued_spin_unlock_slowpath+0x66/0x110
> > [ 1387.589898][ T57] __pv_queued_spin_unlock+0x4b/0x60
> > [ 1387.591040][ T57] __raw_callee_save___pv_queued_spin_unlock+0x9/0x10
> > [ 1387.592771][ T57] do_raw_spin_unlock+0x49/0xa0
> > [ 1387.593805][ T57] _raw_spin_unlock_irqrestore+0x53/0xd0
> > [ 1387.594927][ T57] swake_up_one+0x4f/0x70
> > [ 1387.595739][ T57] __rcu_report_exp_rnp+0x26b/0x470
> > [ 1387.596730][ T57] rcu_report_exp_cpu_mult+0x82/0x2f0
> > [ 1387.597770][ T57] rcu_qs+0xac/0x160
> > [ 1387.598503][ T57] rcu_note_context_switch+0x31/0x1e0
> > [ 1387.599460][ T57] __schedule+0xc5/0x770
> > [ 1387.600195][ T57] __cond_resched+0x7a/0x100
> > [ 1387.600996][ T57] stutter_wait+0x9e/0x2c0
> > [ 1387.601956][ T57] rcu_torture_reader+0x162/0x3e0
> > [ 1387.603048][ T57] ? rcu_torture_reader+0x3e0/0x3e0
> > [ 1387.604269][ T57] ? __kthread_parkme+0xab/0xf0
> > [ 1387.605420][ T57] kthread+0x167/0x1d0
> > [ 1387.606383][ T57] ? rcu_torture_read_exit_child+0xa0/0xa0
> > [ 1387.607516][ T57] ? kthread_exit+0x50/0x50
> > [ 1387.608517][ T57] ret_from_fork+0x19/0x24
> > [ 1387.609548][ T57] Modules linked in:
> > [ 1387.610187][ T57] CR2: 00000000c108f5f4
> > [ 1387.610873][ T57] ---[ end trace 0000000000000000 ]---
> > [ 1387.611829][ T57] EIP: kvm_kick_cpu+0x54/0x90
> > [ 1387.612653][ T57] Code: 2f c5 01 8b 04 9d e0 d4 4e c4 83 15 14 7b 2f c5 00 83 05 08 6d 2f c5 01 0f b7 0c 30 b8 05 00 00 00 83 15 0c 6d 2f c5 00 31 db <0f> 01 c1 83 05 10 6d 2f c5 01 8b 5d f8 8b 75 fc 83 15 14 6d 2f c5
> > [ 1387.616715][ T57] EAX: 00000005 EBX: 00000000 ECX: 00000003 EDX: c108f5a0
> > [ 1387.618242][ T57] ESI: c5153580 EDI: 00000046 EBP: c69cddf8 ESP: c69cddf0
> > [ 1387.619912][ T57] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010046
> > [ 1387.621666][ T57] CR0: 80050033 CR2: c108f5f4 CR3: 0776b000 CR4: 00350e90
> > [ 1387.623128][ T57] Kernel panic - not syncing: Fatal exception
> > [ 1389.285045][ T57] Shutting down cpus with NMI
> > [ 1389.297949][ T57] Kernel Offset: disabled
> > [ 1389.299174][ T57] ---[ end Kernel panic - not syncing: Fatal exception ]---