RE: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate

From: Yao, Yuan
Date: Wed Oct 12 2022 - 23:35:47 EST


>-----Original Message-----
>From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>Sent: Wednesday, October 12, 2022 06:24
>To: linux-kernel@xxxxxxxxxxxxxxx
>Cc: Bae, Chang Seok <chang.seok.bae@xxxxxxxxx>; x86@xxxxxxxxxx; Yao, Yuan <yuan.yao@xxxxxxxxx>; Hansen, Dave
><dave.hansen@xxxxxxxxx>; Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>Subject: [PATCH] x86/fpu: Remove dynamic features from xcomp_bv for init_fpstate
>
>From: Yuan Yao <yuan.yao@xxxxxxxxx>
>
>This was found a couple of months ago in a big old AMX
>backport. But, it looks to be a problem in mainline too.
>Please let me know if this looks OK. I'd also especially
>appreciate some testing from folks that have AMX hardware
>handy.
>
>Builds and survives a quick boot test on non-AMX hardware.
>
>--
>
>== Background ==
>
>'init_fpstate' is a sort of template for all of the fpstates
>that come after it. It is copied when new processes are
>execve()'d or XRSTOR'd to get fpregs into a known state.
>
>That means that it represents the *starting* state for a
>process's fpstate which includes only the 'default' features.
>Dynamic features can, of course, be acquired later, but
>processes start with only default_features.
>
>During boot the kernel decides whether all fpstates will be
>kept in the compacted or uncompacted format. This choice is
>communicated to the hardware via the XCOMP_BV field in all
>XSAVE buffers, including 'init_fpstate'.
>
>== Problem ==
>
>But, the existing XCOMP_BV calculation is incorrect. It uses
>the set of 'max_features', not the default features.
>
>As a result, when XRSTOR operates on buffers derived from
>'init_fpstate', it may attempt to "tickle" dynamic features which
>are at offsets for which there is no space allocated in the
>fpstate.
>
>== Scope ==
>
>This normally results in a relatively harmless out-of-bounds
>memory read. It's harmless because it never gets consumed. But,
>if the fpstate is next to some unmapped memory, this "tickle" can
>cause a page fault and an oops.
>
>This only causes issues on systems when dynamic features are
>available and when an XSAVE buffer is next to uninitialized
>memory. In other words, it only affects recent Intel server
>CPUs, and in relatively few memory locations.
>
>Those two things are why it took relatively long to catch this.
>
>== Solution ==
>
>Use 'default_features' to establish the init_fpstate
>xcomp_bv value. Reset individual fpstate xcomp_bv values
>when the rest of the fpstate is reset.
>
>[ dhansen: add reset code from tglx, rewrites
> commit message and comments ]
>
>Fixes: 1c253ff2287f ("x86/fpu: Move xstate feature masks to fpu_*_cfg")
>Suggested-by: Dave Hansen <dave.hansen@xxxxxxxxx>
>Suggested-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>Signed-off-by: Yuan Yao <yuan.yao@xxxxxxxxx>
>Cc: stable@xxxxxxxxxxxxxxx
>---
> arch/x86/kernel/fpu/core.c | 3 +++
> arch/x86/kernel/fpu/xstate.c | 7 ++++++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>index 3b28c5b25e12..4d64de34da12 100644
>--- a/arch/x86/kernel/fpu/core.c
>+++ b/arch/x86/kernel/fpu/core.c
>@@ -526,6 +526,9 @@ static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
> fpstate->xfeatures = fpu_kernel_cfg.default_features;
> fpstate->user_xfeatures = fpu_user_cfg.default_features;
> fpstate->xfd = xfd;
>+
>+ /* Ensure that xcomp_bv matches ->xfeatures */
>+ xstate_init_xcomp_bv(&fpstate->regs.xsave, fpstate->xfeatures);
> }
>
> void fpstate_reset(struct fpu *fpu)
>diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>index c8340156bfd2..f9f45610c72f 100644
>--- a/arch/x86/kernel/fpu/xstate.c
>+++ b/arch/x86/kernel/fpu/xstate.c
>@@ -360,7 +360,12 @@ static void __init setup_init_fpu_buf(void)
>
> print_xstate_features();
>
>- xstate_init_xcomp_bv(&init_fpstate.regs.xsave, fpu_kernel_cfg.max_features);
>+ /*
>+ * 'init_fpstate' is sized for the default feature
>+ * set without any of the dynamic features.
>+ */
>+ xstate_init_xcomp_bv(&init_fpstate.regs.xsave,
>+ fpu_kernel_cfg.default_features);

Below trace is observed on host kernel with this patch applied when create VM.

The reason is __copy_xstate_to_uabi_buf() copies data from &init_fpstate when the component
is not existed in the source kernel fpstate (here is the AMX tile component), but the
AMX TILE bit is removed from init_fpstate due to this patch, so the WARN is triggered and return
NULL which causes kernel NULL pointer dereference later.

I considered 2 possible ways to fix this without big change:
1. For such non-exist component, we can fill the buffer in target fpstate to 0 and don't WARN if the bit is not set
in init_fpstate.
2. Enlarge the init_fpstate to content the dynamic components and still keel fpu_kernel_cfg.max_features
in init_fpstate (but still remove the dynamic bit from new cloned thread), so we can use it safely in above case,
but near 2 pages (8K) is wasted.

[ 100.989998] ------------[ cut here ]------------
[ 100.995332] WARNING: CPU: 109 PID: 2910 at arch/x86/kernel/fpu/xstate.c:944 __raw_xsave_addr+0x49/0x50
[ 101.005948] Modules linked in: kvm_intel kvm x86_pkg_temp_thermal snd_pcm input_leds snd_timer led_class joydev mac_hid irqbypass efi_pstore snd soundcore button sch_fq_codel ip_tables x_tables ixgbe mdio mdio_devres libphy igc [last unloaded: kvm]
[ 101.030924] CPU: 109 PID: 2910 Comm: CPU 0/KVM Tainted: G I 6.0.0-rc6-kmv-upstream-queue+ #23
[ 101.054199] RIP: 0010:__raw_xsave_addr+0x49/0x50
[ 101.059511] Code: 85 15 b3 44 39 01 74 1c 0f 1f 44 00 00 48 0f a3 f7 73 17 e8 d9 fe ff ff 89 c0 48 01 d8 5b 5d c3 cc cc cc cc 0f 0b 31 c0 eb f3 <0f> 0b 31 c0 eb ed 90 0f 1f 44 00 00 55 49 89 d0 48 89 f1 48 89 e5
[ 101.080865] RSP: 0018:ffa000000b8cbc30 EFLAGS: 00010206
[ 101.086870] RAX: 0000000000002000 RBX: ffffffff823be3c0 RCX: 0000000000000012
[ 101.095038] RDX: 0000000000040000 RSI: 0000000000000012 RDI: 80000000000206e7
[ 101.103177] RBP: ffa000000b8cbc38 R08: ffffffff823bf400 R09: 0000000000000001
[ 101.111350] R10: ffa000000b8cbc70 R11: ff110008ef3f4008 R12: ff110008ef3f4b00
[ 101.119500] R13: 0000000000002000 R14: 0000000000000012 R15: 0000000000000012
[ 101.127678] FS: 00007fca15639700(0000) GS:ff1100104fd40000(0000) knlGS:0000000000000000
[ 101.136922] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 101.143512] CR2: 00007fca202a9001 CR3: 00000008f49a8005 CR4: 0000000000773ee0
[ 101.151658] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 101.159817] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 101.167961] PKRU: 55555554
[ 101.171110] Call Trace:
[ 101.173972] <TASK>
[ 101.176435] __copy_xstate_to_uabi_buf+0x33b/0x870
[ 101.181936] fpu_copy_guest_fpstate_to_uabi+0x28/0x80
[ 101.187720] kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
[ 101.196409] ? __this_cpu_preempt_check+0x13/0x20
[ 101.204924] ? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
[ 101.213531] kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[ 101.221507] ? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[ 101.229661] ? __fget_light+0xd4/0x130
[ 101.237011] __x64_sys_ioctl+0xe3/0x910
[ 101.244395] ? debug_smp_processor_id+0x17/0x20
[ 101.252463] ? fpregs_assert_state_consistent+0x27/0x50
[ 101.261228] do_syscall_64+0x3f/0x90
[ 101.268090] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 101.276574] RIP: 0033:0x7fca1b8f362b
[ 101.283312] Code: 0f 1e fa 48 8b 05 5d b8 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d b8 2c 00 f7 d8 64 89 01 48
[ 101.312528] RSP: 002b:00007fca15638488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 101.323793] RAX: ffffffffffffffda RBX: 000055b3c88ea9c0 RCX: 00007fca1b8f362b
[ 101.334583] RDX: 00007fca0c001000 RSI: ffffffff9000aecf RDI: 000000000000000e
[ 101.345320] RBP: 00007fca15638580 R08: 000055b3c80d95f0 R09: 000055b3c8820320
[ 101.355994] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffeed19156e
[ 101.366611] R13: 00007ffeed19156f R14: 00007ffeed191630 R15: 00007fca15638880
[ 101.377221] </TASK>
[ 101.382225] ---[ end trace 0000000000000000 ]---
[ 101.390004] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 101.400468] #PF: supervisor read access in kernel mode
[ 101.408908] #PF: error_code(0x0000) - not-present page
[ 101.417319] PGD 88bd3e067 P4D 0
[ 101.423541] Oops: 0000 [#1] PREEMPT SMP
[ 101.430410] CPU: 109 PID: 2910 Comm: CPU 0/KVM Tainted: G W I 6.0.0-rc6-kmv-upstream-queue+ #23
[ 101.463968] RIP: 0010:memcpy_erms+0x6/0x10
[ 101.471389] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 cc cc cc cc 66 90 48 89 f8 48 89 d1 <f3> a4 c3 cc cc cc cc 0f 1f 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 101.501238] RSP: 0018:ffa000000b8cbc40 EFLAGS: 00010246
[ 101.510134] RAX: ff110008ef3f4b00 RBX: 0000000000002000 RCX: 0000000000002000
[ 101.521233] RDX: 0000000000002000 RSI: 0000000000000000 RDI: ff110008ef3f4b00
[ 101.532327] RBP: ffa000000b8cbce0 R08: ffffffff823bf400 R09: 0000000000000001
[ 101.543425] R10: ffa000000b8cbc70 R11: ff110008ef3f4008 R12: ff110008ef3f6b00
[ 101.554530] R13: 0000000000000000 R14: 0000000000000012 R15: 0000000000000012
[ 101.565652] FS: 00007fca15639700(0000) GS:ff1100104fd40000(0000) knlGS:0000000000000000
[ 101.577915] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 101.587555] CR2: 0000000000000000 CR3: 00000008f49a8005 CR4: 0000000000773ee0
[ 101.598786] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 101.609993] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 101.621163] PKRU: 55555554
[ 101.627358] Call Trace:
[ 101.633163] <TASK>
[ 101.638474] ? __copy_xstate_to_uabi_buf+0x381/0x870
[ 101.646977] fpu_copy_guest_fpstate_to_uabi+0x28/0x80
[ 101.655526] kvm_arch_vcpu_ioctl+0x14c/0x1460 [kvm]
[ 101.663832] ? __this_cpu_preempt_check+0x13/0x20
[ 101.671938] ? vmx_vcpu_put+0x2e/0x260 [kvm_intel]
[ 101.680146] kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[ 101.687746] ? kvm_vcpu_ioctl+0xea/0x6b0 [kvm]
[ 101.695513] ? __fget_light+0xd4/0x130
[ 101.702466] __x64_sys_ioctl+0xe3/0x910
[ 101.709507] ? debug_smp_processor_id+0x17/0x20
[ 101.717289] ? fpregs_assert_state_consistent+0x27/0x50
[ 101.725823] do_syscall_64+0x3f/0x90
[ 101.732497] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 101.740842] RIP: 0033:0x7fca1b8f362b
[ 101.747555] Code: 0f 1e fa 48 8b 05 5d b8 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2d b8 2c 00 f7 d8 64 89 01 48
[ 101.776677] RSP: 002b:00007fca15638488 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 101.787889] RAX: ffffffffffffffda RBX: 000055b3c88ea9c0 RCX: 00007fca1b8f362b
[ 101.798637] RDX: 00007fca0c001000 RSI: ffffffff9000aecf RDI: 000000000000000e
[ 101.809379] RBP: 00007fca15638580 R08: 000055b3c80d95f0 R09: 000055b3c8820320
[ 101.820038] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffeed19156e
[ 101.830642] R13: 00007ffeed19156f R14: 00007ffeed191630 R15: 00007fca15638880
[ 101.841228] </TASK>
[ 101.846230] Modules linked in: kvm_intel kvm x86_pkg_temp_thermal snd_pcm input_leds snd_timer led_class joydev mac_hid irqbypass efi_pstore snd soundcore button sch_fq_codel ip_tables x_tables ixgbe mdio mdio_devres libphy igc [last unloaded: kvm]
[ 101.878964] CR2: 0000000000000000
[ 101.885472] ---[ end trace 0000000000000000 ]---

>
> /*
> * Init all the features state with header.xfeatures being 0x0
>--
>2.34.1