x86/fpu/xsave: protection key test failures
From: Babu Moger
Date: Tue May 25 2021 - 17:37:16 EST
Hi All,
We have been noticing protection key test failures on our systems here.
Shggy(Dave Kleikamp) from oracle reported to this problem couple of weeks
ago. Here the is the failure.
# ./protection_keys_64
has pkeys: 1
startup pkey_reg: 0000000055555550
test 0 PASSED (iteration 1)
test 1 PASSED (iteration 1)
test 2 PASSED (iteration 1)
test 3 PASSED (iteration 1)
test 4 PASSED (iteration 1)
test 5 PASSED (iteration 1)
test 6 PASSED (iteration 1)
test 7 PASSED (iteration 1)
test 8 PASSED (iteration 1)
test 9 PASSED (iteration 1)
test 10 PASSED (iteration 1)
test 11 PASSED (iteration 1)
test 12 PASSED (iteration 1)
test 13 PASSED (iteration 1)
protection_keys_64: pkey-helpers.h:127: _read_pkey_reg: Assertion
`pkey_reg == shadow_pkey_reg' failed.
Aborted (core dumped)
The test that is failing is "test_ptrace_of_child".
Sometimes it fails even earlier also. Sometimes(very rarely) it
passes when I insert few printfs. Most probably it fails with
test_ptrace_of_child.
In the test "test_ptrace_of_child", the parent thread attaches to the
child and modifies the data structure associated with protection key.
Verifies the access results based on the key permissions. While running
the test, the tool finds the key permission changes out of nowhere. The
test fails with assert "assert(pkey_reg == shadow_pkey_reg);"
Investigation so far.
1. The test fails on AMD(Milan) systems. Test appears to pass on Intel
systems. Tested on old Intel system available here.
2. I was trying to see if the hardware(or firmware) is corrupting the pkru
register value. At this point, it does not appear that way. I was able to
trace all the MSR writes to the application or kernel. At this point, it
does not appear to me as an hardware issue. I see that kernel appears to
do save/restore properly during the context switch. This value stays
default(value 0x55555554) in most cases unless the application changes the
default permissions. Only application that changes here is
"protection_keys".
3. I played with the test tool little bit. The behavior changes
drastically when I make minor changes.
For example, in the following code.
void setup_handlers(void)
{
signal(SIGCHLD, &sig_chld);
setup_sigsegv_handler();
}
Just commenting the first line "signal(SIGCHLD, &sig_chld);" changes the
behavior drastically. I see some tests PASS after this change. The first
line appear to not do anything except some printing.
I still have not figured out completely what is going on with
setup_sigsegv_handler();. It seems very odd modifying some structures in
the signal handler. I am not sure about some of the offsets it is trying
to modify. I feel it may be messing up something there. I am not sure
though. Will have to investigate.
4. I took the traces(x86_fpu) while running test. It appears to show some
of the feature headers are not copied during the test. But I could not
figure out why it was happening. It appears to show not all the feature
headers are copied when the new child is created. When I printed the
feature bits, they all appear to be intact. Here is the trace.
protection_keys-17350 [035] 59275.833511: x86_fpu_regs_activated:
x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834197: x86_fpu_copy_src: x86/fpu:
0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834197: x86_fpu_copy_dst: x86/fpu:
0xffff93d722877800 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207
protection_keys-17351 [040] 59275.834274: x86_fpu_regs_activated:
x86/fpu: 0xffff93d722877800 load: 1 xfeatures: 2 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834283: x86_fpu_regs_deactivated:
x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207
protection_keys-17351 [040] 59275.834289: x86_fpu_regs_deactivated:
x86/fpu: 0xffff93d722877800 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834296: x86_fpu_regs_activated:
x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834298: x86_fpu_regs_activated:
x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834406: x86_fpu_before_save: x86/fpu:
0xffff93d7595e2dc0 load: 0 xfeatures: 2 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834406: x86_fpu_after_save: x86/fpu:
0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834408: x86_fpu_before_save: x86/fpu:
0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834408: x86_fpu_after_save: x86/fpu:
0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834654: x86_fpu_regs_deactivated:
x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834654: x86_fpu_dropped: x86/fpu:
0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207
auditd-1834 [036] 59275.834655:
x86_fpu_regs_activated: x86/fpu: 0xffff93d713fef800 load: 1 xfeatures: 202
xcomp_bv: 8000000000000207
protection_keys-17350 [035] 59275.834665: x86_fpu_regs_deactivated:
x86/fpu: 0xffff93d7595e2dc0 load: 0 xfeatures: 202 xcomp_bv: 8000000000000207
<...>-17285 [041] 59275.834679: x86_fpu_regs_activated:
x86/fpu: 0xffff93d716d0df40 load: 1 xfeatures: 202 xcomp_bv: 8000000000000207
5. I instrumented parent child interactions using a separate standalone
application(without the signal handler), it appears to work just fine. I
see PRKU register staying intact swiching from parent to child.
My suspicion at this point is towards the selftest tool protection_keys.c.
I will keep looking. Any feedback would be much appreciated to debug further.
Shaggy, Feel free to add if I missed something.
Thanks
Babu