Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism
From: H. Peter Anvin
Date: Fri Aug 16 2024 - 18:59:50 EST
On 8/16/24 14:26, H. Peter Anvin wrote:
On 8/16/24 11:40, Andrew Cooper wrote:
As the CALL instruction is 5 bytes long, and we need to pad with NOPs for both
WRMSR and WRMSRNS, what about not using a segment prefix at all?
You can use up to 4 prefixes of any kind (which includes opcode prefixes
before 0F) before most decoders start hurting, so we can pad it out to 5
bytes by doing 3e 3e .. .. ..
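For illustration, an untested sketch of that prefix-padding idea using the
ordinary ALTERNATIVE_2() machinery; the wrmsr_padded() name is made up,
X86_FEATURE_XENPV stands in for whatever key the paravirt alternative ends
up using, and the called stub is assumed to take the value in %edx:%eax:

static __always_inline void wrmsr_padded(u32 msr, u64 val)
{
	/*
	 * All three alternatives are 5 bytes long, so no NOP padding:
	 *   3e 3e 3e 0f 30   ds ds ds wrmsr
	 *   3e 3e 0f 01 c6   ds ds wrmsrns
	 *   e8 xx xx xx xx   call asm_xen_write_msr
	 */
	asm volatile(ALTERNATIVE_2(".byte 0x3e, 0x3e, 0x3e, 0x0f, 0x30",
				   ".byte 0x3e, 0x3e, 0x0f, 0x01, 0xc6",
				   X86_FEATURE_WRMSRNS,
				   "call asm_xen_write_msr",
				   X86_FEATURE_XENPV)
		     : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
		     : "memory");
}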
My suggestion, not that I've had time to experiment, was to change
paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
edx:eax into rsi. That way the top level wrmsr() retains sensible
codegen for native even when paravirt is active.
I have attached what should be an "obvious" example... famous last words.
After looking at what xen_do_write_msr looks like, I realized we can do
*much* better than that. This means using two alternative sequences, one
for native/Xen and the other for native wrmsr/wrmsrns.
The call/jz sequence is *exactly* the same length as mov, shr, which
means that xen_do_write_msr can simply return a flag if it wants the MSR
access to be performed natively.
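Concretely, the call site can then stack two independent alternatives,
something like this (untested sketch; wrmsr64() is a made-up name,
X86_FEATURE_XENPV stands in for the paravirt key, and the jz is spelled
out as bytes to make the fixed .+5 target explicit):

static __always_inline void wrmsr64(u32 msr, u64 val)
{
	asm volatile(/* 7 bytes either way: mov+shr vs. call+jz */
		     ALTERNATIVE("mov %%rax, %%rdx\n\t"
				 "shr $32, %%rdx",
				 "call asm_xen_write_msr\n\t"
				 ".byte 0x74, 0x03",	/* jz .+5 */
				 X86_FEATURE_XENPV)
		     /* 3 bytes either way: ds wrmsr vs. wrmsrns */
		     ALTERNATIVE(".byte 0x3e, 0x0f, 0x30",
				 ".byte 0x0f, 0x01, 0xc6",
				 X86_FEATURE_WRMSRNS)
		     : : "c" (msr), "a" (val)
		     : "rdx", "memory");
}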
An important point is that in no code path can the error code be set to
-EIO except by native MSR invocation, so the only possibilities are
"done successfully" or "execute natively." [That being said, that would
not be *that* hard to fix if needed.]
The new C prototype for xen_do_write_msr() becomes:
bool xen_do_write_msr(uint32_t msr, uint64_t val)
... with a true return meaning "execute natively."
(I changed the order of the arguments from my last version since it is
more consistent with what is used everywhere else.)
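For reference, the C side then shapes up roughly like this (a sketch only;
the MSR_FS_BASE case is just an example of the MSRs Xen already intercepts,
and the PMU call matches the pmu_msr_chk_emulated() usage discussed in the
NOTE below):

bool xen_do_write_msr(uint32_t msr, uint64_t val)
{
	bool emulated;

	/* MSRs emulated by the Xen PMU code are fully handled here */
	if (pmu_msr_chk_emulated(msr, &val, false, &emulated) && emulated)
		return false;

	switch (msr) {
	case MSR_FS_BASE:
		HYPERVISOR_set_segment_base(SEGBASE_FS, val);
		return false;			/* done */
	/* ... the other MSRs Xen intercepts ... */
	default:
		return true;			/* execute natively */
	}
}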
RDMSR is a bit trickier. I think the best option there is to set up a
new trap handler type that amounts to "get the address from the stack,
apply a specific offset, and invoke the fixup handler for that
address":
	case EX_TYPE_UPLEVEL: {
		/*
		 * Let reg hold the unsigned number of machine words to
		 * pop off the stack before the return address, and imm
		 * the signed offset from the return address to the
		 * desired trap point.
		 */
		unsigned long *sp = (unsigned long *)regs->sp + reg;

		regs->ip = *sp++ + (int16_t)imm;
		regs->sp = (unsigned long)sp;
		goto again;	/* Loop back to the beginning */
	}
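To spell out the "goto again" part: the handler re-runs the exception table
search at the adjusted IP, so whatever fixup is registered for the call-site
trap point then gets applied. A sketch of how it could slot into
fixup_exception(), with all the existing cases elided:

int fixup_exception(struct pt_regs *regs, int trapnr,
		    unsigned long error_code, unsigned long fault_addr)
{
	const struct exception_table_entry *e;

again:
	e = search_exception_tables(regs->ip);
	if (!e)
		return 0;

	switch (FIELD_GET(EX_DATA_TYPE_MASK, e->data)) {
	case EX_TYPE_UPLEVEL: {
		int reg = FIELD_GET(EX_DATA_REG_MASK, e->data);
		int imm = FIELD_GET(EX_DATA_IMM_MASK, e->data);
		unsigned long *sp = (unsigned long *)regs->sp + reg;

		regs->ip = *sp++ + (int16_t)imm;
		regs->sp = (unsigned long)sp;
		goto again;	/* apply the fixup registered at the new IP */
	}
	/* ... all the existing EX_TYPE_* cases ... */
	default:
		return 0;
	}
}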
Again, "obviously correct" code attached.
-hpa
NOTE:
I had to dig several levels down to uncover the actual situation, but
pmu_msr_write() is actually a completely pointless function: all it does
is shuffle some arguments, then calls pmu_msr_chk_emulated() and if it
returns true AND the emulated flag is clear then does *exactly the same
thing* that the calling code would have done if pmu_msr_write() itself
had returned false... in other words, the whole function could be
replaced by:
bool pmu_msr_write(uint32_t msr, uint64_t val)
{
	bool emulated;

	return pmu_msr_chk_emulated(msr, &val, false, &emulated) &&
	       emulated;
}
pmu_msr_read() does the equivalent stupidity; the corresponding fix(es) to
pmu_msr_chk_emulated() I hope are obvious.
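For completeness, the read side presumably collapses the same way; a sketch,
assuming it keeps a (msr, &val) style interface after the cleanup:

bool pmu_msr_read(uint32_t msr, uint64_t *val)
{
	bool emulated;

	return pmu_msr_chk_emulated(msr, val, true, &emulated) &&
	       emulated;
}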
/*
* Input in %rax, MSR number in %ecx
*
* %edx is set up to match %rax >> 32 like the native stub
* is expected to do
*
* Change xen_do_write_msr to return a true value if
* the MSR access should be executed natively (or vice versa,
* if you prefer.)
*
* bool xen_do_write_msr(uint32_t msr, uint64_t value)
*
* Let the native pattern look like:
*
* 48 89 c2 mov %rax,%rdx
* 48 c1 ea 20 shr $32,%rdx
* 3e 0f 30 ds wrmsr <--- trap point
*
* ... which can be replaced with ...
* 48 89 c2 mov %rax,%rdx
* 48 c1 ea 20 shr $32,%rdx
* 0f 01 c6 wrmsrns <--- trap point
*
* FOR XEN, replace the FIRST SEVEN BYTES with:
* e8 xx xx xx xx call asm_xen_write_msr
* 74 03 jz .+5
*
* If ZF=0 then this will fall through to the actual native WRMSR[NS]
* instruction.
*
* This also removes the need for Xen to maintain different safe and
* unsafe MSR routines, as the difference is handled by the same
* trap handler as is used natively.
*/
SYM_FUNC_START(asm_xen_write_msr)
	FRAME_BEGIN
	push %rax			/* Save in case of native fallback */
	push %rcx
	push %rsi
	push %rdi
	push %r8
	push %r9
	push %r10
	push %r11
	mov %ecx,%edi			/* MSR number */
	mov %rax,%rsi			/* MSR data */
	call xen_do_write_msr
	test %al,%al			/* ZF=1 means we are done */
	pop %r11
	pop %r10
	pop %r9
	pop %r8
	pop %rdi
	pop %rsi
	pop %rcx
	mov 4(%rsp),%edx		/* Set up %edx for native execution */
	pop %rax
	FRAME_END
	RET
SYM_FUNC_END(asm_xen_write_msr)
/*
* RDMSR code. It isn't quite as clean; it requires a new trap handler
* type:
*
*	case EX_TYPE_UPLEVEL: {
*		/* Let reg hold the unsigned number of machine
*		 * words to pop off the stack before the return
*		 * address, and imm the signed offset from the
*		 * return address to the desired trap point. * /
*		unsigned long *sp = (unsigned long *)regs->sp + reg;
*
*		regs->ip = *sp++ + (int16_t)imm;
*		regs->sp = (unsigned long)sp;
*		goto again;	/* Loop back to the beginning * /
*	}
*
* The prototype of the Xen C code:
*	struct { uint64_t val; bool native; } xen_do_read_msr(uint32_t msr)
*
* Native code:
* 0f 32 rdmsr <--- trap point
* 48 c1 e2 20 shl $0x20,%rdx
* 48 09 d0 or %rdx,%rax
*
* Xen code (cs rex is just for padding)
* 2e 40 e8 xx xx xx xx cs rex call asm_xen_read_msr
*/
SYM_FUNC_START(asm_xen_read_msr)
	FRAME_BEGIN
	push %rcx
	push %rsi
	push %rdi
	push %r8
	push %r9
	push %r10
	push %r11
	mov %ecx,%edi			/* MSR number */
	call xen_do_read_msr
	test %dl,%dl			/* ZF=1 means we are done */
	pop %r11
	pop %r10
	pop %r9
	pop %r8
	pop %rdi
	pop %rsi
	pop %rcx
	jz 2f
1:
	rdmsr
	/*
	 * The top of the stack points directly at the return address;
	 * back up by 7 bytes from the return address.
	 */
	_ASM_EXTABLE_TYPE(1b, 2f, \
			  EX_TYPE_UPLEVEL|EX_DATA_IMM(-7)|EX_DATA_REG(0))
	shl $32,%rdx
	or %rdx,%rax
2:
	FRAME_END
	RET
SYM_FUNC_END(asm_xen_read_msr)
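And the read-side call site, again as an untested sketch (rdmsr64() is a
made-up name and X86_FEATURE_XENPV stands in for the paravirt key); note
that the cs/rex padding sits *before* the call so that return address - 7
lands exactly on the native RDMSR trap point:

static __always_inline u64 rdmsr64(u32 msr)
{
	u64 val;

	asm volatile(ALTERNATIVE("rdmsr\n\t"
				 "shl $32, %%rdx\n\t"
				 "or %%rdx, %%rax",
				 ".byte 0x2e, 0x40\n\t"	/* cs rex padding */
				 "call asm_xen_read_msr",
				 X86_FEATURE_XENPV)
		     : "=a" (val)
		     : "c" (msr)
		     : "rdx", "memory");
	return val;
}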