Re: [PATCH v1 2/3] x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism

From: H. Peter Anvin
Date: Fri Aug 16 2024 - 18:59:50 EST


On 8/16/24 14:26, H. Peter Anvin wrote:
On 8/16/24 11:40, Andrew Cooper wrote:

As the CALL instruction is 5 bytes long, and we need to pad with NOPs for
both WRMSR and WRMSRNS, what about not using a segment prefix at all?


You can use up to 4 prefixes of any kind (which includes opcode prefixes before 0F) before most decoders start hurting, so we can pad it out to 5 bytes by doing 3e 3e .. .. ..


My suggestion, not that I've had time to experiment, was to change
paravirt to use a non-C ABI and have asm_xen_write_msr() recombine
edx:eax into rsi.  That way the top level wrmsr() retains sensible
codegen for native even when paravirt is active.


I have attached what should be an "obvious" example... famous last words.


After looking at what xen_do_write_msr() actually does, I realized we can do *much* better than that. This means using two alternatives: one selecting between the native and Xen paths, the other selecting between WRMSR and WRMSRNS for the native case.

The call/jz sequence is *exactly* the same length as the mov/shr pair, which means that xen_do_write_msr() can simply return a flag if it wants the MSR access to be performed natively.

An important point is that in no code path can the error code be set to -EIO except by native MSR invocation, so the only possibilities are "done successfully" or "execute natively." [That being said, that would not be *that* hard to fix if needed.]

The new C prototype for xen_do_write_msr() becomes:

bool xen_do_write_msr(uint32_t msr, uint64_t val)

... with a true return meaning "execute natively."

(I changed the order of the arguments from my last version since it is more consistent with what is used everywhere else.)
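
For illustration only, this is roughly the shape I have in mind for the C side (a sketch of the contract, not the attached code; the Xen-specific MSR handling is elided, and pmu_msr_write() is assumed to already have the simplified form discussed in the NOTE below):

bool xen_do_write_msr(uint32_t msr, uint64_t val)
{
        /* Handle whatever Xen wants to emulate or redirect here... */

        /* ... otherwise, true means "execute WRMSR[NS] natively at the call site" */
        return !pmu_msr_write(msr, val);
}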

RDMSR is a bit trickier. I think the best option there is to set up a new trap handler type that amounts to "get the return address from the stack, apply a specific offset, and invoke the fixup handler for that address":


case EX_TYPE_UPLEVEL: {
        /*
         * Let reg hold the unsigned number of machine words to pop
         * off the stack before the return address, and imm the
         * signed offset from the return address to the desired
         * trap point.
         */
        unsigned long *sp = (unsigned long *)regs->sp + reg;

        regs->ip = *sp++ + (int16_t)imm;
        regs->sp = (unsigned long)sp;
        goto again;     /* Loop back to the beginning */
}
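
For context, this is roughly where that case and its "again" label would sit (a sketch only, assuming the current shape of fixup_exception() in arch/x86/mm/extable.c, where type, reg and imm are unpacked from the packed extable data with FIELD_GET()):

int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
                    unsigned long fault_addr)
{
        const struct exception_table_entry *e;
        int type, reg, imm;

again:
        e = search_exception_tables(regs->ip);
        if (!e)
                return 0;

        type = FIELD_GET(EX_DATA_TYPE_MASK, e->data);
        reg  = FIELD_GET(EX_DATA_REG_MASK,  e->data);
        imm  = FIELD_GET(EX_DATA_IMM_MASK,  e->data);

        switch (type) {
        case EX_TYPE_UPLEVEL: {
                /* ... the handler shown above ... */
        }
        /* ... all the existing EX_TYPE_* cases, unchanged ... */
        }
}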

Again, "obviously correct" code attached.

-hpa

NOTE:

I had to dig several levels down to uncover the actual situation, but pmu_msr_write() is actually a completely pointless function: all it does is shuffle some arguments, call pmu_msr_chk_emulated(), and, if that returns true AND the emulated flag is clear, do *exactly the same thing* the calling code would have done had pmu_msr_write() declined to handle the MSR in the first place. In other words, the whole function could be replaced by:

bool pmu_msr_write(uint32_t msr, uint64_t val)
{
        bool emulated;

        return pmu_msr_chk_emulated(msr, &val, false, &emulated) &&
               emulated;
}

pmu_msr_read() does the equivalent stupidity; the corresponding fix(es) to pmu_msr_chk_emulated() are, I hope, obvious.
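
The read side would shrink the same way (a sketch; this assumes the error/native-fallback plumbing moves out of pmu_msr_read() under the new scheme):

bool pmu_msr_read(uint32_t msr, uint64_t *val)
{
        bool emulated;

        return pmu_msr_chk_emulated(msr, val, true, &emulated) &&
               emulated;
}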


/*
 * Input in %rax, MSR number in %ecx
 *
 * %edx is set up to match %rax >> 32, mirroring what the native
 * mov/shr sequence would have done
 *
 * Change xen_do_write_msr to return a true value if
 * the MSR access should be executed natively (or vice versa,
 * if you prefer).
 *
 *	bool xen_do_write_msr(uint32_t msr, uint64_t value)
 *
 * Let the native pattern look like:
 *
 *	48 89 c2		mov %rax,%rdx
 *	48 c1 ea 20		shr $32,%rdx
 *	3e 0f 30		ds wrmsr	<--- trap point
 *
 * ... which can be replaced with ...
 *
 *	48 89 c2		mov %rax,%rdx
 *	48 c1 ea 20		shr $32,%rdx
 *	0f 01 c6		wrmsrns		<--- trap point
 *
 * FOR XEN, replace the FIRST SEVEN BYTES with:
 *
 *	e8 xx xx xx xx		call asm_xen_write_msr
 *	74 03			jz .+5
 *
 * If ZF=0 then this will fall through to the actual native WRMSR[NS]
 * instruction.
 *
 * This also removes the need for Xen to maintain different safe and
 * unsafe MSR routines, as the difference is handled by the same
 * trap handler as is used natively.
 */
SYM_FUNC_START(asm_xen_write_msr)
        FRAME_BEGIN
        push %rax               /* Save in case of native fallback */
        push %rcx
        push %rsi
        push %rdi
        push %r8
        push %r9
        push %r10
        push %r11
        mov %ecx,%edi           /* MSR number */
        mov %rax,%rsi           /* MSR data */
        call xen_do_write_msr
        test %al,%al            /* ZF=1 means we are done */
        /* Nothing below modifies the flags, so ZF survives to the jz at the call site */
        pop %r11
        pop %r10
        pop %r9
        pop %r8
        pop %rdi
        pop %rsi
        pop %rcx
        mov 4(%rsp),%edx        /* Set up %edx for native execution */
        pop %rax
        FRAME_END
        RET
SYM_FUNC_END(asm_xen_write_msr)
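
On the call-site side, the WRMSR vs WRMSRNS half of this is an ordinary alternative. A minimal sketch (mine, not part of the attachment; the wrmsrq_native() name is made up, and the encodings are spelled out as bytes so no particular assembler support is needed), assuming the usual ALTERNATIVE macro and the X86_FEATURE_WRMSRNS bit; the native/Xen half would be a second alternative covering the leading mov/shr exactly as described in the comment above:

#include <linux/types.h>
#include <asm/alternative.h>
#include <asm/cpufeatures.h>

static __always_inline void wrmsrq_native(u32 msr, u64 val)
{
        asm volatile("mov %%rax, %%rdx\n\t"
                     "shr $32, %%rdx\n\t"
                     ALTERNATIVE(".byte 0x3e, 0x0f, 0x30",  /* ds wrmsr */
                                 ".byte 0x0f, 0x01, 0xc6",  /* wrmsrns  */
                                 X86_FEATURE_WRMSRNS)
                     : : "c" (msr), "a" (val) : "rdx", "memory");
}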

/*
 * RDMSR code. It isn't quite as clean; it requires a new trap handler
 * type:
 *
 *	case EX_TYPE_UPLEVEL: {
 *		/* Let reg hold the unsigned number of machine words
 *		 * to pop off the stack before the return address,
 *		 * and imm the signed offset from the return address
 *		 * to the desired trap point. * /
 *		unsigned long *sp = (unsigned long *)regs->sp + reg;
 *
 *		regs->ip = *sp++ + (int16_t)imm;
 *		regs->sp = (unsigned long)sp;
 *		goto again;	/* Loop back to the beginning * /
 *	}
 *
 * The prototype of the Xen C code:
 *
 *	struct { uint64_t val; bool native; } xen_do_read_msr(uint32_t msr)
 *
 * Native code:
 *
 *	0f 32			rdmsr	<--- trap point
 *	48 c1 e2 20		shl $0x20,%rdx
 *	48 09 d0		or %rdx,%rax
 *
 * Xen code (the cs and rex prefixes are just padding):
 *
 *	2e 40 e8 xx xx xx xx	cs rex call asm_xen_read_msr
 */

SYM_FUNC_START(asm_xen_read_msr)
        FRAME_BEGIN
        push %rcx
        push %rsi
        push %rdi
        push %r8
        push %r9
        push %r10
        push %r11
        mov %ecx,%edi           /* MSR number */
        call xen_do_read_msr
        test %dl,%dl            /* ZF=1 means we are done */
        pop %r11
        pop %r10
        pop %r9
        pop %r8
        pop %rdi
        pop %rsi
        pop %rcx
        jz 2f
1:
        rdmsr
        /*
         * If this faults, the top of the stack points directly at the
         * return address; back up by 7 bytes from the return address
         * to land on the trap point at the original call site.
         */
        _ASM_EXTABLE_TYPE(1b, 2f, \
                          EX_TYPE_UPLEVEL|EX_DATA_IMM(-7)|EX_DATA_REG(0))
        shl $32,%rdx
        or %rdx,%rax
2:
        FRAME_END
        RET
SYM_FUNC_END(asm_xen_read_msr)
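
Finally, the C counterpart for the read path, matching the struct prototype in the comment above (again only a sketch of the contract; the struct is given a name here for illustration and the Xen-specific handling is elided). The two-member struct comes back in %rax:%rdx per the x86-64 psABI, which is what the test %dl,%dl above relies on:

struct xen_msr_result {
        uint64_t val;           /* returned in %rax */
        bool native;            /* returned in %dl  */
};

struct xen_msr_result xen_do_read_msr(uint32_t msr)
{
        struct xen_msr_result r = { .val = 0, .native = false };
        bool emulated;

        if (pmu_msr_chk_emulated(msr, &r.val, true, &emulated) && emulated)
                return r;       /* emulated: r.val already holds the value */

        r.native = true;        /* tell the stub to execute RDMSR natively */
        return r;
}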