Re: [PATCH 09/22] KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)

From: James Houghton
Date: Fri Sep 06 2024 - 22:27:11 EST


On Fri, Sep 6, 2024 at 5:53 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Fri, Sep 06, 2024, James Houghton wrote:
> > On Fri, Aug 9, 2024 at 12:43 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > Add two phases to mmu_stress_test to verify that KVM correctly handles
> > > guest memory that was writable, and then made read-only in the primary MMU,
> > > and then made writable again.
> > >
> > > Add bonus coverage for x86 to verify that all of guest memory was marked
> > > read-only. Making forward progress (without making memory writable)
> > > requires arch specific code to skip over the faulting instruction, but the
> > > test can at least verify each vCPU's starting page was made read-only.
> > >
> > > Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> >
> > Writing off-list because I just have some smaller questions that I
> > don't want to bother the list with.
>
> Pulling everyone and the lists back in :-)
>
> IMO, no question is too small for kvm@, and lkml@ is a gigantic firehose that's 99%
> archival and 1% list, at best. Odds are very, very good that if you have a
> question, however trivial or small, then someone else has the exact same question,
> or _will_ have the question in the future.
>
> I strongly prefer that all questions, review, feedback, etc. happen on list, even
> if the questions/feedback may seem trivial or noisy. The only exception is if
> information can't/shouldn't be made public, e.g. because of an embargo, NDA,
> security implications, etc.

I'll keep this in mind, thanks!

> > For the next version, feel free to add:
> >
> > Reviewed-by: James Houghton <jthoughton@xxxxxxxxxx>
> >
> > All of the selftest patches look fine to me.
> >
> > > ---
> > > tools/testing/selftests/kvm/mmu_stress_test.c | 87 ++++++++++++++++++-
> > > 1 file changed, 84 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> > > index 50c3a17418c4..98f9a4660269 100644
> > > --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> > > +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> > > @@ -16,6 +16,8 @@
> > > #include "guest_modes.h"
> > > #include "processor.h"
> > >
> > > +static bool mprotect_ro_done;
> > > +
> > > static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> > > {
> > > uint64_t gpa;
> > > @@ -31,6 +33,33 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> > > *((volatile uint64_t *)gpa);
> > > GUEST_SYNC(2);
> > >
> > > + /*
> > > + * Write to the region while mprotect(PROT_READ) is underway. Keep
> > > + * looping until the memory is guaranteed to be read-only, otherwise
> > > + * vCPUs may complete their writes and advance to the next stage
> > > + * prematurely.
> > > + */
> > > + do {
> > > + for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> > > +#ifdef __x86_64__
> > > + asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory");
> >
> > Ok so this appears to be a `mov BYTE PTR [rax + 0x0], 0x0`, where %rax = gpa. :)
> >
> > Does '0xc6,0x0,0x0' also work? It seems like that translates to `mov
> > BYTE PTR [rax], 0x0`. (just curious, no need to change it)
>
> LOL, yes, but as evidenced by the trailing comment, my intent was to generate
> "mov rax, [rax]", not "movb $0, [rax]". I suspect I was too lazy to consult the
> SDM to recall the correct opcode and simply copied an instruction from some random
> disassembly output without looking too closely at the output.
>
> asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
>
> > And I take it you wrote it out like this (instead of using mnemonics)
> > so that you could guarantee that IP + 4 would be the right way to skip
> > forwards. Does it make sense to leave a comment about that?
>
> Yes and yes.
>
> > The translation from mnemonic --> bytes won't change...
>
> Heh, this is x86, by no means is that guaranteed. E.g. see the above, where the
> same mnemonic can be represented multiple ways.
>
> > so could you just write the proper assembly? (not a request, just curious)
>
> In practice, probably. But the rules for inline assembly are, at best, fuzzy.
> So long as the correct instruction is generated, the assembler has quite a bit
> of freedom.
>
> E.g. similar to above, "mov %rax,(%rax)" can (should) be encoded as:
>
> 48 89 00
>
> but can also be encoded as
>
> 48 89 40 00
>
> Now, it's _extremely_ unlikely a compiler will actually generate the latter, but
> it's perfectly legal to do so. E.g. with gcc-13, this
>
> mov %rax, 0x0(%rax)
>
> generates
>
> 48 89 00
>
> even though a more literal interpretation would be
>
> 48 89 40 00

Oh... neat! I'm glad I asked about this.
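
(For my own notes: a quick, throwaway way to see which encoding the
assembler actually picks -- assuming gcc and objdump are on hand, and with
the function name made up by me -- is something like

	/* enc.c: build with "gcc -c enc.c", inspect with "objdump -d enc.o" */
	void store_gpa(unsigned long gpa)
	{
		/* Written with an explicit 0x0 displacement on purpose. */
		asm volatile("mov %%rax, 0x0(%%rax)" :: "a"(gpa) : "memory");
	}

which should show the 3-byte "48 89 00" form you saw with gcc-13, i.e. the
zero displacement silently dropped, rather than the literal 4-byte
"48 89 40 00". Not something to run, just to disassemble.)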

>
> So yeah, while the hand-coded opcode is gross and annoying, given that a failure
> due to the "wrong" instruction being generated would be painful and time consuming
> to debug, hand-coding is worth avoiding the risk and potential pain if the compiler
> decides to be mean :-)

I 100% agree with hand-writing the opcode given what you've said. :)
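
Mostly to convince myself that the hand-coded bytes really are the intended
store, I hacked up a tiny standalone sanity check (x86-64 userspace only;
the names are mine and have nothing to do with the selftest):

	#include <assert.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t val = 0;
		uint64_t gpa = (uint64_t)&val;	/* point the "GPA" at a local */

		/* mov %rax, (%rax): store the address into the slot it points at */
		asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory");

		assert(val == gpa);
		return 0;
	}

Obviously not for the test itself, just an easy way to watch the 3-byte
sequence perform an 8-byte store.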

> > A comment that 0x40 corresponds to %rax and that "a" also corresponds
> > to %rax would have been helpful for me. :)
>
> Eh, I get what you're saying, but giving a play-by-play of the encoding isn't
> really all that reasonable because _so_ much information needs to be conveyed to
> capture the entire picture, and some things are essentially table stakes when it
> comes to x86 kernel programming.
>
>
> E.g. 0x40 doesn't simply mean "(%rax)", it's a full ModR/M that defines the
> addressing mode, which in turn depends on the operating mode (64-bit).
>
> And "a" isn't just %rax; it's specifically an input register constraint, e.g. is
> distinctly different than:
>
> asm volatile(".byte 0x48,0x89,0x0" : "+a"(gpa) :: "memory"); /* mov %rax, (%rax) */
>
> even though in this specific scenario they generate the same code.
>
> And with the correct "48 89 00", understanding the full encoding requires describing
> REX prefixes, which are a mess unto themselves.
>
> So, a trailing comment (with the correct mnemonic) is all I'm willing to do, even
> though I 100% agree that it puts a decent sized load on the reader. There's just
> _too_ much information to communicate to the reader, at least for x86.

The trailing comment works for me! Thanks for all the detail -- I am
learning so much.
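
(Writing out the decode of the corrected encoding, mostly for my own
benefit and for anyone else skimming the archive later:

	48	REX prefix with W=1, i.e. 64-bit operand size
	89	MOV r/m64, r64 (store form)
	00	ModR/M: mod=00 (register-indirect, no displacement),
		        reg=000 (%rax as the source),
		        rm=000  ((%rax) as the destination)

so "48 89 00" is the 3-byte "mov %rax, (%rax)", and the 0x40 ModR/M byte in
the old 4-byte sequence is the mod=01 flavor that tacks on an 8-bit
displacement.)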

>
> > > +#else
> > > + vcpu_arch_put_guest(*((volatile uint64_t *)gpa), gpa);
> > > +#endif
> > > + } while (!READ_ONCE(mprotect_ro_done));
> > > +
> > > + /*
> > > + * Only x86 can explicitly sync, as other architectures will be stuck
> > > + * on the write fault.
> >
> > It would also have been a little clearer if the comment also said that
> > this is just because the test has been written to increment the PC
> > upon getting these write faults *for x86 only*. IDK something like
> > "For x86, the test will adjust the PC for each write fault, allowing
> > the above loop to complete. Other architectures will get stuck, so the
> > #3 sync step is skipped."
>
> Ya. Untested, but how about this?

LGTM! (I haven't tested it either.)

>
> diff --git a/tools/testing/selftests/kvm/mmu_stress_test.c b/tools/testing/selftests/kvm/mmu_stress_test.c
> index 2d66c2724336..29acb22ea387 100644
> --- a/tools/testing/selftests/kvm/mmu_stress_test.c
> +++ b/tools/testing/selftests/kvm/mmu_stress_test.c
> @@ -38,11 +38,18 @@ static void guest_code(uint64_t start_gpa, uint64_t end_gpa, uint64_t stride)
> * looping until the memory is guaranteed to be read-only, otherwise
> * vCPUs may complete their writes and advance to the next stage
> * prematurely.
> + *
> + * For architectures that support skipping the faulting instruction,
> + * generate the store via inline assembly to ensure the exact length
> + * of the instruction is known and stable (vcpu_arch_put_guest() on
> + * fixed-length architectures should work, but the cost of paranoia
> + * is low in this case). For x86, hand-code the exact opcode so that
> + * there is no room for variability in the generated instruction.
> */
> do {
> for (gpa = start_gpa; gpa < end_gpa; gpa += stride)
> #ifdef __x86_64__
> - asm volatile(".byte 0xc6,0x40,0x0,0x0" :: "a" (gpa) : "memory"); /* MOV RAX, [RAX] */
> + asm volatile(".byte 0x48,0x89,0x00" :: "a"(gpa) : "memory"); /* mov %rax, (%rax) */

FWIW I much prefer the trailing comment you have ended up with vs. the
one you had before. (To me, the older one _seems_ like it's Intel
syntax, in which case the comment says it's a load...? The comment you
have now is, to me, obviously indicating a store. Though... perhaps
"movq"?)


> #elif defined(__aarch64__)
> asm volatile("str %0, [%0]" :: "r" (gpa) : "memory");
> #else
> @@ -163,7 +170,7 @@ static void *vcpu_worker(void *data)
> TEST_ASSERT_EQ(errno, EFAULT);
> #ifdef __x86_64__
> WRITE_ONCE(vcpu->run->kvm_dirty_regs, KVM_SYNC_X86_REGS);
> - vcpu->run->s.regs.regs.rip += 4;
> + vcpu->run->s.regs.regs.rip += 3;
> #endif
> #ifdef __aarch64__
> vcpu_set_reg(vcpu, ARM64_CORE_REG(regs.pc),
>