Re: [PATCH 08/11] KVM: implement multiple address spaces

From: Paolo Bonzini
Date: Wed May 20 2015 - 03:07:37 EST

On 19/05/2015 20:28, Radim Krčmář wrote:
> If I understand your aim correctly, we could go with just one dirty map
> in this way:
> - have a dirty bitmap that covers the userspace memory of all slots
> that have dirty logging enabled
> - store an offset into this bitmap in those slots
>
> It would be a framework that allows userspace to query dirty pages based
> on userspace address space. (Changing slots would be costly, but I
> don't think it's done very often.)

Yes, that would work. However, it would be a significant change to the
KVM dirty bitmap model. Right now it is GPA-indexed, and it would
become HVA-indexed.

The difference arises if the same memory is aliased at different parts
of the address space. Aliasing can happen if the 0xA0000 VRAM is a
window inside a larger framebuffer, for example; or it can happen across
separate address spaces. You cannot rule out that *some* userspace is
relying on the properties of a GPA-indexed dirty bitmap. Certainly not
QEMU, but we try to be userspace-agnostic.
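
To make the aliasing concrete, here is a sketch (the values and the
framebuf mapping are made up for illustration) of two slots backed by
the same host memory; a GPA-indexed bitmap keeps them apart, an
HVA-indexed one cannot:

  #include <stdint.h>
  #include <linux/kvm.h>

  static void alias_example(void *framebuf) /* hypothetical HVA of the VRAM */
  {
      struct kvm_userspace_memory_region window = {
          .slot            = 1,
          .guest_phys_addr = 0xa0000,             /* legacy VGA window   */
          .memory_size     = 0x10000,
          .userspace_addr  = (uintptr_t)framebuf,
      };
      struct kvm_userspace_memory_region bar = {
          .slot            = 2,
          .guest_phys_addr = 0xe0000000,          /* PCI BAR, same pages */
          .memory_size     = 0x10000,
          .userspace_addr  = (uintptr_t)framebuf, /* same HVA: an alias  */
      };
      /* A GPA-indexed bitmap has distinct bits for the two ranges; an
       * HVA-indexed one would merge them into a single set of bits. */
      (void)window; (void)bar;
  }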

>> On the
>> contrary, it's very easy if the VCPU can simply query its current memslots.
>
> Yeah, userspace drew the short stick.

Userspace is the slow path, so it makes sense that it did. :)

> I didn't want to solve the "multiple bitmaps for same userspace" because
> KVM_GET_DIRTY_LOG doesn't care about userspace memory dirtiness
> (though it's probably what we use it for), but only about the guest
> address space; KVM_GET_DIRTY_LOG just says which page was
> modified in a slot :/
> (If we wanted a dirty bitmap for userspace memory, it would be better to
> use the solution at the top.)

Userspace implements a dirty bitmap for userspace memory as the OR of the
KVM dirty bitmap with its own dirty bitmap. There's no particular
advantage to a GPA-indexed dirty bitmap; as you noticed, GPA and HVA
indexing can be equally fast because GPA->HVA mapping is just an add.
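
(For illustration, the OR is something like this sketch, with vm_fd,
the slot number and the bitmap sizes assumed to exist:)

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static void sync_dirty(int vm_fd, uint32_t slot,
                         uint64_t *kvm_bits,  /* scratch for KVM's bitmap */
                         uint64_t *my_bits,   /* userspace's own bitmap   */
                         size_t nwords)
  {
      struct kvm_dirty_log log = {
          .slot         = slot,
          .dirty_bitmap = kvm_bits,
      };

      if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0)
          return;
      for (size_t i = 0; i < nwords; i++)
          my_bits[i] |= kvm_bits[i];    /* OR KVM's view into ours */
  }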

It's just that the dirty bitmap has always been GPA-indexed, and it's
too late to change it.

>> A possible optimization is to set a flag when no bits are set in the
>> dirty bitmap, and skip the iteration. This is the common case for the
>> SMM memory space for example. But it can be done later, and is an
>> independent work.
>
> True. (I'm not a huge fan of optimizations through exceptions.)

It depends on how often the exception kicks in... :)
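
(Roughly what I had in mind, with hypothetical names; the point is
only that a single flag makes the all-clean case O(1):)

  #include <stdbool.h>
  #include <stddef.h>
  #include <string.h>

  #define BITS_PER_LONG (8 * sizeof(unsigned long))

  struct dirty_map {
      unsigned long *bits;
      size_t nlongs;
      bool any_set;                  /* true iff some bit may be set */
  };

  static void mark_dirty(struct dirty_map *m, size_t n)
  {
      m->bits[n / BITS_PER_LONG] |= 1UL << (n % BITS_PER_LONG);
      m->any_set = true;
  }

  static bool harvest_dirty(struct dirty_map *m, unsigned long *out)
  {
      if (!m->any_set)
          return false;              /* common case, e.g. the SMM space */
      memcpy(out, m->bits, m->nlongs * sizeof(unsigned long));
      memset(m->bits, 0, m->nlongs * sizeof(unsigned long));
      m->any_set = false;
      return true;
  }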

>> Keeping slots separate for different address spaces also makes the most
>> sense in QEMU, because each address space has a separate MemoryListener
>> that only knows about one address space. There is one log_sync callback
>> per address space and no code has to know about all address spaces.
>
> Ah, I would have thought this is done per slot, when there was no
> concept of address space in KVM before this patch :)

Right; the callback is _called_ per slot, but _registered_ per address
space. Remember that QEMU does have address spaces.
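
In QEMU terms, roughly (a simplified sketch; the real wiring in
kvm-all.c is more involved):

  #include "exec/memory.h"

  static void kvm_log_sync(MemoryListener *listener,
                           MemoryRegionSection *section)
  {
      /* invoked once per section (i.e. per slot) of one address space */
  }

  static MemoryListener kvm_listener = {
      .log_sync = kvm_log_sync,
  };

  static void listen_to(AddressSpace *as)
  {
      /* one listener registration per address space */
      memory_listener_register(&kvm_listener, as);
  }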

>> The regular and SMM address spaces are not hierarchical. As soon as you
>> put a PCI resource underneath SMRAM---which is exactly what happens for
>> legacy VRAM at 0xa0000---they can be completely different. Note that
>> QEMU can map legacy VRAM as a KVM memslot when using the VGA 320x200x256
>> color mode (this mapping is not correct from the VGA point of view, but
>> it cannot be changed in QEMU without breaking migration).
>
> How is a PCI resource under SMRAM accessed?
> I thought that outside SMM, PCI resource under SMRAM is working
> normally, but it will be overshadowed, and made inaccessible, in SMM.

Yes, it is. (There is some chipset magic to make instruction fetches
retrieve SMRAM and data fetches retrieve PCI resources. I guess you
could use execute-only EPT permissions, but needless to say, we don't care.)

> I'm not sure if we mean the same hierarchy. I meant hierarchy in the
> sense that one address space is considered before the other.
> (Maybe layers would be a better word.)
> SMM address space could have just one slot and be above regular, we'd
> then decide how to handle overlapping.

Ah, now I understand. That would be doable.

But as they say, "All programming is an exercise in caching." In this
case, the caching is done by userspace.

QEMU implements the SMM address space exactly by overlaying SMRAM over
normal memory:

  /* Outer container */
  memory_region_init(&smm_as_root, OBJECT(kvm_state),
                     "mem-container-smm", ~0ull);
  memory_region_set_enabled(&smm_as_root, true);

  /* Normal memory with priority 0 */
  memory_region_init_alias(&smm_as_mem, OBJECT(kvm_state), "mem-smm",
                           get_system_memory(), 0, ~0ull);
  memory_region_add_subregion_overlap(&smm_as_root, 0, &smm_as_mem, 0);
  memory_region_set_enabled(&smm_as_mem, true);

  /* SMRAM with priority 10 */
  memory_region_add_subregion_overlap(&smm_as_root, 0, smram, 10);
  memory_region_set_enabled(smram, true);

The caching consists simply in resolving the overlaps beforehand, thus
giving KVM the complete address space.

Since slots do not change often, the simpler code is not worth the
potentially more expensive KVM_SET_USER_MEMORY_REGION (it _is_ more
expensive, if only because it has to be called twice per slot change).
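
(What "twice" looks like from userspace, as a sketch with vm_fd,
old_region and new_region assumed:)

  struct kvm_userspace_memory_region region = old_region;

  region.memory_size = 0;                        /* call 1: delete */
  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

  region = new_region;                           /* call 2: re-create */
  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);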

>> What I do dislike in the API is the 16-bit split of the slots field; but
>> the alternative of defining KVM_SET_USER_MEMORY_REGION2 and
>> KVM_GET_DIRTY_LOG2 ioctls is just as sad. If you prefer it that way, I
>> can certainly do that.
>
> No, what we have now is much better. I'd like to know why the taken
> route is better than the one I'd prefer, which probably makes me look
> like I have a problem with everything, but your solution is neat.

No problem, it's good because otherwise I take too much advantage of,
well, being able to commit to kvm.git. :)
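
(For the record, the split in practice; AS_ID_SHIFT is a hypothetical
helper, the layout itself is the one from the patch: bits 0-15 are
the slot id, bits 16-31 the address space id, with 1 = SMM on x86:)

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  #define AS_ID_SHIFT 16

  static int set_smm_slot(int vm_fd, void *hva)
  {
      struct kvm_userspace_memory_region region = {
          .slot            = (1u << AS_ID_SHIFT) | 3, /* SMM space, slot 3 */
          .guest_phys_addr = 0xa0000,
          .memory_size     = 0x10000,
          .userspace_addr  = (uintptr_t)hva,
      };
      return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }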

(Also, it's not entirely my solution. My solution was v1 and the
userspace patches showed that it was a mess and a dead end... Compared
to v1, this one requires much more care in the MMU. The same gfn can
have two completely different mappings and you have to use the right one
when e.g. accessing the rmap. But that's pretty much the only downside).
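
(Concretely, a fragment of the kind of care I mean: the rmap lookup
must go through the memslots of the vcpu's current address space,
using the per-vcpu helpers this series adds, roughly

  struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
  unsigned long *rmapp = __gfn_to_rmap(gfn, level, slot);

rather than gfn_to_memslot() on a single global memslot array.)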

Paolo