Re: [PATCH 2/3] kvm: fix potentially corrupt mmio cache

From: David Matlack
Date: Tue Sep 02 2014 - 12:44:48 EST


On Fri, Aug 29, 2014 at 3:31 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> From: David Matlack <dmatlack@xxxxxxxxxx>
>
> vcpu exits and memslot mutations can run concurrently as long as the
> vcpu does not acquire the slots mutex. Thus it is theoretically possible
> for memslots to change underneath a vcpu that is handling an exit.
>
> If we increment the memslot generation number again after
> synchronize_srcu_expedited(), vcpus can safely cache the memslot
> generation without maintaining a single rcu_dereference through an
> entire vm exit. This matters because much of the x86/kvm code does
> not maintain a single rcu_dereference of the current memslots during
> each exit.
>
> We can prevent the following case:
>
> vcpu (CPU 0)                                | thread (CPU 1)
> --------------------------------------------+--------------------------
> 1  vm exit                                  |
> 2  srcu_read_unlock(&kvm->srcu)             |
> 3  decide to cache something based on       |
>      old memslots                           |
> 4                                           | change memslots
>                                             | (increments generation)
> 5                                           | synchronize_srcu(&kvm->srcu);
> 6  retrieve generation # from new memslots  |
> 7  tag cache with new memslot generation    |
> 8  srcu_read_unlock(&kvm->srcu)             |
> ...                                         |
>    <action based on cache occurs even       |
>     though the caching decision was based   |
>     on the old memslots>                    |
> ...                                         |
>    <action *continues* to occur until next  |
>     memslot generation change, which may    |
>     be never>                               |
>                                             |
>
> By incrementing the generation after synchronizing with kvm->srcu readers,
> we ensure that the generation retrieved in (6) will become invalid soon
> after (8).
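
(To make the effect of the second increment concrete, here is a tiny
standalone model; it is not KVM code, and the generation values and
variable names are made up purely for illustration. With only one
increment, a tag taken from the new memslots at step 6 would match the
current generation forever; with the second increment it goes stale as
soon as the update finishes.)

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t generation = 4;	/* even: no update in progress */
	uint32_t cached_tag;

	/* Step 4: new memslots are published with the generation bumped. */
	generation = 5;

	/* Steps 6-7: a racing vcpu tags its cache with whatever generation
	 * it happens to read from the new memslots. */
	cached_tag = generation;

	/* Without the second increment the update would stop here and the
	 * stale tag would keep matching the current generation forever. */
	assert(cached_tag == generation);

	/* With this patch the generation is bumped once more after
	 * synchronize_srcu_expedited() returns ... */
	generation = 6;

	/* ... so the tag created from stale data mismatches right away. */
	assert(cached_tag != generation);
	return 0;
}
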
>
> Keeping the existing increment is not strictly necessary, but we keep
> it and simply move it from update_memslots to install_new_memslots,
> for consistency. Doing that first increment before publishing the new
> memslots invalidates old cached MMIOs immediately, instead of having
> to wait until synchronize_srcu_expedited finishes, which makes the
> code more clearly correct in case CPU 1 is preempted right after
> synchronize_srcu() returns.
>
> To avoid halving the generation space in SPTEs, always presume that the
> low bit of the generation is zero when reconstructing a generation number
> out of an SPTE. This effectively disables MMIO caching in SPTEs during
> the call to synchronize_srcu_expedited. Using the low bit this way is
> somewhat like a seqcount---where the protected thing is a cache, and
> instead of retrying we can simply punt if we observe the low bit to be 1.
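
(And here is the seqcount-like part in the same toy style: because the
low bit is never stored, an entry created while the generation is odd
decodes back to an even number and therefore never matches, neither
during the update nor afterwards. Again, everything below is made up
for illustration and is not the KVM code.)

#include <assert.h>
#include <stdint.h>

#define GEN_MASK	((1u << 20) - 1)

/* Store the generation with its low bit dropped, as this patch does
 * for MMIO sptes. */
static uint32_t tag_encode(uint32_t gen)
{
	return gen & GEN_MASK & ~1u;
}

static int tag_matches(uint32_t tag, uint32_t current_gen)
{
	return tag == (current_gen & GEN_MASK);
}

int main(void)
{
	uint32_t gen = 100;			/* even: no update running */
	uint32_t tag = tag_encode(gen);

	assert(tag_matches(tag, gen));		/* normal case: cache hit */

	gen++;					/* update in progress: odd */
	tag = tag_encode(gen);			/* low bit is dropped ...  */
	assert(!tag_matches(tag, gen));		/* ... entry never matches */

	gen++;					/* update done: even again */
	assert(!tag_matches(tag, gen));		/* stale entry still misses */
	return 0;
}
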
>
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> ---
> Documentation/virtual/kvm/mmu.txt | 14 ++++++++++++++
> arch/x86/kvm/mmu.c                | 20 ++++++++++++--------
> virt/kvm/kvm_main.c               | 23 ++++++++++++++++-------
> 3 files changed, 42 insertions(+), 15 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index 290894176142..53838d9c6295 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -425,6 +425,20 @@ fault through the slow path.
> Since only 19 bits are used to store generation-number on mmio spte, all
> pages are zapped when there is an overflow.
>
> +Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
> +times, the last one happening when the generation number is retrieved and
> +stored into the MMIO spte. Thus, the MMIO spte might be created based on
> +out-of-date information, but with an up-to-date generation number.
> +
> +To avoid this, the generation number is incremented again after synchronize_srcu
> +returns; thus, the low bit of kvm_memslots(kvm)->generation is only 1 during a
> +memslot update, while some SRCU readers might be using the old copy. We do not
> +want to use MMIO sptes created with an odd generation number, and we can do
> +this without losing a bit in the MMIO spte. The low bit of the generation
> +is not stored in MMIO spte, and presumed zero when it is extracted out of the
> +spte. If KVM is unlucky and creates an MMIO spte while the low bit is 1,
> +the next access to the spte will always be a cache miss.
> +
>
> Further reading
> ===============
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 323c3f5f5c84..96515957ba82 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -199,16 +199,20 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
> EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>
> /*
> - * spte bits of bit 3 ~ bit 11 are used as low 9 bits of generation number,
> - * the bits of bits 52 ~ bit 61 are used as high 10 bits of generation
> - * number.
> + * the low bit of the generation number is always presumed to be zero.
> + * This disables mmio caching during memslot updates. The concept is
> + * similar to a seqcount but instead of retrying the access we just punt
> + * and ignore the cache.
> + *
> + * spte bits 3-11 are used as bits 1-9 of the generation number,
> + * the bits 52-61 are used as bits 10-19 of the generation number.
> */
> -#define MMIO_SPTE_GEN_LOW_SHIFT 3
> +#define MMIO_SPTE_GEN_LOW_SHIFT 2
> #define MMIO_SPTE_GEN_HIGH_SHIFT 52
>
> -#define MMIO_GEN_SHIFT 19
> -#define MMIO_GEN_LOW_SHIFT 9
> -#define MMIO_GEN_LOW_MASK ((1 << MMIO_GEN_LOW_SHIFT) - 1)
> +#define MMIO_GEN_SHIFT 20
> +#define MMIO_GEN_LOW_SHIFT 10
> +#define MMIO_GEN_LOW_MASK ((1 << MMIO_GEN_LOW_SHIFT) - 2)
> #define MMIO_GEN_MASK ((1 << MMIO_GEN_SHIFT) - 1)
> #define MMIO_MAX_GEN ((1 << MMIO_GEN_SHIFT) - 1)
>
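
(For anyone staring at the new constants: with MMIO_SPTE_GEN_LOW_SHIFT
at 2 and a low mask ending in ...1110, generation bits 1-9 land in spte
bits 3-11 and bits 10-19 in spte bits 52-61, while bit 0 is simply
never stored. The standalone sketch below only mirrors that arithmetic,
roughly what generation_mmio_spte_mask()/get_mmio_spte_generation() end
up doing, minus the shadow_mmio_mask handling, and checks that the
round trip always comes back with the low bit cleared.)

#include <assert.h>
#include <stdint.h>

#define MMIO_SPTE_GEN_LOW_SHIFT		2
#define MMIO_SPTE_GEN_HIGH_SHIFT	52
#define MMIO_GEN_LOW_SHIFT		10
#define MMIO_GEN_LOW_MASK		((1u << MMIO_GEN_LOW_SHIFT) - 2)
#define MMIO_GEN_MASK			((1u << 20) - 1)

static uint64_t gen_to_spte_bits(uint32_t gen)
{
	uint64_t mask;

	/* gen bits 1-9 -> spte bits 3-11, gen bits 10-19 -> spte bits
	 * 52-61; gen bit 0 is not stored anywhere. */
	mask  = (uint64_t)(gen & MMIO_GEN_LOW_MASK) << MMIO_SPTE_GEN_LOW_SHIFT;
	mask |= (uint64_t)(gen >> MMIO_GEN_LOW_SHIFT) << MMIO_SPTE_GEN_HIGH_SHIFT;
	return mask;
}

static uint32_t spte_bits_to_gen(uint64_t spte)
{
	uint32_t gen;

	gen  = (spte >> MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_GEN_LOW_MASK;
	gen |= (uint32_t)(spte >> MMIO_SPTE_GEN_HIGH_SHIFT) << MMIO_GEN_LOW_SHIFT;
	return gen;
}

int main(void)
{
	uint32_t gen;

	for (gen = 0; gen <= MMIO_GEN_MASK; gen++)
		/* bit 0 is always presumed zero on the way back out */
		assert(spte_bits_to_gen(gen_to_spte_bits(gen)) == (gen & ~1u));
	return 0;
}
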
> @@ -4428,7 +4432,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
> * The very rare case: if the generation-number is round,
> * zap all shadow pages.
> */
> - if (unlikely(kvm_current_mmio_generation(kvm) >= MMIO_MAX_GEN)) {
> + if (unlikely(kvm_current_mmio_generation(kvm) == 0)) {

This should be in patch 1/3.

> printk_ratelimited(KERN_INFO "kvm: zapping shadow pages for mmio generation wraparound\n");
> kvm_mmu_invalidate_zap_all_pages(kvm);
> }
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0bfdb673db26..bb8641b5d83b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -95,8 +95,6 @@ static int hardware_enable_all(void);
> static void hardware_disable_all(void);
>
> static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
> -static void update_memslots(struct kvm_memslots *slots,
> - struct kvm_memory_slot *new, u64 last_generation);
>
> static void kvm_release_pfn_dirty(pfn_t pfn);
> static void mark_page_dirty_in_slot(struct kvm *kvm,
> @@ -695,8 +693,7 @@ static void sort_memslots(struct kvm_memslots *slots)
> }
>
> static void update_memslots(struct kvm_memslots *slots,
> - struct kvm_memory_slot *new,
> - u64 last_generation)
> + struct kvm_memory_slot *new)
> {
> if (new) {
> int id = new->id;
> @@ -707,8 +704,6 @@ static void update_memslots(struct kvm_memslots *slots,
> if (new->npages != npages)
> sort_memslots(slots);
> }
> -
> - slots->generation = last_generation + 1;
> }
>
> static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
> @@ -730,10 +725,24 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
> {
> struct kvm_memslots *old_memslots = kvm->memslots;
>
> - update_memslots(slots, new, kvm->memslots->generation);
> + /*
> + * Set the low bit in the generation, which disables SPTE caching
> + * until the end of synchronize_srcu_expedited.
> + */
> + WARN_ON(old_memslots->generation & 1);
> + slots->generation = old_memslots->generation + 1;
> +
> + update_memslots(slots, new);
> rcu_assign_pointer(kvm->memslots, slots);
> synchronize_srcu_expedited(&kvm->srcu);
>
> + /*
> + * Increment the new memslot generation a second time. This prevents
> + * vm exits that race with memslot updates from caching a memslot
> + * generation that will (potentially) be valid forever.
> + */
> + slots->generation++;
> +
> kvm_arch_memslots_updated(kvm);
>
> return old_memslots;
> --
> 1.8.3.1
>
>
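One more writer-side note, mostly as a sanity check of the invariant
the WARN_ON above is guarding: outside of install_new_memslots the
generation is always even, it is odd exactly while an update is in
flight, and every update advances it by two. A toy model of just that
sequence (made-up names, no SRCU, not the kernel code):

#include <assert.h>
#include <stdint.h>

struct toy_slots {
	uint32_t generation;
};

static void toy_install(struct toy_slots *old, struct toy_slots *next)
{
	assert(!(old->generation & 1));		/* mirrors the WARN_ON */
	next->generation = old->generation + 1;	/* odd: update in flight */

	/* ... publish "next", then wait for readers, as
	 * synchronize_srcu_expedited does in the real code ... */

	next->generation++;			/* even again: old + 2 */
}

int main(void)
{
	struct toy_slots a = { .generation = 8 }, b, c;

	toy_install(&a, &b);
	assert(b.generation == 10);
	toy_install(&b, &c);
	assert(c.generation == 12);
	return 0;
}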