Re: [PATCH 1/2] KVM: fix cache stale memslot info with correct mmio generation number

From: Paolo Bonzini
Date: Mon Aug 18 2014 - 14:47:49 EST


On 18/08/2014 18:35, Xiao Guangrong wrote:
>
> Hi Paolo,
>
> Thank you for reviewing the patch!
>
> On Aug 18, 2014, at 9:57 PM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
>> On 14/08/2014 09:01, Xiao Guangrong wrote:
>>> -	update_memslots(slots, new, kvm->memslots->generation);
>>> +	/* ensure generation number is always increased. */
>>> +	slots->generation = old_memslots->generation;
>>> +	update_memslots(slots, new);
>>>  	rcu_assign_pointer(kvm->memslots, slots);
>>>  	synchronize_srcu_expedited(&kvm->srcu);
>>> +	slots->generation++;
>>
>> I don't trust my brain enough to review this patch.
>
> Sorry for confusing you. I should have explained it more clearly.

Don't worry, it's not your fault. :)

>> kvm_current_mmio_generation seems like a very bad (race-prone) API. One
>> patch I trust myself reviewing would change a bunch of functions in
>> kvm_main.c to take a memslots struct. This would make it easy to
>> respect the hard and fast rule of not dereferencing the same pointer
>> twice. But it would be a tedious change.
>
> kvm_set_memory_region is the only place that updates the memslots, and
> kvm_current_mmio_generation accesses them through rcu_dereference, so
> I do not see why other places need to be taken into account.

The race occurs because gfn_to_pfn_many_atomic or some other function
has already used kvm_memslots(). Calling kvm_memslots() twice is the
root cause of the bug.
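
To illustrate the double dereference, here is a small userspace model,
not kernel code. Every model_* name, the "between" hook and the field
layout are made up for illustration; the hook stands in for
kvm_set_memory_region publishing a new memslots pointer in the middle of
a lookup, and there is no real RCU or concurrency involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kvm_memslots. */
struct model_slots {
	uint64_t generation;
	int nslots;
};

/* Hypothetical stand-in for struct kvm; slots plays kvm->memslots. */
struct model_vm {
	struct model_slots *slots;
};

/* What one lookup observed: slot data and the generation it was tagged with. */
struct model_lookup {
	int nslots_seen;
	uint64_t generation_seen;
};

struct model_slots model_old = { .generation = 0, .nslots = 1 };
struct model_slots model_new = { .generation = 2, .nslots = 2 };

/* Simulates the writer swapping in the new memslots pointer. */
void model_publish(struct model_vm *vm)
{
	vm->slots = &model_new;
}

/* Racy pattern: the shared pointer is dereferenced twice. */
struct model_lookup model_lookup_twice(struct model_vm *vm,
				       void (*between)(struct model_vm *))
{
	struct model_lookup r;

	assert(vm != NULL && vm->slots != NULL);
	r.nslots_seen = vm->slots->nslots;		/* first dereference */
	if (between)
		between(vm);				/* writer runs here */
	r.generation_seen = vm->slots->generation;	/* second dereference */
	return r;
}

/* Safe pattern: take one snapshot and use it for both reads. */
struct model_lookup model_lookup_once(struct model_vm *vm,
				      void (*between)(struct model_vm *))
{
	struct model_lookup r;
	struct model_slots *snap;

	assert(vm != NULL && vm->slots != NULL);
	snap = vm->slots;				/* single dereference */
	r.nslots_seen = snap->nslots;
	if (between)
		between(vm);
	r.generation_seen = snap->generation;
	return r;
}
```

With model_publish as the hook, model_lookup_twice pairs slot data from
model_old with the generation of model_new, while model_lookup_once
keeps both values from the same snapshot. Passing the memslots struct
down through the callers is the kernel-side equivalent of "snap" here.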

> I think this patch is auditable. A page fault is always handled while
> holding the srcu lock, so a page fault can’t go across
> synchronize_srcu_expedited. Only these cases can happen:
>
> 1) page fault occurs before synchronize_srcu_expedited.
> In this case, the vcpu will generate an mmio-exit for the memslot being
> registered by the ioctl. That’s ok since the ioctl has not finished.
>
> 2) page fault occurs after synchronize_srcu_expedited and while the
> generation-number is being increased.
> In this case, userspace may get a wrong mmio-exit (that happens if handling
> the page fault is slower than the ioctl). That’s ok too, since userspace
> needs to do the check anyway, like I said above.
>
> 3) page fault occurs after generation-number update
> that’s definitely correct. :)
>
>> Another alternative could be to use the low bit to mark an in-progress
>> change, and skip the caching if the low bit is set. Similar to a
>> seqcount (except that if read_seqcount_retry fails, we just punt and do
>> not retry anything), you could use it even though the memory barriers
>> provided by write_seqcount_begin/end are not too useful in this case.
>
> I do not understand how the bit would work: a page fault could cache the
> memslot before the bit is set and cache the generation-number after the
> bit is set.
>
> Maybe I missed your idea; could you please explain it in more detail?

Something like this:

-	update_memslots(slots, new, kvm->memslots->generation);
+	/* ensure generation number is always increased. */
+	slots->generation = old_memslots->generation + 1;
+	update_memslots(slots, new);
 	rcu_assign_pointer(kvm->memslots, slots);
 	synchronize_srcu_expedited(&kvm->srcu);
+	slots->generation++;

Then case 1 and 2 will just have a cache miss.

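The "low bit" is really just because each slot update does 2 generation
increases: starting from an even value, the generation is odd while an
update is in progress and even again once it has completed.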
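
The arithmetic can be checked with a tiny userspace model. This is only
a sketch under assumptions: the gen_* names and both structs are
hypothetical, the generation is assumed to start out even, and nothing
here models SRCU or real concurrency. It only shows that a cache entry
tagged during an update can never match the generation seen after the
update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kvm_memslots. */
struct gen_memslots {
	uint64_t generation;
};

/* Hypothetical stand-in for a cached MMIO entry tagged with a generation. */
struct gen_mmio_cache {
	bool valid;
	uint64_t generation;
};

/* First increase: the new copy starts one above the old value (odd). */
void gen_update_begin(struct gen_memslots *new_slots,
		      const struct gen_memslots *old_slots)
{
	/* Assumption: no update is already in progress. */
	assert((old_slots->generation & 1) == 0);
	new_slots->generation = old_slots->generation + 1;
}

/* Second increase: done after the grace period, makes the value even. */
void gen_update_end(struct gen_memslots *new_slots)
{
	new_slots->generation++;
}

/* The low bit is set exactly while an update is in progress. */
bool gen_update_in_progress(const struct gen_memslots *slots)
{
	return (slots->generation & 1) != 0;
}

/* Reader side: tag the cache entry with the generation it observed. */
void gen_cache_fill(struct gen_mmio_cache *cache,
		    const struct gen_memslots *slots)
{
	cache->generation = slots->generation;
	cache->valid = true;
}

/* A cached entry is usable only if its tag matches the current value. */
bool gen_cache_hit(const struct gen_mmio_cache *cache,
		   const struct gen_memslots *slots)
{
	return cache->valid && cache->generation == slots->generation;
}
```

Starting from generation 4, gen_update_begin gives 5 and gen_update_end
gives 6. An entry filled while the value was 5 fails gen_cache_hit once
the value is 6, which is the cache miss for cases 1 and 2.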

Paolo