Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
From: Andy Lutomirski
Date: Tue Jan 06 2015 - 16:54:35 EST
On Tue, Jan 6, 2015 at 12:20 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
>> > What is the point with the new flags bit though?
>>
>> To try to work around the problem on old hosts. I'm not at all
>> convinced that this is worthwhile or that it helps, though.
>
> Don't think so. Just fix the host bug.
>
>> >> Also, if you do this, can you also make setting and clearing
>> >> STABLE_BIT properly atomic across all vCPUs? Or at least do something
>> >> like setting it last and clearing it first on vPCU 0?
>> >
>> > If the version "seqlock" works properly across vCPUs, why do you need
>> > STABLE_BIT "properly atomic" ?
>> >
>> > Please define what you mean by "properly atomic".
>> >
>>
>> I'd like to be able to rely using vCPU 0's pvti even from other vCPUs
>> in the vdso if the stable bit is set. That means that the host should
>> avoid doing things like migrating the guest, clearing the stable bit
>> for vCPU 1, resuming vCPU 1, and waiting long enough to clear the
>> stable bit for vCPU 0 that vCPU 1's vdso code could see invalid data
>> and return a bad timestamp.
>>
>> Maybe this scenario is impossible, but getting rid of any getcpu-like
>> operation in the vdso has really nice benefits.
>
> You can park every vCPU in host while updating vCPU-0's timestamp.
>
> See kvm_gen_update_masterclock:
>
> + /* no guest entries from this point */
> + pvclock_update_vm_gtod_copy(kvm);
>
> - touch guest memory
>
> + /* guest entries allowed */
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);
>
Can we do that easily? It looks like we're holding a spinlock in
there. Could we make pvclock_gtod_sync_lock into a mutex?
We could also add something to explicitly prevent any of the guests
from entering until we're updated all of them, but that would hurt
performance even more. It would be kind of nice if we could avoid
serializing all CPUs entirely, though. For example, if we could
increment all the versions, then write all the pvtis, then increment
all the versions again from a single function, then everything is
atomic for a correctly behaving guest, but if the guest isn't actually
reading the time, then it doesn't stall.
(Also, we should really have a cpu_relax in all of these loops, IMO,
so that pause loop exiting can take effect.)
--Andy
>> It's faster and it
>> lets us guarantee that the vdso's pvti data fits in a single page.
>> The latter means that we can easily make it work like the hpet
>> mapping, which gets us 32-bit support and will *finally* let us turn
>> off user access to the fixmap if vsyscall=none.
>>
>> (We can, of course, still do this if the pvti data needs to be an
>> array, but it's messier.)
>>
>> --Andy
--
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/