On Tue, 2023-10-10 at 09:40 +0000, Paul Durrant wrote:
> From: Paul Durrant <pdurrant@xxxxxxxxxx>
> 
> Unless explicitly told to do so (by passing 'clocksource=tsc' and
> 'tsc=stable:socket', and then jumping through some hoops concerning
> potential CPU hotplug) Xen will never use TSC as its clocksource.
> Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set
> in either the primary or secondary pvclock memory areas. This has
> led to bugs in some guest kernels which only become evident if
> PVCLOCK_TSC_STABLE_BIT *is* set in the pvclock.

Specifically, some OL7 kernels backported the whole pvclock vDSO thing
but *forgot* https://git.kernel.org/torvalds/c/9f08890ab and thus kill
init with a SIGBUS the first time it tries to read a clock, because
they don't actually map the pvclock pages to userspace :)
They apparently never noticed because evidently *their* Xen fleet
doesn't actually jump through all those hoops to use the TSC as its
clocksource either.
It's a fairly safe bet that there are more broken guest kernels out
there too, hence the need to work around it.

> Hence, to support such guests, give the VMM a new attribute to tell
> KVM to forcibly clear the bit in the Xen pvclocks.

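
For concreteness, "forcibly clear the bit" amounts to something like the
sketch below. This is only a standalone illustration, not the code from
the patch: xen_pvclock_guest_flags() and its force_tsc_unstable
parameter are hypothetical names, and only PVCLOCK_TSC_STABLE_BIT
(bit 0 of the pvclock flags byte) is taken from the real pvclock ABI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of the pvclock 'flags' byte, as in the kernel's pvclock ABI. */
#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

/*
 * Hypothetical helper: compute the flags byte that would be written to
 * a Xen guest's pvclock area. When the VMM has asked for it, the
 * TSC-stable bit is masked out; all other flag bits pass through.
 */
static uint8_t xen_pvclock_guest_flags(uint8_t flags, bool force_tsc_unstable)
{
    if (force_tsc_unstable)
        flags &= (uint8_t)~PVCLOCK_TSC_STABLE_BIT;
    return flags;
}
```

The point is that only the one bit changes; a guest that never looks at
PVCLOCK_TSC_STABLE_BIT sees exactly what it saw before.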
I frowned at the "PVCLOCK" part of the new attribute for a while,
thinking that perhaps if we're going to have a set of flags to tweak
behaviour, we shouldn't be so specific. Call it 'XEN_FEATURES' or
something... but then I realised we'd want to *advertise* the set of
bits which is available for userspace to set...
... and then I realised we already do. That's exactly what the set of
bits returned, and *set*, with KVM_CAP_XEN_HVM is for.
So let's ditch the new *attribute*, and just add your new (renamed)
KVM_XEN_HVM_CONFIG_PVCLOCK_NO_STABLE_TSC cap to the set of
permitted_flags in kvm_xen_hvm_config() so that userspace can enable it
that way like it does the INTERCEPT_HYPERCALL and EVTCHN_SEND
behaviours.
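
Roughly, the check I have in mind looks like the sketch below. It is a
standalone model rather than the actual kvm_xen_hvm_config() body:
xen_hvm_config_check_flags() is a hypothetical name, the bit positions
are illustrative assumptions (the uapi header is authoritative), and
the flag I called INTERCEPT_HYPERCALL above is spelled
KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL here.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative bit positions only; check the uapi header for real values. */
#define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL       (1u << 1)
#define KVM_XEN_HVM_CONFIG_EVTCHN_SEND           (1u << 5)
#define KVM_XEN_HVM_CONFIG_PVCLOCK_NO_STABLE_TSC (1u << 7)

/*
 * Hypothetical model of the flag validation: userspace may set any
 * combination of the permitted behaviour flags, and anything outside
 * that set is rejected with -EINVAL.
 */
static int xen_hvm_config_check_flags(uint32_t flags)
{
    const uint32_t permitted_flags =
        KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL |
        KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
        KVM_XEN_HVM_CONFIG_PVCLOCK_NO_STABLE_TSC;

    if (flags & ~permitted_flags)
        return -EINVAL;

    return 0;
}
```

With that, the new behaviour is advertised and enabled through the same
flags word as the existing ones, and no separate attribute is needed.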