Re: recalibrating x86 TSC during suspend/resume

From: Thomas Gleixner
Date: Fri Feb 22 2019 - 06:44:47 EST


On Fri, 22 Feb 2019, Olaf Hering wrote:
> Is there a way to recalibrate the x86 TSC during a suspend/resume cycle?

No.

> While the frequency will remain the same on a Laptop, it may (or rather:
> it definitely will) differ if a VM is migrated from one host to another.
> The hypervisor may choose to emulate the expected TSC frequency on the
> destination host, but this emulation comes with a significant
> performance cost. Therefore it would be good if the kernel evaluates the
> environment during resume.
>
> The specific use case I have is a workload within VMs that makes heavy
> use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
> because only this clocksource gives enough granularity. The default
> paravirtualized clock will return the same values via
> clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
> short. This does not happen with 'clocksource=tsc'.
>
> Right now it is not possible to migrate VMs to hosts with different CPU
> speeds. This leads to "islands" of identical hardware, and makes
> maintenance of hosts harder than it needs to be. If the VM kernel would
> be able to cope with CPU/TSC frequency changes, the pool of potential
> destination hosts will become significantly larger.

The problem with recalibrating TSC on resume is that it would have to be

1) quick

2) accurate, so NTP does not get utterly unhappy.

Newer Intels support TSC scaling for VMX, which could solve the problem. It
affects TSC readout by:

TSC = (read(HWTSC) * multiplier) >> 48

So you can standardize on a TSC frequency across a fleet. Not sure when
that was introduced and no idea whether it's available on AMD.
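
To illustrate the fixed-point arithmetic behind that formula, here is a
rough user-space sketch in C. The helper names (tsc_scale_multiplier,
scale_tsc) are made up for this example and are not kernel or hardware
interfaces. It assumes 48 fractional bits as in the formula above and
relies on the GCC/Clang unsigned __int128 extension for the wide
intermediate product.

```c
#include <assert.h>
#include <stdint.h>

#define TSC_SCALE_FRAC_BITS 48	/* fractional bits, as in the formula above */

/*
 * Hypothetical helper: fixed-point multiplier that makes a host running at
 * host_khz present a TSC ticking at guest_khz.
 * multiplier = (guest_khz / host_khz) * 2^48, rounded down.
 */
static uint64_t tsc_scale_multiplier(uint64_t guest_khz, uint64_t host_khz)
{
	assert(host_khz != 0);
	return (uint64_t)(((unsigned __int128)guest_khz << TSC_SCALE_FRAC_BITS)
			  / host_khz);
}

/*
 * Hypothetical helper: apply the scaling to a raw hardware TSC value.
 * The 64x64 bit product needs 128 bits before the shift.
 */
static uint64_t scale_tsc(uint64_t hw_tsc, uint64_t multiplier)
{
	return (uint64_t)(((unsigned __int128)hw_tsc * multiplier)
			  >> TSC_SCALE_FRAC_BITS);
}
```

With a guest frequency of half the host frequency the multiplier is
exactly 2^47, so a raw value of 1000 reads back as 500. For ratios that
are not exactly representable the multiplier is rounded down, so the
scaled value can come out one tick lower than the ideal result.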

For a software solution we could try the following:

1) Provide the raw TSC frequency of the host to the guest in some magic
software defined MSR or CPUID. If there is an existing mechanism, use
that.

2) On resume check whether the MSR/CPUID is available and if so read out
that information and check whether the frequency is the same as
before. If not, it is trivial enough to adjust the guest mult/shift
values for both raw and NTP adjusted clocks before they are used again,
i.e. before timekeeping_resume(). Need to look at what the best place is,
but probably the clocksource resume callback. Plus if TSC deadline
timer is used, we'd need the same adjustment there.

That's backward compatible, because if the MSR/CPUID is not there, then
the recalibration is not tried.
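
The resume-time check in step 2 could look roughly like the sketch below.
This is user-space C for illustration only, not kernel code: struct
tsc_clock, khz_to_mult() and tsc_clock_resume() are hypothetical names,
and host_khz stands for whatever value the (not yet existing) MSR/CPUID
would report, with 0 meaning "not available". The conversion follows the
usual clocksource convention ns = (cycles * mult) >> shift and keeps the
shift fixed.

```c
#include <stdint.h>

/* Hypothetical container for the guest's view of the TSC clocksource. */
struct tsc_clock {
	uint32_t khz;	/* frequency the current mult was computed for */
	uint32_t mult;	/* ns = (cycles * mult) >> shift */
	uint32_t shift;
};

/* mult = (10^6 << shift) / khz, rounded to nearest. */
static uint32_t khz_to_mult(uint32_t khz, uint32_t shift)
{
	uint64_t tmp = (uint64_t)1000000 << shift;

	tmp += khz / 2;
	return (uint32_t)(tmp / khz);
}

/*
 * Called on resume, before the clock is used again. host_khz is the
 * frequency reported by the assumed MSR/CPUID, or 0 if it is not there.
 * Returns 1 if mult was recomputed, 0 if nothing changed.
 */
static int tsc_clock_resume(struct tsc_clock *clk, uint32_t host_khz)
{
	if (host_khz == 0 || host_khz == clk->khz)
		return 0;	/* no information or same frequency */

	clk->khz = host_khz;
	clk->mult = khz_to_mult(host_khz, clk->shift);
	return 1;
}

/* Convert a cycle delta to nanoseconds; only valid for small deltas. */
static uint64_t cycles_to_ns(const struct tsc_clock *clk, uint64_t cycles)
{
	return (cycles * clk->mult) >> clk->shift;
}
```

The sketch leaves out the NTP adjusted mult and the TSC deadline timer,
both of which would need the same treatment as mentioned above.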

Whether that is accurate enough or not to make NTP happy, I can't tell, but
it's definitely worth a try.

Thanks,

tglx