Re: pvclock time drifting backward
From: Ming Lin
Date: Fri Mar 28 2025 - 14:30:59 EST
On Thu, Mar 27, 2025 at 1:10 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>
> On Wed, 2025-03-26 at 08:54 -0700, Ming Lin wrote:
> > I applied the patch series on top of 6.9 cleanly and tested it with my
> > debug tool patch.
> > But it seems the time drift still increased monotonically.
> >
> > Would you help take a look if the tool patch makes sense?
> > https://github.com/minggr/linux/commit/5284a211b6bdc9f9041b669539558a6a858e88d0
> >
> > The tool patch adds a KVM debugfs entry to trigger time calculations
> > and print the results.
> > See my first email for more detail.
>
> Your first message seemed to say that the problem occurred with live
> migration. This message says "the time drift still increased
> monotonically".
Yes, we discovered this issue in our production environment, where time
inside the guest OS fell behind by more than 2 seconds. The problem
occurred both during local live upgrades and during remote live migrations.
However, the issue only becomes noticeable after the guest OS has been
running for a long time, typically more than 30 days.
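For scale: ~2 seconds lost over ~30 days corresponds to a scaling error of
only about 0.77 ppm between the guest's kvmclock and the host clock. A quick
back-of-envelope check (my own numbers, purely illustrative):

/* Back-of-envelope only: how large a scaling error is needed for
 * ~2s of drift over ~30 days of uptime?
 */
#include <stdio.h>

int main(void)
{
	double drift_s  = 2.0;             /* observed guest slowdown */
	double uptime_s = 30.0 * 86400.0;  /* ~30 days of uptime      */
	double ppm      = drift_s / uptime_s * 1e6;

	printf("implied scaling error: %.2f ppm\n", ppm);      /* ~0.77 */
	printf("on a 3 GHz TSC: ~%.0f Hz\n", 3e9 * ppm / 1e6); /* ~2315 */
	return 0;
}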
Since 30 days is too long to wait, I wrote a debugfs tool to quickly reproduce
the original issue, but now I'm not sure if the tool is working correctly.
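To spell out what the tool is trying to do, here is a simplified,
illustrative sketch only (field and function names are from my reading of
the code; the real implementation is in the commit linked above):

/* Illustrative sketch of the debugfs check, not the actual tool code.
 * Recompute what the guest's kvmclock would return right now from the
 * cached master_kernel_ns/master_cycle_now, and compare it against the
 * host's CLOCK_MONOTONIC_RAW.
 */
static int kvmclock_drift_show(struct seq_file *m, void *v)
{
	struct kvm *kvm = m->private;
	struct kvm_vcpu *vcpu = kvm_get_vcpu(kvm, 0);
	struct kvm_arch *ka = &kvm->arch;
	u64 tsc_delta, guest_ns, host_ns;

	if (!vcpu)
		return -ENOENT;

	tsc_delta = rdtsc() - ka->master_cycle_now;

	/* kvmclock = master_kernel_ns + TSC delta scaled by the guest's
	 * pvclock multiplier/shift */
	guest_ns = ka->master_kernel_ns +
		   pvclock_scale_delta(tsc_delta,
				       vcpu->arch.hv_clock.tsc_to_system_mul,
				       vcpu->arch.hv_clock.tsc_shift);

	host_ns = ktime_get_raw_ns();	/* CLOCK_MONOTONIC_RAW */

	seq_printf(m, "drift: %lld ns\n", (s64)(host_ns - guest_ns));
	return 0;
}
DEFINE_SHOW_ATTRIBUTE(kvmclock_drift);

The file would be registered with debugfs_create_file() under the per-VM
debugfs directory. The actual tool may compute things differently, which is
exactly what I'd like a second pair of eyes on.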
>
> Trying to make sure I fully understand... the time drift between the
> host's CLOCK_MONOTONIC_RAW and the guest's kvmclock increases
> monotonically *but* the guest only observes the change when its
> master_kernel_ns/master_cycle_now are updated (e.g. on live migration)
> and its kvmclock is reset back to the host's CLOCK_MONOTONIC_RAW?
Yes, we are using the 5.4 kernel and have verified that the guest OS time
remains correct after live upgrades/migrations, as long as
master_kernel_ns / master_cycle_now are not updated
(i.e., if the old master_kernel_ns / master_cycle_now values are retained).
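In other words, my mental model is roughly the following toy code (not the
real KVM code; the globals stand in for the per-VM master clock state and
the guest's pvclock scaling parameters): as long as the anchor stays frozen,
the guest timeline is continuous even if its effective frequency is a
fraction of a ppm off, and refreshing the anchor exposes the accumulated
error as a step.

/* Toy model only -- not the real KVM code. */
static u64 master_kernel_ns, master_cycle_now;
static u32 tsc_to_system_mul;
static s8  tsc_shift;

static u64 kvmclock_read(u64 tsc)
{
	/* continuous as long as master_* stay frozen, even if the
	 * effective frequency is slightly off */
	return master_kernel_ns +
	       pvclock_scale_delta(tsc - master_cycle_now,
				   tsc_to_system_mul, tsc_shift);
}

static void master_clock_refresh(void)	/* e.g. on live migration */
{
	/* re-anchor to the host clock: the difference accumulated
	 * between the two timelines shows up as a step backward */
	master_kernel_ns = ktime_get_raw_ns();
	master_cycle_now = rdtsc();
}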
>
> Is this live migration from one VMM to another on the same host, so we
> don't have to worry about the accuracy of the TSC itself? The guest TSC
> remains consistent? And presumably your host does *have* a stable TSC,
> and the guest's test case really ought to be checking the
> PVCLOCK_TSC_STABLE_BIT to make sure of that?
The live migration is from one VMM to another on a remote host, and we
have also observed the same issue during live upgrades on the same host.
>
> If all the above assumptions/interpretations of mine are true, I still
> think it's expected that your clock will jump on live migration
> *unless* you also taught your VMM to use the new KVM_[GS]ET_CLOCK_GUEST
> ioctls which were added in my patch series, specifically to preserve
> the mathematical relationship between guest TSC and kvmclock across a
> migration.
>
We plan to test the patches on a 6.9 kernel (where they apply cleanly) and
to modify our live upgrade/migration code to use the new
KVM_[GS]ET_CLOCK_GUEST ioctls.
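Roughly what I expect the VMM change to look like (sketch only; I'm assuming
the new ioctls exchange the guest-visible pvclock data on the vCPU fd, and
I'll confirm the details against the patches):

/* VMM-side sketch only. Assumption from my reading of the series: the
 * ioctl numbers and the pvclock_vcpu_time_info definition come from the
 * patched kernel headers, not current mainline, and I may have the
 * VM-vs-vCPU fd detail wrong.
 */
#include <sys/ioctl.h>
#include <linux/kvm.h>	/* from a tree with the series applied */

static int save_guest_clock(int vcpu_fd, struct pvclock_vcpu_time_info *pv)
{
	/* source: capture the guest-visible TSC->kvmclock relationship
	 * while the vCPUs are paused */
	return ioctl(vcpu_fd, KVM_GET_CLOCK_GUEST, pv);
}

static int restore_guest_clock(int vcpu_fd, struct pvclock_vcpu_time_info *pv)
{
	/* destination: restore the same relationship after the TSC state
	 * has been set, instead of re-deriving kvmclock from the host
	 * clock (which is what makes the accumulated drift visible today) */
	return ioctl(vcpu_fd, KVM_SET_CLOCK_GUEST, pv);
}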
BTW, what is the plan for upstreaming these patches?
Thanks,
Ming