RE: [PATCH v2] clocksource: hyper-v: Fix hv tsc page based sched_clock for hibernation

From: Michael Kelley
Date: Thu Sep 12 2024 - 11:21:26 EST


From: Naman Jain <namjain@xxxxxxxxxxxxxxxxxxx> Sent: Wednesday, September 11, 2024 9:51 PM
>
> On 9/12/2024 9:09 AM, Michael Kelley wrote:
> > From: Naman Jain <namjain@xxxxxxxxxxxxxxxxxxx> Sent: Tuesday, September 10,
> 2024 9:57 PM
> >>
> >
> > This version of the patch looks good to me from the standpoint of
> > separating the x86 specific functionality from the arch independent
> > functionality. And I think the patch works as intended. But there
> > are parts of the description and variable naming that don't align
> > with my understanding of the problem and the fix. So I've added
> > some additional comments below.
> >
> > Nit: Now that most of the code changes are in mshyperv.c, the
> > patch Subject: prefix should perhaps be "x86/hyperv:" instead
> > of "clocksource: hyperv:".
>
> Thanks a lot for reviewing Michael. As you rightly pointed out, these
> comments and variable names made more sense when they were in
> hyperv_timer.c. I'll change them accordingly in next patch.
>
> Will change commit msg subject as well.
>
> >
> >> read_hv_sched_clock_tsc() assumes that the Hyper-V clock counter is
> >> bigger than the variable hv_sched_clock_offset, which is cached during
> >> early boot, but depending on the timing this assumption may be false
> >> when a hibernated VM starts again (the clock counter starts from 0
> >> again) and is resuming back (Note: hv_init_tsc_clocksource() is not
> >> called during hibernation/resume); consequently,
> >> read_hv_sched_clock_tsc() may return a negative integer (which is
> >> interpreted as a huge positive integer since the return type is u64)
> >> and new kernel messages are prefixed with huge timestamps before
> >> read_hv_sched_clock_tsc() grows big enough (which typically takes
> >> several seconds).
> >
> > Just so I'm clear on the sequence, when a new VM is created to
> > resume the hibernated VM, I think the following happens:
> >
> > 1) The VM being used to resume the hibernation image boots a
> > fresh instance of the Linux kernel. The sched clock and sched clock
> > offset value are initialized as with any kernel, and kernel messages
> > are printed with the correct timestamps starting at zero.
> >
> > 2) The new Linux kernel then loads the hibernation image and
> > transfers control to it, whereupon the "resume" callbacks are run
> > in the context of the hibernation image. At this point, any kernel
> > timestamps are wrong, and might even be negative, because the
> > sched clock value is calculated based on the new Hyper-V reference
> > time (which started again at zero) minus the old sched clock offset.
> > The goal is that the sched clock value should be continuous with
> > the sched clock value from the original VM. If the original VM
> > had been running for 1000 seconds when the hibernation was
> > done, the sched clock value in the resumed hibernation image
> > should continue, starting at ~1000 seconds.
> >
> > 3) The fix is to adjust the sched clock offset in the resumed
> > hibernation image, and make it more negative by that ~1000
> > seconds.
> >
> > Is that all correct? If so, then it seems like this patch is doing
> > more than just cleaning up the negative values for sched clock.
> > It's also making the sched clock values continuous with the
> > sched clock values in the original VM rather than restarting
> > near zero after hibernation image is resumed.
> >
>
> Yes, that's exactly what this patch is trying to do. There was an option
> to correct in suspend-resume callbacks of original VM in hyperv_timer.c,
> but these are executed very late, and we still end up getting many logs
> with these incorrect timestamps. We took reference from the code where
> tsc clock correction takes place, and thought that similar should be
> done here.

Yes, agreed. I'm glad the mechanism for the TSC clock correction
is available to use. :-)

>
> We can tweak the commit msg to add this additional detail.

Thanks. I think the change to make the sched clock time (and
therefore the dmesg log timestamps) continuous with the
original VM is important to call out. It's a change that will be
visible to users.

[snip]

> >> +/*
> >> + * Hyper-V clock counter resets during hibernation. Save and restore clock
> >> + * offset during suspend/resume, while also considering the time passed
> >> + * before suspend. This is to make sure that sched_clock using hv tsc page
> >> + * based clocksource, proceeds from where it left off during suspend and
> >> + * it shows correct time for the timestamps of kernel messages after resume.
> >> + */
>
> I added it here, but the same should be added in commit msg as well.
> I'll add it.

Ah, indeed, you did add it in the comment. But yes, it should go in
the commit message as well.

>
> >> +static void save_hv_clock_tsc_state(void)
> >> +{
> >> + hv_sched_clock_offset_saved = hv_read_reference_counter();
> >
> > Naming this variable hv_sched_clock_offset_saved doesn't seem to match
> > what it actually contains. The saved value is not a sched_clock_offset. It's
> > the value of the Hyper-V reference counter at the time the original VM
> > hibernates does "suspend". The sched_clock_offset in the original VM will
> > typically be a pretty small value (a few seconds or even less). But the
> > Hyper-V reference counter value might be thousands of seconds if the
> > VM has been running a while before it hibernates.
>
> I'll change it to something that conveys the right information. Thanks
> for the suggestion.
>
> >
> >> +}
> >> +
> >> +static void restore_hv_clock_tsc_state(void)
> >> +{
> >> + /*
> >> + * hv_sched_clock_offset = offset that is used by hyperv_timer clocksource driver
> >> + * to get time.
> >> + * Time passed before suspend = hv_sched_clock_offset_saved
> >> + * - hv_sched_clock_offset (old)
> >> + *
> >> + * After Hyper-V clock counter resets, hv_sched_clock_offset needs a correction.
> >> + *
> >> + * New time = hv_read_reference_counter() (future) - hv_sched_clock_offset (new)
> >> + * New time = Time passed before suspend + hv_read_reference_counter() (future)
> >> + * - hv_read_reference_counter() (now)
> >> + *
> >> + * Solving the above two equations gives:
> >> + *
> >> + * hv_sched_clock_offset (new) = hv_sched_clock_offset (old)
> >> + * - hv_sched_clock_offset_saved
> >> + * + hv_read_reference_counter() (now))
> >> + */
> >> + hv_adj_sched_clock_offset(hv_sched_clock_offset_saved - hv_read_reference_counter());
> >
> > The argument passed to hv_adj_sched_clock_offset() makes sense to me if I think
> > of it as:
> >
> > hv_ref_time_at_hibernate - hv_read_reference_counter()
> >
> > where hv_read_reference_counter() is just "ref time now".
> >
> > I think of it like this: The Hyper-V reference counter value changed underneath
> > the resumed hibernation image when it starts running in the new VM. The adjustment
> > changes the sched clock offset to compensate for that change so that sched clock
> > values are continuous across the suspend/resume hibernation sequence.
> >
> > I don't completely understand what you've explained with the two equations and
> > solving them, though the result matches my expectations.
>
> Yeah :) it made more sense when we look at it from hyperv_timer.c driver
> POV because these offsets are nothing but reference counters at
> different points of time.

Well, yes and no. The value of the Hyper-V reference counter is
an absolute value at the time it is read. But the hv_sched_clock_offset
is a "delta" value -- the difference between two absolute time values.
While hv_sched_clock_offset is initially set to the current value of the
Hyper-V reference counter, it is used as a delta value. And after
the adjustment is applied when resuming from hibernation,
hv_sched_clock_offset is definitely no longer a reference counter
value from some point in time -- it's clearly only a delta value.

> Having said that, I think we can go with a
> comment explaining the intention, and skip adding these equations which
> may be confusing here as there is no concept of offsets here, as you
> rightly pointed out in your previous reply as well.

Works for me.

Thanks,

Michael