Re: [RFC PATCH v4] ptp: Add vDSO-style vmclock support

From: David Woodhouse
Date: Wed Jul 10 2024 - 12:01:57 EST


On Wed, 2024-07-10 at 15:07 +0200, Peter Hilber wrote:
> On 08.07.24 11:27, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> >
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> >
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> >
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by application which need reliable notification of
> > clock disruptions.
> >
> > Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
>
> [...]
>
> > +
> > +struct vmclock_abi {
> > +       /* CONSTANT FIELDS */
> > +       uint32_t magic;
> > +#define VMCLOCK_MAGIC  0x4b4c4356 /* "VCLK" */
> > +       uint32_t size;          /* Size of region containing this structure */
> > +       uint16_t version;       /* 1 */
> > +       uint8_t counter_id; /* Matches VIRTIO_RTC_COUNTER_xxx except INVALID */
> > +#define VMCLOCK_COUNTER_ARM_VCNT       0
> > +#define VMCLOCK_COUNTER_X86_TSC                1
> > +#define VMCLOCK_COUNTER_INVALID                0xff
> > +       uint8_t time_type; /* Matches VIRTIO_RTC_TYPE_xxx */
> > +#define VMCLOCK_TIME_UTC                       0       /* Since 1970-01-01 00:00:00z */
> > +#define VMCLOCK_TIME_TAI                       1       /* Since 1970-01-01 00:00:00z */
> > +#define VMCLOCK_TIME_MONOTONIC                 2       /* Since undefined epoch */
> > +#define VMCLOCK_TIME_INVALID_SMEARED           3       /* Not supported */
> > +#define VMCLOCK_TIME_INVALID_MAYBE_SMEARED     4       /* Not supported */
> > +
> > +       /* NON-CONSTANT FIELDS PROTECTED BY SEQCOUNT LOCK */
> > +       uint32_t seq_count;     /* Low bit means an update is in progress */
> > +       /*
> > +        * This field changes to another non-repeating value when the CPU
> > +        * counter is disrupted, for example on live migration. This lets
> > +        * the guest know that it should discard any calibration it has
> > +        * performed of the counter against external sources (NTP/PTP/etc.).
> > +        */
> > +       uint64_t disruption_marker;
> > +       uint64_t flags;
> > +       /* Indicates that the tai_offset_sec field is valid */
> > +#define VMCLOCK_FLAG_TAI_OFFSET_VALID          (1 << 0)
> > +       /*
> > +        * Optionally used to notify guests of pending maintenance events.
> > +        * A guest which provides latency-sensitive services may wish to
> > +        * remove itself from service if an event is coming up. Two flags
> > +        * indicate the approximate imminence of the event.
> > +        */
> > +#define VMCLOCK_FLAG_DISRUPTION_SOON           (1 << 1) /* About a day */
> > +#define VMCLOCK_FLAG_DISRUPTION_IMMINENT       (1 << 2) /* About an hour */
> > +#define VMCLOCK_FLAG_PERIOD_ESTERROR_VALID     (1 << 3)
> > +#define VMCLOCK_FLAG_PERIOD_MAXERROR_VALID     (1 << 4)
> > +#define VMCLOCK_FLAG_TIME_ESTERROR_VALID       (1 << 5)
> > +#define VMCLOCK_FLAG_TIME_MAXERROR_VALID       (1 << 6)
> > +       /*
> > +        * Even regardless of leap seconds, the time presented through this
> > +        * mechanism may not be strictly monotonic. If the counter slows down
> > +        * and the host adapts to this discovery, the time calculated from
> > +        * the value of the counter immediately after an update to this
> > +        * structure, may appear to be *earlier* than a calculation just
> > +        * before the update (while the counter was believed to be running
> > +        * faster than it now is). A guest operating system will typically
> > +        * *skew* its own system clock back towards the reference clock
> > +        * exposed here, rather than following this clock directly. If,
> > +        * however, this structure is being populated from such a system
> > +        * clock which is already handled in such a fashion and the results
> > +        * *are* guaranteed to be monotonic, such monotonicity can be
> > +        * advertised by setting this bit.
> > +        */
>
> I wonder if this might be difficult to define in a standard.

I'm sure we could do better than my attempt above, but surely it isn't
*so* hard to define monotonicity?

> Is there a need to define device and driver behavior in more detail? What
> would happen if e.g. the device first decides how to update the clock, but
> is then slow to update the SHM?

That's OK, isn't it?

The key in the VMCLOCK_FLAG_TIME_MONOTONIC flag is that by setting it,
the host guarantees that the time calculated according to this
structure at any given moment, shall not appear to be later than the
time calculated via the structure at any *later* moment.

The kernel typically takes create care to ensure that time as seen by
userspace (e.g. in the real vDSO) *is* monotonic. If the underlying
counter is speeding up and the relationship from counter to real time
is being adjusted, the kernel is careful to ensure that the latest time
which can be calculated from one version of the data cannot appear
later than the earliest time which can be calculated from the next
version.

However, that naturally means a loss of precision, as if the counter
has been running faster than expected, the apparent time will be later
than the true time, and the kernel has to *overcompensate* for the rate
increase, running the apparent clock slower than the true counter
period until the apparent time converges with true time again.

The time exposed through this structure can be as *precise* as
possible, or it can guarantee monotonicity. This flag allows the host
to advertise the latter. It can be useful for the guest to know,
because a non-monotonic clock is less suitable for *direct* use by
client applications, and more suitable for feeding the guest kernel's
own timekeeping.

> > +#define VMCLOCK_FLAG_TIME_MONOTONIC            (1 << 7)
> > +
> > +       uint8_t pad[2];
> > +       uint8_t clock_status;
> > +#define VMCLOCK_STATUS_UNKNOWN         0
> > +#define VMCLOCK_STATUS_INITIALIZING    1
> > +#define VMCLOCK_STATUS_SYNCHRONIZED    2
> > +#define VMCLOCK_STATUS_FREERUNNING     3
> > +#define VMCLOCK_STATUS_UNRELIABLE      4
> > +
> > +       /*
> > +        * The time exposed through this device is never smeared. This field
> > +        * corresponds to the 'subtype' field in virtio-rtc, which indicates
> > +        * the smearing method. However in this case it provides a *hint* to
> > +        * the guest operating system, such that *if* the guest OS wants to
> > +        * provide its users with an alternative clock which does not follow
> > +        * the POSIX CLOCK_REALTIME standard, it may do so in a fashion
> > +        * consistent with the other systems in the nearby environment.
>
> AFAIU the POSIX.1-2017 standard does not mandate UTC, esp. not w.r.t.
> leap seconds [1, A.4.16 Seconds Since the Epoch]:
>
> > Those applications which do care about leap seconds can determine how to
> > handle them in whatever way those applications feel is best. This was
> > particularly emphasized because there was disagreement about what the best
> > way of handling leap seconds might be. It is a practical impossibility to
> > mandate that a conforming implementation must have a fixed relationship to
> > any particular official clock (consider isolated systems, or systems
> > performing "reruns" by setting the clock to some arbitrary time).
>
> So the above comment should probably refer to UTC instead of POSIX
> CLOCK_REALTIME.

Ack, I'll fix that; thanks.

> > +        */
> > +       uint8_t leap_second_smearing_hint; /* Matches VIRTIO_RTC_SUBTYPE_xxx */
> > +#define VMCLOCK_SMEARING_STRICT                0
> > +#define VMCLOCK_SMEARING_NOON_LINEAR   1
> > +#define VMCLOCK_SMEARING_UTC_SLS       2
> > +       int16_t tai_offset_sec;
> > +       uint8_t leap_indicator; /* Based on VIRTIO_RTC_LEAP_xxx */
> > +#define VMCLOCK_LEAP_NONE      0       /* No known nearby leap second */
> > +#define VMCLOCK_LEAP_PRE_POS   1       /* Leap second + at end of month */
>
> A positive leap second usually means stepping the clock backwards, so
> `Leap second +` is somewhat confusing.

I was trying to avoid hitting 80 characters. I'll rework it; thanks.

> > +#define VMCLOCK_LEAP_PRE_NEG   2       /* Leap second - at end of month */
> > +#define VMCLOCK_LEAP_POS       3       /* Set during 23:59:60 second */
> > +#define VMCLOCK_LEAP_NEG       4       /* Not used in VMCLOCK */
> > +       /*
> > +        * These values are not (yet) in virtio-rtc. They indicate that a
> > +        * leap second *has* occurred at the start of the month. This allows
> > +        * a guest to generate a smeared clock from the accurate clock which
> > +        * this device provides, as smearing may need to continue for up to a
> > +        * period of time *after* the point of the leap second itself. Must
> > +        * be cleared by the 15th day of the month.
> > +        */
> > +#define VMCLOCK_LEAP_POST_POS  5
> > +#define VMCLOCK_LEAP_POST_NEG  6
>
> I think it can still be discussed in the context of virtio-rtc whether we
> should add dedicated identifiers for message-based smeared clock readouts.

Yeah, I think the only thing we really care about for SHM is the POST
values I added there. As long as you'll let me have those, I think the
structure is basically ready to be seriously proposed as an addition to

Attachment: smime.p7s
Description: S/MIME cryptographic signature