Re: [PATCH net-next v8 3/3] gve: implement PTP gettimex64

From: Jordan Rhee

Date: Wed May 27 2026 - 14:48:08 EST

Hello,

We would like to request the community's guidance on how to proceed with adding
cross-timestamp functionality to GVE. The key challenge is that the
NIC clock can only be sampled in the hypervisor because the NIC time
registers are only present on the PF, and a hypercall takes 60-80us.
To minimize the uncertainty window, the hypervisor takes a TSC
sandwich around the reading with an uncertainty window of ~1.2us,
converts it to guest TSC, and returns it to the guest.

The goals which led to this implementation were to minimize the
uncertainty window and communicate the uncertainty window up the
stack. I've reviewed
https://lore.kernel.org/netdev/20260526165826.392227559@xxxxxxxxxx/
but I'm not sure how we would use it, so if it is intended to support
our use case I would appreciate if someone could explain how we
should use it.

Here are some options on how to proceed, but other ideas are welcome.

1) resubmit the patch with the same high level approach with the
feedback addressed. This is our preference since it achieves the smallest
possible uncertainty window and communicates it up the stack.
- move system timestamp sampling and ktime_get_snapshot()
inside the mutex
- remove get_cycles() and the ordering sanity check
2) implement getcrosstimestamp() and return the midpoint of pre/post,
but lose the uncertainty window
3) wait for https://lore.kernel.org/all/20260515164033.6403-1-akiyano@xxxxxxxxxx/
which adds error_bound to getcrosstimestamp()
- error_bound calculation: convert TSC delta to nsec using
tsc_khz/arch_timer_get_cntfrq() => clocks_calc_mult_shift() =>
clocksource_cyc2ns()
- since this patch changes usermode interface, chrony/phc2sys would
need changes to consume the new IOCTLs, and it is unclear whether
there is an intention to do so
4) implement gettimex64() using only system timestamps sampled in the
VM (i.e. remove the precise path). Accept uncertainty window of
60-80us vs. 1-2us
5) implement gettimex64() using the midpoint of pre/post to make a
single call to get_device_system_crosststamp(), then compute the
pre/post ktimes by adding/subtracting the delta/2 between the TSC
readings from the firmware

Thank you,
Jordan

On Wed, May 20, 2026 at 9:35 PM Jordan Rhee <jordanrhee@xxxxxxxxxx> wrote:
>
> Hi Thomas,
> Thank you for the review. We plan to rework this to use
> https://lore.kernel.org/all/20260515164033.6403-1-akiyano@xxxxxxxxxx/,
> which will let us use the midpoint of pre and post while still
> propagating the uncertainty window up the stack. I am going to address
> some of the concerns you raised simply for the record, not because I
> am trying to convince you one way or another. I do not expect a
> response. Thank you,
> Jordan
>
> On Wed, May 20, 2026 at 2:08 PM Thomas Gleixner <tglx@xxxxxxxxxx> wrote:
> >
> > On Thu, May 14 2026 at 22:58, Harshitha Ramamurthy wrote:
> > > From: Jordan Rhee <jordanrhee@xxxxxxxxxx>
> > >
> > > Enable chrony and phc2sys to synchronize system clock to NIC clock.
> > >
> > > Two paths are implemented: a precise path using system counter values
> > > sampled by the device, and a fallback path using system counter values
> > > sampled in the driver using ptp_read_system_prets()/postts().
> > >
> > > To use the precise path, the current system clocksource must match the
> > > units returned by the device, which on x86 is X86_TSC and on ARM64 is
> > > ARM_ARCH_COUNTER. The clockid requested for the cross-timestamp must
> > > be either CLOCK_REALTIME or CLOCK_MONOTONIC_RAW. These conditions hold
> > > by default on GCP VMs using Chrony, so we expect the precise path to be
> > > used the vast majority of the time. If the system clocksource is changed
> > > to kvm-clock, it activates the fallback path. Ethtool counters have been
> > > added to count how many times each path is used.
> > >
> > > The uncertainty window in the precise path is typically around 1-2us,
> > > while in the fallback path is around 60-80us.
> > >
> > > Stub implementions of adjfine and adjtime are added to avoid NULL
> > > dereference when phc2sys tries to adjust the clock.
> >
> > Sorry that I did not react to this earlier, but I was burried in
> > regressions and allowed myself to take a few days off.
> >
> > > +/*
> > > + * Read NIC clock by issuing the AQ command. The command is subject to
> > > + * rate limiting and may need to be retried. Requires nic_ts_read_lock
> > > + * to be held.
> > > + */
> > > +static int gve_ptp_read_timestamp(struct gve_ptp *ptp, cycles_t *pre_cycles,
> > > + cycles_t *post_cycles)
> > > +{
> > > + unsigned long deadline = jiffies + msecs_to_jiffies(100);
> > > + unsigned long delay_us = 1000;
> > > + int err;
> > > +
> > > + lockdep_assert_held(&ptp->nic_ts_read_lock);
> > > +
> > > + do {
> > > + *pre_cycles = get_cycles();
> >
> > This is a completely non-sensical hack. Why do you need get_cycles()
> > here, which is a horrible and ill-defined interface that others are
> > trying to get rid of?
>
> We use get_cycles() purely as a sanity check against the TSC values
> returned by firmware. We do not use them in the time conversion.
> This check found a serious correctness bug in the firmware.
>
> >
> > Just because it's still there is not a good answer and it's actually not
> > required at all. See below.
> >
> > > + err = gve_adminq_report_nic_ts(ptp->priv,
> > > + ptp->nic_ts_report_bus);
> > > +
> > > + /* Prevent get_cycles() from being speculatively executed
> > > + * before the AdminQ command
> > > + */
> > > + rmb();
> > > + *post_cycles = get_cycles();
> > > + if (likely(err != -EAGAIN))
> > > + return err;
> > > +
> > > + fsleep(delay_us);
> > > +
> > > + /* Exponential backoff */
> > > + delay_us *= 2;
> > > + } while (time_before(jiffies, deadline));
> > > +
> > > + return -ETIMEDOUT;
> > > +}
> > > +
> > > /* Read the nic timestamp from hardware via the admin queue. */
> > > -static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw)
> > > +static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw,
> > > + struct gve_sysclock_sample *sysclock)
> > > {
> > > + cycles_t host_pre_cycles, host_post_cycles;
> > > + struct gve_nic_ts_report *ts_report;
> > > int err;
> > >
> > > mutex_lock(&ptp->nic_ts_read_lock);
> >
> > Why is this lock taken _after_ the regular (fallback) pre/post
> > timestamps are aquired?
> >
> > What's wrong with doing:
> >
> > do {
> > guard(mutex)(&ptp->nic_ts_read_lock);
> >
> > ptp_read_system_prets(sts);
> > err = read_nic_timestamp(...);
> > ptp_read_system_postts(sts);
> > } while (err && !timeout(...));
> >
> > return err;
> >
> > which moves the "fallback" timestamp right next to the actual hardware
> > readout?
>
> The lock is taken at the smallest scope possible, which does not
> include the pre/post TS reading. This routine is called by multiple
> callers, not all of which need the pre/post timestamp. Options are to:
> - move the pre/post TS reading here and add branches to conditionally
> sample them only when needed
> - always capture the pre/post TS readings and discard them when not needed
> - duplicate the locking logic, once for path that needs pre/post TS,
> once for path that doesn't need pre/post TS
>
> Since the fallback path is expected to be rarely used, lock contention
> and retries are expected to be extremely rare, and chrony is resilient
> to samples with high uncertainty window, I chose to prioritize
> cleanliness of the code and keep the pre/post TS reading in the caller
> where it is actually needed. That said, I do not feel strongly about this
> and would be open to moving pre/post inside the mutex if I were sending
> another revision.
>
> >
> > That would make the "fallback" case too precise, right?
>
> Moving pre/post inside the mutex makes no measurable difference
> because the lock is rarely contended. There is 4 orders of
> magnitude difference in RMS offset between precise and fallback paths.
>
> >
> > After noticing the obvious, I just checked what our new AI review
> > overlord "sashiko" has to say about this:
> >
> > "Does wrapping the timestamp read outside the locks and retry loops skew the
> > measurements?
> >
> > In the fallback path, ptp_read_system_prets() is captured before calling
> > gve_clock_nic_ts_read(), and postts() is captured after it returns. However,
> > gve_clock_nic_ts_read() acquires ptp->nic_ts_read_lock and calls
> > gve_ptp_read_timestamp(), which can sleep using fsleep() during retries. This
> > means the prets/postts window could include mutex wait times and sleep
> > durations."
> >
> > It _IS_ actually on spot. So why the heck did this garbage get merged
> > without addressing this obvious fail?
>
> - The fallback path is expected to be rarely used since it requires
> changing the default configuration
> - The lock is rarely contended as there are only 2 callers, one of
> which only executes once every 250ms,
> and retries would only occur if a VM exceeded their rate limits
> which have been tuned not to occur
> under normal chrony calling patterns.
>
> >
> > What "sashiko" actually saw aside of that, which I didn't see myself when
> > looking at this, is:
> >
> > "Similarly, in gve_ptp_read_timestamp(), the get_cycles() bounds are taken
> > outside the gve_adminq_report_nic_ts() call, which acquires the shared
> > priv->adminq_lock. If there is lock contention, the cycle bounds could become
> > extremely wide."
> >
> > Oh well.
>
> get_cycles() is only used for sanity checking the values from
> firmware, not in any time conversion.
>
> >
> > So what you really want is:
> >
> > do {
> > guard(mutex)(&ptp->nic_ts_read_lock);
> > guard(mutex)(&gpe_priv->adminq_lock);
> >
> > ptp_read_system_prets(sts);
> > err = read_nic_timestamp(...) {
> > ...
> > return gve_adminq_execute_cmd_locked(....);
> > }
> > ptp_read_system_postts(sts);
> > } while (err && !timeout(...));
> >
> > return err;
> >
> > Or just rely on gpe_priv->adminq_lock in the first place. No?
>
> adminq_lock is private to the adminq layer and protects adminq shared data
> structures. nic_ts_read_lock protects the global DMA buffer that receives
> the timestamps. Since retries are a response to rate limiting, we must wait
> between retries, so we do not want to hold the adminq lock across
> these retries as it would block other non rate-limited commands.
>
> >
> > But what's worse and escaped the AI scrutiny is:
> >
> > Your attempt to convert these 'get_cycles()' time stamps after the
> > fact and outside of the timekeeping core protection is fundamentally
> > wrong.
> >
> > get_device_system_crosststamp() does
> >
> > do {
> > seq = read_seqcount_begin(&tk_core.seq);
> > take_timestamps(...);
> > convert_timestamps(...)
> > } while (read_seqcount_retry(&tk_core.seq, seq));
> >
> > which covers the whole sequence of taking the snapshots from the
> > hardware and the conversion.
> >
> > Your abuse of this is completely out of spec. Just because it did not
> > blow up in your face yet does not make it in any way correct. You do not
> > even get bonus points for creative abuse.
>
> We use ktime_get_snapshot() to capture a historical snapshot before sampling
> the TSC, and we pass it to get_device_system_crosststamp() which
> uses the snapshot to interpolate the sample if it occurs before the current
> timekeeping interval. Perfectly within spec.
>
> >
> >
> > Now let me look at the "firmware" provided timestamps.
> >
> > That's the poor mans emulation of actual hardware timestamps aka PTM.
> >
> > Why do you need all this pre_nic/post_nic muck there? Fix your firmware
> > to actually honor PTM and not provide some made up version of it.
>
> This has to run on platforms that don't support PTM.
>
> >
> > Also the whole debug mechanism of comparing the time stamps of
> > get_cycles() with the timestamps provided by firmware tells me that your
> > firmware is a hacked together piece of garbage.
> >
> > If you want to use "hardware" timestamps then populate the
> > ptp.info.getcrosststamp() callback.
> >
> > If your firmware really can't do PTM and only provides pre and post
> > values, then you simply can take the midpoint in your getcrosststamp()
> > callback:
> >
> > do {
> > guard(mutex)(&ptp->nic_ts_read_lock);
> > err = read_nic_timestamp(...);
> > system_counterval->cycles = (nic_pre + nic_post) / 2;
> > } while (err && !timeout(...));
>
> We considered returning the midpoint of pre/post up through
> getcrosststamp(), but there was
> no way to communicate the uncertainty interval which is important
> for some of our use cases.
> https://lore.kernel.org/all/20260515164033.6403-1-akiyano@xxxxxxxxxx/
> fixes that issue, so we can use that when it lands.
>
> >
> > If you do not even trust your firmware then you still can validate it
> > easily without creating horrible hacks:
> >
> > do {
> > guard(mutex)(...);
> >
> > ptp_read_system_prets(&ctx->sts);
> > err = read_nic_timestamp(...);
> > system_counterval->cycles = (nic_pre + nic_post) / 2;
> > ptp_read_system_postts(&ctx->sts);
> > } while (err && !timeout(...));
> >
> > and then check the ctx->sts timestamps against the results provided by
> > get_device_system_crosststamp() after conversion in your
> > ptp.info.getcrosststamp() callback:
> >
> > err = get_device_system_crosststamp(..., &ctx);
> > if (err)
> > return err;
> > if (ctx.sts.pre_ts > xtstamp.sys... || ctx.sts.post_ts < xtstamp.sys...)
> > return -EBROKENFIRMWARE;
> >
> > no?
>
> get_device_system_crosststamp() has nontrivial logic that operates on the system
> timestamps, so I am not confident that a bad TSC value would be detected
> if we ran it through get_device_system_crosststamp() first. The check
> is simplest
> and strongest when performed against the raw TSC values received from the
> device, before passing them to any kernel API. But I hear your point
> about mistrusting
> the firmware enough to need this check.
>
> >
> > Jakub, please back out this hackery before it makes a precedence for others
> > to copy from.
> >
> > Thanks,
> >
> > tglx