Re: [PATCH net-next v5 3/3] gve: implement PTP gettimex64

From: Jordan Rhee

Date: Tue May 05 2026 - 14:55:57 EST

On Tue, May 5, 2026 at 11:22 AM Jordan Rhee <jordanrhee@xxxxxxxxxx> wrote:
>
> On Tue, May 5, 2026 at 1:08 AM Paolo Abeni <pabeni@xxxxxxxxxx> wrote:
> >
> > On 4/29/26 3:28 AM, Harshitha Ramamurthy wrote:
> > > From: Jordan Rhee <jordanrhee@xxxxxxxxxx>
> > >
> > > Enable chrony and phc2sys to synchronize system clock to NIC clock.
> > >
> > > The system cycle counters are sampled by the device to minimize the
> > > uncertainty window. If the system times are sampled in the host, the
> > > delta between pre and post readings is 100us or more due to AQ command
> > > latency. The system times returned by the device have a delta of ~1us,
> > > which enables significantly more accurate clock synchronization.
> > >
> > > Reviewed-by: Willem de Bruijn <willemb@xxxxxxxxxx>
> > > Reviewed-by: Kevin Yang <yyd@xxxxxxxxxx>
> > > Reviewed-by: Naman Gulati <namangulati@xxxxxxxxxx>
> > > Signed-off-by: Jordan Rhee <jordanrhee@xxxxxxxxxx>
> > > Signed-off-by: Harshitha Ramamurthy <hramamurthy@xxxxxxxxxx>
> > > ---
> > > Changes in v5:
> > > - Reformulate retry loop in terms of total timeout (Jakub Kicinski)
> > >
> > > Changes in v3:
> > > - Take system time snapshot inside the mutex
> > > - Return -EOPNOTSUPP if cross-timestamp is requested on an arch other
> > > than x86 or arm64
> > >
> > > Changes in v2:
> > > - fix compilation warning on ARM by casting cycles_t to u64
> > > ---
> > > drivers/net/ethernet/google/gve/gve_adminq.h | 4 +-
> > > drivers/net/ethernet/google/gve/gve_ptp.c | 196 ++++++++++++++++++-
> > > 2 files changed, 191 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/google/gve/gve_adminq.h b/drivers/net/ethernet/google/gve/gve_adminq.h
> > > index 22a74b6aa17e..e6dcf6da9091 100644
> > > --- a/drivers/net/ethernet/google/gve/gve_adminq.h
> > > +++ b/drivers/net/ethernet/google/gve/gve_adminq.h
> > > @@ -411,8 +411,8 @@ static_assert(sizeof(struct gve_adminq_report_nic_ts) == 16);
> > >
> > > struct gve_nic_ts_report {
> > > __be64 nic_timestamp; /* NIC clock in nanoseconds */
> > > - __be64 reserved1;
> > > - __be64 reserved2;
> > > + __be64 pre_cycles; /* System cycle counter before NIC clock read */
> > > + __be64 post_cycles; /* System cycle counter after NIC clock read */
> > > __be64 reserved3;
> > > __be64 reserved4;
> > > };
> > > diff --git a/drivers/net/ethernet/google/gve/gve_ptp.c b/drivers/net/ethernet/google/gve/gve_ptp.c
> > > index ad15f1209a83..c6c98ef825aa 100644
> > > --- a/drivers/net/ethernet/google/gve/gve_ptp.c
> > > +++ b/drivers/net/ethernet/google/gve/gve_ptp.c
> > > @@ -10,28 +10,210 @@
> > > /* Interval to schedule a nic timestamp calibration, 250ms. */
> > > #define GVE_NIC_TS_SYNC_INTERVAL_MS 250
> > >
> > > +/*
> > > + * Stores cycle counter samples in get_cycles() units from a
> > > + * sandwiched NIC clock read
> > > + */
> > > +struct gve_sysclock_sample {
> > > + /* system time snapshot taken just before issuing AdminQ command */
> > > + struct system_time_snapshot snapshot;
> > > + /* Cycle counter from NIC before clock read */
> > > + u64 nic_pre_cycles;
> > > + /* Cycle counter from NIC after clock read */
> > > + u64 nic_post_cycles;
> > > + /* Cycle counter from host before issuing AQ command */
> > > + cycles_t host_pre_cycles;
> > > + /* Cycle counter from host after AQ command returns */
> > > + cycles_t host_post_cycles;
> > > +};
> > > +
> > > +/*
> > > + * Read NIC clock by issuing the AQ command. The command is subject to
> > > + * rate limiting and may need to be retried. Requires nic_ts_read_lock
> > > + * to be held.
> > > + */
> > > +static int gve_ptp_read_timestamp(struct gve_ptp *ptp, cycles_t *pre_cycles,
> > > + cycles_t *post_cycles,
> > > + struct system_time_snapshot *snap)
> > > +{
> > > + unsigned long deadline = jiffies + msecs_to_jiffies(100);
> > > + unsigned long delay_us = 1000;
> > > + int err;
> > > +
> > > + lockdep_assert_held(&ptp->nic_ts_read_lock);
> > > +
> > > + do {
> > > + if (snap)
> > > + ktime_get_snapshot(snap);
> > > +
> > > + *pre_cycles = get_cycles();
> > > + err = gve_adminq_report_nic_ts(ptp->priv,
> > > + ptp->nic_ts_report_bus);
> > > +
> > > + /* Prevent get_cycles() from being speculatively executed
> > > + * before the AdminQ command
> > > + */
> > > + rmb();
> > > + *post_cycles = get_cycles();
> > > + if (likely(err != -EAGAIN))
> > > + return err;
> > > +
> > > + fsleep(delay_us);
> > > +
> > > + /* Exponential backoff */
> > > + delay_us *= 2;
> > > + } while (time_before(jiffies, deadline));
> > > +
> > > + return -ETIMEDOUT;
> > > +}
> > > +
> > > /* Read the nic timestamp from hardware via the admin queue. */
> > > -static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw)
> > > +static int gve_clock_nic_ts_read(struct gve_ptp *ptp, u64 *nic_raw,
> > > + struct gve_sysclock_sample *sysclock)
> > > {
> > > + cycles_t host_pre_cycles, host_post_cycles;
> > > + struct gve_nic_ts_report *ts_report;
> > > int err;
> > >
> > > mutex_lock(&ptp->nic_ts_read_lock);
> > > - err = gve_adminq_report_nic_ts(ptp->priv, ptp->nic_ts_report_bus);
> > > - if (err)
> > > + err = gve_ptp_read_timestamp(ptp, &host_pre_cycles, &host_post_cycles,
> > > + sysclock ? &sysclock->snapshot : NULL);
> > > + if (err) {
> > > + dev_err_ratelimited(&ptp->priv->pdev->dev,
> > > + "AdminQ timestamp read failed: %d\n", err);
> > > goto out;
> > > + }
> > >
> > > - *nic_raw = be64_to_cpu(ptp->nic_ts_report->nic_timestamp);
> > > + ts_report = ptp->nic_ts_report;
> > > + *nic_raw = be64_to_cpu(ts_report->nic_timestamp);
> > > +
> > > + if (sysclock) {
> > > + sysclock->nic_pre_cycles = be64_to_cpu(ts_report->pre_cycles);
> > > + sysclock->nic_post_cycles = be64_to_cpu(ts_report->post_cycles);
> > > + sysclock->host_pre_cycles = host_pre_cycles;
> > > + sysclock->host_post_cycles = host_post_cycles;
> > > + }
> > >
> > > out:
> > > mutex_unlock(&ptp->nic_ts_read_lock);
> > > return err;
> > > }
> > >
> > > +struct gve_cycles_to_clock_callback_ctx {
> > > + u64 cycles;
> > > +};
> > > +
> > > +static int gve_cycles_to_clock_fn(ktime_t *device_time,
> > > + struct system_counterval_t *system_counterval,
> > > + void *ctx)
> > > +{
> > > + struct gve_cycles_to_clock_callback_ctx *context = ctx;
> > > +
> > > + *device_time = 0;
> > > +
> > > + system_counterval->cycles = context->cycles;
> > > + system_counterval->use_nsecs = false;
> > > +
> > > + if (IS_ENABLED(CONFIG_X86))
> > > + system_counterval->cs_id = CSID_X86_TSC;
> > > + else if (IS_ENABLED(CONFIG_ARM64))
> > > + system_counterval->cs_id = CSID_ARM_ARCH_COUNTER;
> > > + else
> > > + return -EOPNOTSUPP;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * Convert a raw cycle count (e.g. from get_cycles()) to the system clock
> > > + * type specified by clockid. The system_time_snapshot must be taken before
> > > + * the cycle counter is sampled.
> > > + */
> > > +static int gve_cycles_to_timespec64(struct gve_priv *priv, clockid_t clockid,
> > > + struct system_time_snapshot *snap,
> > > + u64 cycles, struct timespec64 *ts)
> > > +{
> > > + struct gve_cycles_to_clock_callback_ctx ctx = {0};
> > > + struct system_device_crosststamp xtstamp;
> > > + int err;
> > > +
> > > + ctx.cycles = cycles;
> > > + err = get_device_system_crosststamp(gve_cycles_to_clock_fn, &ctx, snap,
> > > + &xtstamp);
> > > + if (err) {
> > > + dev_err_ratelimited(&priv->pdev->dev,
> > > + "get_device_system_crosststamp() failed to convert %lld cycles to system time: %d\n",
> > > + cycles,
> > > + err);
> > > + return err;
> > > + }
> > > +
> > > + switch (clockid) {
> > > + case CLOCK_REALTIME:
> > > + *ts = ktime_to_timespec64(xtstamp.sys_realtime);
> > > + break;
> > > + case CLOCK_MONOTONIC_RAW:
> > > + *ts = ktime_to_timespec64(xtstamp.sys_monoraw);
> > > + break;
> > > + default:
> > > + dev_err_ratelimited(&priv->pdev->dev,
> > > + "Cycle count conversion to clockid %d not supported\n",
> > > + clockid);
> > > + return -EOPNOTSUPP;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > static int gve_ptp_gettimex64(struct ptp_clock_info *info,
> > > struct timespec64 *ts,
> > > struct ptp_system_timestamp *sts)
> > > {
> > > - return -EOPNOTSUPP;
> > > + struct gve_ptp *ptp = container_of(info, struct gve_ptp, info);
> > > + struct gve_sysclock_sample sysclock = {0};
> > > + struct gve_priv *priv = ptp->priv;
> > > + u64 nic_ts;
> > > + int err;
> > > +
> > > + if (sts && !(IS_ENABLED(CONFIG_X86) || IS_ENABLED(CONFIG_ARM64)))
> > > + return -EOPNOTSUPP;
> > > +
> > > + err = gve_clock_nic_ts_read(ptp, &nic_ts, sts ? &sysclock : NULL);
> > > + if (err)
> > > + return err;
> > > +
> > > + if (sts) {
> > > + /* Reject samples with out of order system clock values */
> > > + if (!(sysclock.host_pre_cycles <= sysclock.nic_pre_cycles &&
> > > + sysclock.nic_pre_cycles <= sysclock.nic_post_cycles &&
> > > + sysclock.nic_post_cycles <= sysclock.host_post_cycles)) {
> > > + dev_err_ratelimited(&priv->pdev->dev,
> > > + "AdminQ system clock cycle counts out of order. Expecting %llu <= %llu <= %llu <= %llu\n",
> > > + (u64)sysclock.host_pre_cycles,
> > > + sysclock.nic_pre_cycles,
> > > + sysclock.nic_post_cycles,
> > > + (u64)sysclock.host_post_cycles);
> > > + return -EBADMSG;
> >
> > Sashiko/gemini is reporting the following:
> >
> > ---
> > If older firmware does not support this feature and returns 0 for the
> > pre_cycles and post_cycles fields, won't host_pre_cycles <=
> > nic_pre_cycles evaluate to false since get_cycles() will be greater than
> > 0? If this occurs, the driver returns -EBADMSG instead of -EOPNOTSUPP.
> > Does returning -EBADMSG prevent userspace tools from gracefully falling
> > back to legacy PTP ioctls like PTP_SYS_OFFSET, causing PTP
> > synchronization to fail completely on older firmwares?
> > ---
> >
> > which looks legit to me, or am I missing something? Note that
> > proactively triaging sashiko comments would help maintainers a lot.
> >
> > Thanks,
> >
> > Paolo
> >
>
> Hi Paolo, my apologies for not addressing this proactively.
>
> Firmware is not allowed to return 0 for pre/post cycles. This would
> violate the API contract, so returning failure is the appropriate
> response.
>
> The kernel PTP stack has no special handling for EBADMSG vs
> EOPNOTSUPP, and neither does Chrony
> (https://github.com/mlichvar/chrony/blob/866bedef0b648fdd6cc68f8cb4d75e4303244c76/sys_linux.c#L959).
> Returning EOPNOTSUPP instead of EBADMSG would not change the behavior
> of upper layers.
>
> > > +
> > > + err = gve_cycles_to_timespec64(priv, sts->clockid,
> > > + &sysclock.snapshot,
> > > + sysclock.nic_pre_cycles,
> > > + &sts->pre_ts);
> > > + if (err)
> > > + return err;
> >
> > Sashiko writes:
> >
> > Looking at gve_cycles_to_timespec64(), it relies on
> > get_device_system_crosststamp() which hardcodes the clocksource to
> > CSID_X86_TSC or CSID_ARM_ARCH_COUNTER.
> >
> > If the guest timekeeper clocksource is something else, such as kvm-clock,
> > get_device_system_crosststamp() will return -ENODEV which is propagated here.
> >
> > Since userspace generally only falls back to legacy PTP ioctls when receiving
> > -EOPNOTSUPP or -ENOTTY, will returning -ENODEV cause the extended ioctl to
> > fatally fail and break synchronization? Should we intercept these errors and
> > return -EOPNOTSUPP instead to preserve the fallback behavior?
>
> The system clocksource returned by the callback does not need to match
> the current system clocksource. get_device_system_crosststamp()
> converts the supplied value to the system clocksource using
> convert_base_to_cs(). I've validated that cross-timestamp still works
> when system clocksource is kvm-clock:

This is incorrect, there was a mistake in my testing.
get_device_system_crosststamp() fails if the clocksource is not TSC. I
will ensure there is a fallback to
ptp_read_system_prets()/ptp_read_system_postts() if the system clock
source is not TSC or ARM_ARCH_COUNTER.

>
> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> tsc
> $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
> tsc kvm-clock acpi_pm
> $ echo "kvm-clock" | sudo tee
> /sys/devices/system/clocksource/clocksource0/current_clocksource
> kvm-clock
> $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> kvm-clock
> $ sudo systemctl start chrony
> $ chronyc tracking
> Reference ID : 50484330 (PHC0)
> Stratum : 1
> Ref time (UTC) : Tue May 05 18:18:20 2026
> System time : 0.000000703 seconds fast of NTP time
> Last offset : +0.000001757 seconds
> RMS offset : 0.000004796 seconds
> Frequency : 0.164 ppm fast
> Residual freq : +0.046 ppm
> Skew : 3.106 ppm
> Root delay : 0.000000001 seconds
> Root dispersion : 0.000013117 seconds
> Update interval : 0.5 seconds
> Leap status : Normal