Re: locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()

From: Paul E. McKenney
Date: Wed Oct 09 2024 - 14:21:14 EST


On Wed, Oct 09, 2024 at 08:07:08PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 09, 2024 at 10:57:24AM -0700, Paul E. McKenney wrote:
> > Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock()
> > to check for excessive CSD-lock wait times. This works, but does not
> > guarantee monotonic timestamps.
>
> It does if you provide a sane TSC.

What is this "sane TSC" of which you speak? ;-)

More seriously, the raw reads from the TSC that are carried out by
sched_clock() are not guaranteed to be monotonic due to potential
instruction reordering and the like. This is *not* a theoretical
statement -- we really do see this on the fleet. Very rarely for any
given system, to be sure, but not at all rare across the full set of them.
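
To make that concrete, here is a minimal user-space sketch using
compiler intrinsics (illustration only, not the kernel's actual
rdtsc()/rdtsc_ordered() implementations) of the difference between an
unordered TSC read and one preceded by a fence:

	#include <stdint.h>
	#include <x86intrin.h>

	static inline uint64_t tsc_unordered(void)
	{
		/* May be speculated ahead of earlier instructions. */
		return __rdtsc();
	}

	static inline uint64_t tsc_ordered(void)
	{
		/*
		 * LFENCE waits for prior instructions to complete,
		 * so this read cannot appear to happen "early".
		 */
		_mm_lfence();
		return __rdtsc();
	}

A pair of tsc_unordered() values taken on nearby code paths can thus
appear to run backwards, while tsc_ordered() (roughly what the
kernel's rdtsc_ordered() does) cannot.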

Because the wait-time delta is computed with unsigned 64-bit
arithmetic, a timestamp that steps backwards by even one nanosecond
wraps around. The result is false-positive CSD-lock complaints
claiming almost 2^64 nanoseconds of delay, which are not good
complaints to have.
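
To make the arithmetic concrete, a trivial user-space sketch with
made-up timestamp values:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t ts1 = 1000000;	/* earlier timestamp */
		uint64_t ts2 =  999990;	/* "later" read, 10ns in the past */

		/*
		 * Unsigned wraparound: prints 18446744073709551606,
		 * which is 2^64 - 10.
		 */
		printf("%llu\n", (unsigned long long)(ts2 - ts1));
		return 0;
	}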

> > Therefore, switch from sched_clock()
> > to ktime_get_mono_fast_ns(), which does guarantee monotonic timestamps,
> > at least in the absence of calls from NMI handlers, which are not involved
> > in this code path.
>
> That can end up using HPET in the case of non-sane TSC.

That is true. However...

> In the good case they're equal, in the bad case you're switching from
> slightly dodgy time to super expensive time. Is that what you want?

If the TSC is not sane, we don't want to be using the system at all.
In fact, the super-expensive time will more often than not result in
the system being automatically taken out of service due to the excessive
application-level latencies.

But in the good case where the TSC is sane, successive reads from the
TSC must be ordered to avoid those false-positive complaints. Yes,
this adds a bit of overhead, but the CPU isn't doing anything useful
while spinning on the lock anyway, so this is not a problem. The same
lack of concern might well apply to HPET reads.

Should I upgrade the commit log? Or am I missing your point?

Thanx, Paul

> > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > Cc: Neeraj Upadhyay <neeraj.upadhyay@xxxxxxxxxx>
> > Cc: Rik van Riel <riel@xxxxxxxxxxx>
> > Cc: Leonardo Bras <leobras@xxxxxxxxxx>
> > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> > Cc: "Peter Zijlstra (Intel)" <peterz@xxxxxxxxxxxxx>
> >
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index f25e20617b7eb..27dc31a146a35 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -246,7 +246,7 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
> > return true;
> > }
> >
> > - ts2 = sched_clock();
> > + ts2 = ktime_get_mono_fast_ns();
> > /* How long since we last checked for a stuck CSD lock.*/
> > ts_delta = ts2 - *ts1;
> > if (likely(ts_delta <= csd_lock_timeout_ns * (*nmessages + 1) *
> > @@ -321,7 +321,7 @@ static void __csd_lock_wait(call_single_data_t *csd)
> > int bug_id = 0;
> > u64 ts0, ts1;
> >
> > - ts1 = ts0 = sched_clock();
> > + ts1 = ts0 = ktime_get_mono_fast_ns();
> > for (;;) {
> > if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))
> > break;