Re: [PATCH] tracing/osnoise: Add option to align tlat threads

From: Tomas Glozar

Date: Mon Mar 02 2026 - 03:50:26 EST

so 28. 2. 2026 v 0:50 odesílatel Crystal Wood <crwood@xxxxxxxxxx> napsal:
> > Add an option called TIMERLAT_ALIGN to osnoise/options, together with a
> > corresponding setting osnoise/timerlat_align_us.
> >
> > This option sets the alignment of wakeup times between different
> > timerlat threads, similarly to cyclictest's -A/--aligned option. If
> > TIMERLAT_ALIGN is set, the first thread that reaches the first cycle
> > records its first wake-up time. Each following thread sets its first
> > wake-up time to a fixed offset from the recorded time, and incremenets
> > it by the same offset.
>
> Why not just set the initial timer expiration to be
> "period + cpu * align_us"? Then you wouldn't need any interaction
> between CPUs.

"period + cpu * align_us" wouldn't quite do it, for two reasons:

1. The wake-up timers are set to absolute time, and are incremented by
"period" (once or multiple times, if the timer is significantly
delayed) each cycle. What can be done as an alternative to what v1
does is this: record the current time when starting the timerlat
tracer (I need to reset align_next to zero anyway even with the v1
design, that is a bug in the patch), and increment from that.

2. "cpu" makes a poor thread ID here. If my period is 1000us, and I
run on CPUs 0 and 100 with alignment 10, suddenly, the space between
the threads becomes 1000us, which is equivalent to 0us. I would need
to go through the cpuset and assign numbers from 0 to n to each CPU.
That would guarantee a fixed spacing of the threads independent of
when the threads wake up in the first cycle (unlike the v1 design),
but it would make the implementation more complex, since I would have
to store the numbers.

If I implemented both of those ideas, the interaction between the CPUs
can indeed be gotten rid of. I'm not sure if it is a better solution,
though. Another motivation of recording the first thread wake-up was
that when using user threads, the first thread might be created some
time after the tracer is enabled, and I did not want to have a large
gap that would have to be corrected by the while loop at the end of
wait_next_period().

>
> > kernel/trace/trace_osnoise.c | 34 +++++++++++++++++++++++++++++++++-
> > 1 file changed, 33 insertions(+), 1 deletion(-)
>
> Documentation needs to be updated as well.
>
> Should mention that updating align_us while the timer is running won't
> take effect immediately (unlike period, which does).
>

Good idea, thanks! In general, I'm not expecting the user to change
timerlat parameters during a measurement - but it is supported, and
should be documented.

> > diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> > index dee610e465b9..df1d4529d226 100644
> > --- a/kernel/trace/trace_osnoise.c
> > +++ b/kernel/trace/trace_osnoise.c
> > @@ -58,6 +58,7 @@ enum osnoise_options_index {
> > OSN_PANIC_ON_STOP,
> > OSN_PREEMPT_DISABLE,
> > OSN_IRQ_DISABLE,
> > + OSN_TIMERLAT_ALIGN,
> > OSN_MAX
> > };
> >
> > @@ -66,7 +67,8 @@ static const char * const osnoise_options_str[OSN_MAX] = {
> > "OSNOISE_WORKLOAD",
> > "PANIC_ON_STOP",
> > "OSNOISE_PREEMPT_DISABLE",
> > - "OSNOISE_IRQ_DISABLE" };
> > + "OSNOISE_IRQ_DISABLE",
> > + "TIMERLAT_ALIGN" };
>
> Do we really need a flag for this, or can we just interpret a non-zero
> align_us value as enabling the feature?
>

Yes, we need a flag for this, because a zero alignment is a common use case.

I used it in cyclictest to measure the overhead of a large number of
threads waking up at the same time. Similarly, a non-zero alignment
will get rid of most of that overhead. Without alignment set, the
thread wake-ups offsets are semi-random, depending on how the threads
wake up, which might lead to inconsistent results where one run has
good numbers and another run bad numbers, since the alignment is
determined in the first cycle.

For example, here are some bpftrace numbers (from the same command as
in the original message) with wake-ups with timerlat_align_us = 0:

@time[12]: 1
@time[11]: 1
@time[13]: 1
@time[10]: 3
@time[16]: 6
@time[19]: 6
@time[15]: 6
@time[18]: 7
@time[14]: 7
@time[17]: 71
@time[20]: 997

> > @@ -1820,6 +1824,7 @@ static int wait_next_period(struct timerlat_variables *tlat)
> > {
> > ktime_t next_abs_period, now;
> > u64 rel_period = osnoise_data.timerlat_period * 1000;
> > + static atomic64_t align_next;
>
> How will this get reset if the tracer is stopped and restarted?
>

It won't, I forgot to reset it. See my comment above in my reply to Steve,

> > now = hrtimer_cb_get_time(&tlat->timer);
> > next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
> > @@ -1829,6 +1834,17 @@ static int wait_next_period(struct timerlat_variables *tlat)
> > */
> > tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
> >
> > + if (test_bit(OSN_TIMERLAT_ALIGN, &osnoise_options) && !tlat->count
> > + && atomic64_cmpxchg_relaxed(&align_next, 0, tlat->abs_period)) {
> > + /*
> > + * Align thread in first cycle on each CPU to the set alignment.
> > + */
> > + tlat->abs_period = atomic64_fetch_add_relaxed(osnoise_data.timerlat_align_us * 1000,
> > + &align_next);
> > + tlat->abs_period += osnoise_data.timerlat_align_us * 1000;
> > + next_abs_period = ns_to_ktime(tlat->abs_period);
> > + }
>
> I'm already unclear about the existing purpose of next_abs_period, but
> if it has any use at all shouldn't it be to avoid writing intermediate
> values like this back to tlat?
>

next_abs_period is basically just the ktime_t variant of
tlat->abs_period for local calculations of the next period inside
wait_next_period(). Its only purpose is the ktime_compare() call that
increments tlat->abs_period by the period until it lands into the
future, if it happens to be in the past. This is necessary to do for
both a regular cycle (which might take long due to noise) and the
first cycle with alignment (because the other thread's first wake up
might be late), so it has to be set in the new code as well,
otherwise, the while loop won't see the time is in the past.

I agree that this part of the code is confusing. There is also a
field, timerlat_variables.rel_period (tlat->rel_period), that is not
used anywhere, since the relative period is pulled out of
osnoise_variables. Something like this would be easier to read and
comprehend, IMHO:

/*
* wait_next_period - Wait for the next period for timerlat
*/
static int wait_next_period(struct timerlat_variables *tlat)
{
ktime_t now;
u64 rel_period = osnoise_data.timerlat_period * 1000;

now = hrtimer_cb_get_time(&tlat->timer);

/*
* Set the next abs_period.
*/
tlat->abs_period += rel_period;

/*
* If the new abs_period is in the past, skip the activation.
*/
while (ktime_compare(now, ns_to_ktime(tlat->abs_period) > 0) {
next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
}

set_current_state(TASK_INTERRUPTIBLE);

hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);
schedule();
return 1;
}

(Excluding the changes from this patch.) What do you think?

Tomas