Re: [PATCH] tracing/osnoise: Add option to align tlat threads
From: Tomas Glozar
Date: Fri Mar 06 2026 - 10:16:23 EST
út 3. 3. 2026 v 4:21 odesílatel Crystal Wood <crwood@xxxxxxxxxx> napsal:
> > 1. The wake-up timers are set to absolute time, and are incremented by
> > "period" (once or multiple times, if the timer is significantly
> > delayed) each cycle. What can be done as an alternative to what v1
> > does is this: record the current time when starting the timerlat
> > tracer (I need to reset align_next to zero anyway even with the v1
> > design, that is a bug in the patch), and increment from that.
>
> I was only talking about doing this for the initial expiration, not on
> increment.
>
Ah, I misread "period" as "relative period", since the initial
absolute period is determined from the current time in the first cycle
of each thread.
> > 2. "cpu" makes a poor thread ID here. If my period is 1000us, and I
> > run on CPUs 0 and 100 with alignment 10, suddenly, the space between
> > the threads becomes 1000us, which is equivalent to 0us. I would need
> > to go through the cpuset and assign numbers from 0 to n to each CPU.
> > That would guarantee a fixed spacing of the threads independent of
> > when the threads wake up in the first cycle (unlike the v1 design),
> > but it would make the implementation more complex, since I would have
> > to store the numbers.
>
> Right, I was thinking of just a few CPUs missing not being a big deal,
> but on big systems with only a few CPUs running the test it does
> matter.
>
My point is mostly that the spacing of the threads shouldn't depend on
the CPU numbers. One has to make sure the alignment doesn't overflow
the period anyway if they want to have completely time-isolated
wake-ups.
> Instead of assigning numbers, could you just loop over each CPU's
> tlat->abs_period and set the initial expiration with appropriate
> offset (prior to starting any of the threads)? Then the thread would
> not need to care about anything other than the usual increment.
>
> > If I implemented both of those ideas, the interaction between the CPUs
> > can indeed be gotten rid of. I'm not sure if it is a better solution,
> > though. Another motivation of recording the first thread wake-up was
> > that when using user threads, the first thread might be created some
> > time after the tracer is enabled, and I did not want to have a large
> > gap that would have to be corrected by the while loop at the end of
> > wait_next_period().
>
> What is the actual concerning impact here?
>
> If we want to be really paranoid we could check for pending signals
> during the loop, in case userspace delayed so long (with a very short
> period) that the user has a hard time with ctrl+c and such... but that
> could happen already if userspace does something silly (e.g. stopped
> by a debugger?) between loops.
>
The while loop is designed only to handle "small" time differences,
with respect to the relative period. When using timerlat manually with
a user workload, it might take the user a few seconds/seconds/hours
before they start the user process (typing the command line, or if the
user e.g. has a snack or coffee in between), which has to be corrected
by the while loop. This does not interact well with a low period.
Consider the following scenario, assuming the initial absolute period
is set on timerlat tracer enablement:
1. The user enables the timerlat tracer with NO_OSNOISE_WORKLOAD and
100us period.
2. The user steps away for 1 hour.
3. After 10 seconds, tlat->abs_period is 3 600 000 000us in the past.
The while loop starts incrementing tlat->abs_period by 100us, taking
36 000 000 loops. If one iteration takes 10 CPU cycles on a 1GHz CPU,
the while loop itself will take 360us (which is >100us).
4. The timerlat thread never wakes up, since the wake-up time even
after the correction is in the past.
This is much more reasonable than the user stopping the thread inside
the while loop. Actually, this scenario can already happen without
alignment, since the scheduler might preempt the thread inside the
while loop for more than the period - but that is a separate issue.
Also, the current implementation is relatively simple (and hopefully
also easy to understand with the comments in v2), so my idea is that
we can use it for now, and if we want deterministing alignment in the
future, we can always improve it.
> > Good idea, thanks! In general, I'm not expecting the user to change
> > timerlat parameters during a measurement - but it is supported, and
> > should be documented.
>
> Maybe better to phrase it as "not guaranteed to take effect
> immediately" :-)
>
I think we can afford to be less vague than that here :) something
like, "setting the timerlat_align_us option will take affect upon next
timerlat tracer start".
> > I used it in cyclictest to measure the overhead of a large number of
> > threads waking up at the same time. Similarly, a non-zero alignment
> > will get rid of most of that overhead. Without alignment set, the
> > thread wake-ups offsets are semi-random, depending on how the threads
> > wake up, which might lead to inconsistent results where one run has
> > good numbers and another run bad numbers, since the alignment is
> > determined in the first cycle.
>
> OK, I was viewing the staggering as the main point, but I see how the
> alignment itself helps too.
>
> Is there a use case for not always doing the alignment?
> Other than people asking why their numbers suddenly got worse...
>
Yes - to simulate the default behavior of cyclictest without
-A/--aligned, and of multi-thread cyclic workloads that do not align
their threads respectively. Even if we wanted to always use the
alignment, it should not default to 0 IMHO, so that users don't see a
degradation like you mention.
> > next_abs_period is basically just the ktime_t variant of
> > tlat->abs_period for local calculations of the next period inside
> > wait_next_period(). Its only purpose is the ktime_compare() call that
> > increments tlat->abs_period by the period until it lands into the
> > future, if it happens to be in the past. This is necessary to do for
> > both a regular cycle (which might take long due to noise) and the
> > first cycle with alignment (because the other thread's first wake up
> > might be late), so it has to be set in the new code as well,
> > otherwise, the while loop won't see the time is in the past.
>
> Oh, I missed the unit difference (though it's basically just a typedef
> at this point).
>
> > I agree that this part of the code is confusing. There is also a
> > field, timerlat_variables.rel_period (tlat->rel_period), that is not
> > used anywhere, since the relative period is pulled out of
> > osnoise_variables. Something like this would be easier to read and
> > comprehend, IMHO:
>
> Yeah, I noticed that as well... we should remove it if we're not going
> to use it.
>
Yeah.
> > /*
> > * wait_next_period - Wait for the next period for timerlat
> > */
> > static int wait_next_period(struct timerlat_variables *tlat)
> > {
> > ktime_t now;
> > u64 rel_period = osnoise_data.timerlat_period * 1000;
> >
> > now = hrtimer_cb_get_time(&tlat->timer);
> >
> > /*
> > * Set the next abs_period.
> > */
> > tlat->abs_period += rel_period;
> >
> > /*
> > * If the new abs_period is in the past, skip the activation.
> > */
> > while (ktime_compare(now, ns_to_ktime(tlat->abs_period) > 0) {
> > next_abs_period = ns_to_ktime(tlat->abs_period + rel_period);
> > tlat->abs_period = (u64) ktime_to_ns(next_abs_period);
> > }
> >
> > set_current_state(TASK_INTERRUPTIBLE);
> >
> > hrtimer_start(&tlat->timer, next_abs_period, HRTIMER_MODE_ABS_PINNED_HARD);
> > schedule();
> > return 1;
> > }
> >
> > (Excluding the changes from this patch.) What do you think?
>
> Why can't we just make tlat->abs_period and every other time variable
> in this file be ktime_t? Other than atomic stuff if we do go that
> route.
>
> Not saying that that should hold up this patch, just an idea to simplify things.
>
The reason for that is that the code does arithmetic directly on the
ns unit form of the time, without the need to use ktime_add(). I don't
see anything on top of that. I see ktime_add(x, y) is just x + y
nowadays, so that would work.
Tomas