Re: [PATCH 1/4] asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
From: Ankur Arora
Date: Thu Mar 06 2025 - 02:54:27 EST
Catalin Marinas <catalin.marinas@xxxxxxx> writes:
> On Mon, Feb 03, 2025 at 01:49:08PM -0800, Ankur Arora wrote:
>> Add smp_cond_load_relaxed_timewait(), a timed variant of
>> smp_cond_load_relaxed(). This is useful for cases where we want to
>> wait on a conditional variable but don't want to wait indefinitely.
>
> Bikeshedding: why not "timeout" rather than "timewait"?
Well my reasons, such as they are, also involved a fair amount of bikeshedding:
- "timewait" and "spinwait" have names of the same length, which
  minimizes indentation churn in the macro definitions.
- "timeout" seems to suggest a timing mechanism of some kind.
>> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
>> index d4f581c1e21d..31de8ed2a05e 100644
>> --- a/include/asm-generic/barrier.h
>> +++ b/include/asm-generic/barrier.h
>> @@ -273,6 +273,54 @@ do { \
>> })
>> #endif
>>
>> +#ifndef smp_cond_time_check_count
>> +/*
>> + * Limit how often smp_cond_load_relaxed_timewait() evaluates time_expr_ns.
>> + * This helps reduce the number of instructions executed while spin-waiting.
>> + */
>> +#define smp_cond_time_check_count 200
>> +#endif
>
> While this was indeed added to the poll_idle() loop, it feels completely
> random in a generic implementation. It's highly dependent on the
> time_expr_ns passed.
I agree about this feeling quite out of place. For one thing, this is
exposed unnecessarily to the arch code.
> Can the caller not move the loop in time_expr_ns
> before invoking this macro?
You mean add a loop count conditional of some kind in the time_cond_expr?
Might need to change the direction of some of the checks, but let me
play with that. (The one below won't work.)
Even if we could do that though, it seems that how often the time-check
is done should be an internal detail of the interface. Seems ugly to expose
this to the caller:
  flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
                                         (VAL & _TIF_NEED_RESCHED),
                                         (count++ % POLL_IDLE_RELAX_COUNT) ||
                                         (local_clock_noinstr() >= time_limit));
(Anyway more on this below.)
>> +
>> +/**
>> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
>> + * guarantees until a timeout expires.
>> + * @ptr: pointer to the variable to wait on
>> + * @cond: boolean expression to wait for
>> + * @time_expr_ns: evaluates to the current time
>> + * @time_limit_ns: compared against time_expr_ns
>> + *
>> + * Equivalent to using READ_ONCE() on the condition variable.
>> + *
>> + * Note that the time check in time_expr_ns can be synchronous or
>> + * asynchronous.
>> + * In the generic version the check is synchronous but kept coarse
>> + * to minimize instructions executed while spin-waiting.
>> + */
>
> Not sure exactly what synchronous vs asynchronous here mean. I see the
> latter more like an interrupt. I guess what you have in mind is the WFE
> wakeup events on arm64, though they don't interrupt the instruction
> flow. I'd not bother specifying this at all.
Yeah, with fresh eyes this feels quite unnecessary.
In the generic version the time check is done in the loop but kept coarse
to minimize instructions executed while spin-waiting. Architecture code
might implement this without spin-waiting.
>> +#ifndef __smp_cond_load_relaxed_spinwait
>> +#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, \
>> + time_limit_ns) ({ \
>> + typeof(ptr) __PTR = (ptr); \
>> + __unqual_scalar_typeof(*ptr) VAL; \
>> + unsigned int __count = 0; \
>> + for (;;) { \
>> + VAL = READ_ONCE(*__PTR); \
>> + if (cond_expr) \
>> + break; \
>> + cpu_relax(); \
>> + if (__count++ < smp_cond_time_check_count) \
>> + continue; \
>> + if ((time_expr_ns) >= (time_limit_ns)) \
>> + break; \
>> + __count = 0; \
>> + } \
>> + (typeof(*ptr))VAL; \
>> +})
>> +#endif
>> +
>> +#ifndef smp_cond_load_relaxed_timewait
>> +#define smp_cond_load_relaxed_timewait __smp_cond_load_relaxed_spinwait
>> +#endif
>
> What I don't particularly like about this interface is (1) no idea of
> what time granularity it offers, how much it can slip past the deadline,
> even though there's some nanoseconds implied and
So, one problem there is that the time granularity probably varies
wildly with the implementation, and some implementations don't have a
clear idea of the granularity at all:
- generic version (x86): does not have a well-defined granularity. The
  assumption is that calling cpu_relax() is "fast", but at least for how
  it is used via poll_idle(), each iteration is at least around 2k-4k
  cycles. (This also depends on the particular CPU model.)
- arm64 WFE version: well-defined, potentially queryable granularity
  (say 100us).
- a future arm64 WFET version: similar to the WFE version but with a
  smaller granularity.
- generic version as the arm64 fallback: granularity not well defined.
  For implementations that treat YIELD as a NOP, it's probably
  extremely fast and hot.
> (2) time_expr_ns leaves
> the caller to figure out which time function to use for tracking the time.
> Well, I can be ok with (2) if we make it a bit more generic.
> The way it is written, I guess the type of the time expression and limit
> no longer matters as long as you can compare them. The naming implies
> nanoseconds but we don't have such precision, especially with the WFE
> implementation for arm64. We could add a slack range argument like the
> delta_ns for some of the hrtimer API and let the arch code decide
> whether to honour it.
Yeah, and as you say, none of the code actually cares about the unit,
so the _ns suffix is unnecessary.
> What about we drop the time_limit_ns and build it into the time_expr_ns
> as a 'time_cond' argument? The latter would return the result of some
> comparison and the loop bails out if true.
Yeah the interface can't actually do anything with the time_expr_ns and
time_limit_ns anyway so just using a combined time_cond seems like a good
idea.
But more below.
> An additional argument would
> be the minimum granularity for checking the time_cond and the arch code
> may decide to fall back to busy loops if the granularity is larger than
> what the caller required.
As you mention, there are two problems with the time-check: constraints
on how far it can overflow the deadline, and how much tardiness we can
tolerate in the actual evaluation of the presumably expensive
time-check (the smp_cond_time_check_count).
Both of these are conceptually related, but the first one only really
applies to WFE-like implementations, and the second only to spinning
implementations.
I think some kind of granularity parameter might help with both of
these.
Taking a step back, we have the following time-related parameters:
1. time_expr
2. time_limit
Or (1) and (2) combined via time_cond.
And these are some different ways to put constraints on the time-check:
3. time_slack: an int, representing some kind of percentage/order etc.
   limit on the overflow. This needs a visible value for (2).
4. time_check_coarse: a bool, specifying coarse or fine.
5. time_granularity: an int, specifying the granularity (say
   1/10/100/1000 us) the caller wants.
   This allows the caller to dynamically choose a particular value
   based on (2); poll_idle(), for instance, would evaluate
   cpuidle_poll_time() to decide what value to use.
IMO (3) and (5) both offer more precision than either of the known
users seems to need:
- poll_idle(): the only thing waiting for it on the other side of an
  expired timeout is a deeper idle state, so it should be fine to be a
  bit late.
- resilient spinlock timeout: the current timeout value is 0.25s
  (RES_DEF_TIMEOUT [1]), which is pretty large.
So, instead it might be enough for the users to just specify what kind
of timeout or time-check they want (coarse/fine via (4)) and let the
interface use that to handle the overflow and tardiness constraints.
With that the generic version could be:
#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_cond,     \
                                         time_check_coarse) ({          \
        typeof(ptr) __PTR = (ptr);                                      \
        __unqual_scalar_typeof(*ptr) VAL;                               \
        unsigned int __count = 0;                                       \
        unsigned int __time_check_count = 200;                          \
                                                                        \
        if (!time_check_coarse)                                         \
                __time_check_count /= 4;                                \
                                                                        \
        for (;;) {                                                      \
                VAL = READ_ONCE(*__PTR);                                \
                if (cond_expr)                                          \
                        break;                                          \
                cpu_relax();                                            \
                if (__count++ < __time_check_count)                     \
                        continue;                                       \
                if (time_cond)                                          \
                        break;                                          \
                __count = 0;                                            \
        }                                                               \
        (typeof(*ptr))VAL;                                              \
})
(The constant is still arbitrary, but it seems to fit in the broad
coarse/fine outlines.)
And the arm64 version:
#define smp_cond_load_relaxed_timewait(ptr, cond_expr,                  \
                                       time_cond, time_check_coarse)    \
({                                                                      \
        __unqual_scalar_typeof(*ptr) _val;                              \
        bool __wfe = arch_timer_evtstrm_available();                    \
                                                                        \
        if (time_check_coarse && __wfe)                                 \
                _val = __smp_cond_load_relaxed_timewait(ptr, cond_expr, \
                                                        time_cond);     \
        else                                                            \
                _val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr, \
                                                        time_cond,      \
                                                        time_check_coarse); \
        (typeof(*ptr))_val;                                             \
})
What do you think?
Oh and thanks for the very helpful review!
[1] https://lore.kernel.org/lkml/20250303152305.3195648-8-memxor@xxxxxxxxx/
--
ankur