Re: [RFC PATCH 0/3] sched: add ability to throttle sched_yield() calls to reduce contention
From: Kuba Piecuch
Date: Wed Aug 20 2025 - 11:54:40 EST
On Tue, Aug 19, 2025 at 4:08 PM Kuba Piecuch <jpiecuch@xxxxxxxxxx> wrote:
>
> On Thu, Aug 14, 2025 at 4:53 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Aug 11, 2025 at 03:35:35PM +0200, Kuba Piecuch wrote:
> > > On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Aug 08, 2025 at 08:02:47PM +0000, Kuba Piecuch wrote:
> > > > > Problem statement
> > > > > =================
> > > > >
> > > > > Calls to sched_yield() can touch data shared with other threads.
> > > > > Because of this, userspace threads could generate high levels of contention
> > > > > by calling sched_yield() in a tight loop from multiple cores.
> > > > >
> > > > > For example, if cputimer is enabled for a process (e.g. through
> > > > > setitimer(ITIMER_PROF, ...)), all threads of that process
> > > > > will do an atomic add on the per-process field
> > > > > p->signal->cputimer->cputime_atomic.sum_exec_runtime inside
> > > > > account_group_exec_runtime(), which is called inside update_curr().
> > > > >
> > > > > Currently, calling sched_yield() will always call update_curr() at least
> > > > > once in schedule(), and potentially one more time in yield_task_fair().
> > > > > Thus, userspace threads can generate quite a lot of contention for the
> > > > > cacheline containing cputime_atomic.sum_exec_runtime if multiple threads of
> > > > > a process call sched_yield() in a tight loop.
> > > > >
> > > > > At Google, we suspect that this contention led to a full machine lockup in
> > > > > at least one instance, with ~50% of CPU cycles spent in the atomic add
> > > > > inside account_group_exec_runtime() according to
> > > > > `perf record -a -e cycles`.
> > > >
> > > > I've gotta ask, WTH is your userspace calling yield() so much?
> > >
> > > The code calling sched_yield() was in the wait loop for a spinlock. It
> > > would repeatedly yield until the compare-and-swap instruction succeeded
> > > in acquiring the lock. This code runs in the SIGPROF handler.
> >
> > Well, then don't do that... userspace spinlocks are terrible, and
> > bashing yield like that isn't helpful either.
> >
> > Throttling yield seems like entirely the wrong thing to do. Yes, yield()
> > is poorly defined (strictly speaking UB for anything not FIFO/RR) but
> > making it actively worse doesn't seem helpful.
> >
> > The whole itimer thing is not scalable -- blaming that on yield seems
> > hardly fair.
> >
> > Why not use timer_create(), with CLOCK_THREAD_CPUTIME_ID and
> > SIGEV_SIGNAL instead?
>
> I agree that there are userspace changes we can make to reduce contention
> and prevent future lockups. What that doesn't address is the potential for
> userspace to trigger kernel lockups, maliciously or unintentionally, via
> spamming yield(). This patch series introduces a way to reduce contention
> and risk of userspace-induced lockups regardless of userspace behavior
> -- that's the value proposition.
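
For reference, the userspace-side change suggested above might look roughly
like the sketch below: a profiling timer driven by the calling thread's own
CPU clock instead of the process-wide ITIMER_PROF. The names
arm_thread_prof_timer(), on_sigprof() and prof_ticks are made up for
illustration, and the handler only bumps a counter; this is not our actual
profiling code.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <time.h>

/* Hypothetical counter, present only so the sketch has an observable effect. */
static volatile sig_atomic_t prof_ticks;

static void on_sigprof(int signo)
{
	(void)signo;
	prof_ticks++;
}

/*
 * Arm a periodic timer on the calling thread's CPU-time clock that raises
 * SIGPROF every interval_ns nanoseconds of CPU time consumed by that thread.
 * interval_ns must be in (0, 1e9). Returns 0 on success, -1 on error.
 */
static int arm_thread_prof_timer(long interval_ns, timer_t *out)
{
	struct sigaction sa;
	struct sigevent sev;
	struct itimerspec its;

	if (interval_ns <= 0 || interval_ns >= 1000000000L)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigprof;
	sa.sa_flags = SA_RESTART;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGPROF, &sa, NULL) < 0)
		return -1;

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_SIGNAL;
	sev.sigev_signo = SIGPROF;
	if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, out) < 0)
		return -1;

	memset(&its, 0, sizeof(its));
	its.it_value.tv_nsec = interval_ns;
	its.it_interval.tv_nsec = interval_ns;
	if (timer_settime(*out, 0, &its, NULL) < 0) {
		timer_delete(*out);
		return -1;
	}
	return 0;
}
```

Each thread that wants to be profiled would call arm_thread_prof_timer()
itself. Note that with plain SIGEV_SIGNAL the signal is directed at the
process, not necessarily at the thread whose clock expired; Linux also has
SIGEV_THREAD_ID for targeting a specific thread.
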

At a more basic level, we need to agree that there's a kernel issue here
that should be resolved: userspace potentially being able to trigger a hard
lockup via suboptimal/inappropriate use of syscalls.
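
To make the pattern concrete, here is a bounded sketch of the kind of
userspace behavior in question: ITIMER_PROF is armed so that the process
cputimer is active, and several threads then call sched_yield() in a tight
loop. The helper names (run_yield_storm(), yield_loop(), sigprof_noop()) and
the 10ms interval are made up for illustration. This is not the original
workload and I'm not claiming it reproduces the lockup; the thread and
iteration counts are capped so that it terminates.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* SIGPROF's default action terminates the process, so install a no-op. */
static void sigprof_noop(int signo)
{
	(void)signo;
}

struct yield_arg {
	unsigned long iters;	/* number of sched_yield() calls to make */
	unsigned long done;	/* number of calls actually made */
};

static void *yield_loop(void *p)
{
	struct yield_arg *a = p;

	for (unsigned long i = 0; i < a->iters; i++) {
		sched_yield();
		a->done++;
	}
	return NULL;
}

/*
 * Arm ITIMER_PROF, then have nthreads threads each call sched_yield()
 * iters times. Returns the total number of yields, or -1 on failure.
 */
static long run_yield_storm(int nthreads, unsigned long iters)
{
	enum { MAX_THREADS = 64 };
	pthread_t tid[MAX_THREADS];
	struct yield_arg arg[MAX_THREADS];
	struct itimerval itv;
	struct sigaction sa;
	long total = 0;
	int started;

	if (nthreads < 1 || nthreads > MAX_THREADS)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigprof_noop;
	sa.sa_flags = SA_RESTART;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGPROF, &sa, NULL) < 0)
		return -1;

	/* Arming ITIMER_PROF is what enables the per-process cputimer. */
	memset(&itv, 0, sizeof(itv));
	itv.it_value.tv_usec = 10000;
	itv.it_interval.tv_usec = 10000;
	if (setitimer(ITIMER_PROF, &itv, NULL) < 0)
		return -1;

	for (started = 0; started < nthreads; started++) {
		arg[started].iters = iters;
		arg[started].done = 0;
		if (pthread_create(&tid[started], NULL, yield_loop,
				   &arg[started]) != 0)
			break;
	}
	for (int i = 0; i < started; i++) {
		pthread_join(tid[i], NULL);
		total += (long)arg[i].done;
	}

	/* Disarm the timer before returning. */
	memset(&itv, 0, sizeof(itv));
	setitimer(ITIMER_PROF, &itv, NULL);

	return started == nthreads ? total : -1;
}
```

For example, run_yield_storm(4, 1000) makes 4000 sched_yield() calls in
total, each of which goes through update_curr() with the cputimer active.
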

Not long ago, there was a similar issue involving getrusage() [1]: a
process with many threads was causing hard lockups when the threads were
calling getrusage() too frequently. You could've said "don't call
getrusage() so much", but that would be addressing a symptom, not the
cause.

Granted, the fix in that case [2] was more elegant and less hacky than
what I'm proposing here, but there are alternative approaches that we can
pursue. We just need to agree that there's a problem in the kernel that
needs to be solved.

[1]: https://lore.kernel.org/all/20240117192534.1327608-1-dylanbhatch@xxxxxxxxxx/
[2]: https://lore.kernel.org/all/20240122155023.GA26169@xxxxxxxxxx/