Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

From: Mel Gorman

Date: Thu Nov 13 2025 - 04:10:34 EST


On Wed, Nov 12, 2025 at 03:48:23PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 12:25:21PM +0000, Mel Gorman wrote:
>
> > + /* Prefer picking wakee soon if appropriate. */
> > + if (sched_feat(NEXT_BUDDY) &&
> > + set_preempt_buddy(cfs_rq, wake_flags, pse, se)) {
> > +
> > + /*
> > + * Decide whether to obey WF_SYNC hint for a new buddy. Old
> > + * buddies are ignored as they may not be relevant to the
> > + * waker and less likely to be cache hot.
> > + */
> > + if (wake_flags & WF_SYNC)
> > + preempt_action = preempt_sync(rq, wake_flags, pse, se);
> > + }
>
> Why only do preempt_sync() when NEXT_BUDDY? Nothing there seems to
> depend on buddies.

There isn't a direct relation, but there is an indirect one. I know from
your previous review that you separated out the WF_SYNC but after a while,
I did not find a good reason to separate it completely from NEXT_BUDDY.

NEXT_BUDDY updates cfs_rq->next if appropriate to indicate there is a
waker/wakee relationship between two tasks that potentially share data
which may still be cache resident after a context switch. WF_SYNC indicates
there may be a strict relationship between those two tasks in that the waker
may need the wakee to do some work before it can make progress. If NEXT_BUDDY
does not set cfs_rq->next in the current waking context then the wakee may
only be picked next by coincidence under normal EEVDF rules.

WF_SYNC could still reschedule if the wakee is not selected as a buddy but
the benefit, if any, would be marginal -- if the waker does not go to sleep
then the WF_SYNC contract is violated and if the wakee only runs after a
wakeup delay then the shared data may already be evicted from cache.
With NEXT_BUDDY, there is a chance that the cost of a reschedule and/or
a context switch will be offset by reduced overall latency (e.g. fewer
cache misses). Without NEXT_BUDDY, WF_SYNC may only incur costs due to
context switching.

I considered the possibility of WF_SYNC being applied if pse is already a
buddy due to yield or some other factor but there is no reason to assume
any shared data is still cache resident and it's not easy to reason about. I
considered applying WF_SYNC if pse was already set and using it as a two-pass
filter but again, there is no obvious benefit and no reason why the second
wakeup is more important than the first wakeup. I considered WF_SYNC being
applied if any buddy is set but it's not clear why a SYNC wakeup between
tasks A,B should instead pick C to run ASAP outside of the normal EEVDF rules.

I think it's straight-forward if the logic is

o If NEXT_BUDDY sets the wakee as cfs_rq->next then
schedule the wakee soon
o If the wakee is to be selected soon and WF_SYNC is also set then
pick the wakee ASAP

but less straight-forward if

o If WF_SYNC is set, reschedule now and maybe the wakee will be
picked, maybe the waker will run again, maybe something else
will run and sometimes it'll be a gain overall.
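
As a rough illustration of the two policies above, here is a userspace
sketch rather than kernel code. The enum, the function names and the boolean
parameters are all hypothetical and do not come from the patch, and the
"pick ASAP" outcome simplifies away whatever further checks preempt_sync()
makes before obeying the hint.

```c
#include <stdbool.h>

/* Hypothetical outcome names for this sketch; not the patch's types. */
enum wakeup_action {
	ACTION_NONE,		/* leave the pick to normal EEVDF rules */
	ACTION_PICK_SOON,	/* wakee is cfs_rq->next, pick it soon */
	ACTION_PICK_ASAP,	/* reschedule now in favour of the wakee */
	ACTION_RESCHED_BLIND,	/* reschedule now, next task is unknown */
};

/*
 * First policy: WF_SYNC is only consulted once the wakee has become
 * the buddy during this wakeup.
 */
static enum wakeup_action sync_gated_on_buddy(bool next_buddy_enabled,
					      bool wakee_became_next,
					      bool wf_sync)
{
	if (!next_buddy_enabled || !wakee_became_next)
		return ACTION_NONE;

	return wf_sync ? ACTION_PICK_ASAP : ACTION_PICK_SOON;
}

/*
 * Second policy: WF_SYNC acts on its own. Without a buddy, the
 * reschedule happens but nothing says the wakee runs next.
 */
static enum wakeup_action sync_independent(bool next_buddy_enabled,
					   bool wakee_became_next,
					   bool wf_sync)
{
	bool buddy = next_buddy_enabled && wakee_became_next;

	if (wf_sync)
		return buddy ? ACTION_PICK_ASAP : ACTION_RESCHED_BLIND;

	return buddy ? ACTION_PICK_SOON : ACTION_NONE;
}
```

The two functions only differ when WF_SYNC is set and the wakee did not
become cfs_rq->next, which is the case where the outcome of a reschedule is
hard to predict.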

--
Mel Gorman
SUSE Labs