Re: [PATCH 2/2] sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals

From: Peter Zijlstra

Date: Thu Oct 30 2025 - 05:11:04 EST

On Mon, Oct 27, 2025 at 01:39:15PM +0000, Mel Gorman wrote:
> +static inline enum preempt_wakeup_action
> +__do_preempt_buddy(struct rq *rq, struct cfs_rq *cfs_rq, int wake_flags,
> + struct sched_entity *pse, struct sched_entity *se)
> +{
> + bool pse_before;
> +
> + /*
> + * Ignore wakee preemption on WF_FORK as it is less likely that
> + * there is shared data as exec often follow fork. Do not
> + * preempt for tasks that are sched_delayed as it would violate
> + * EEVDF to forcibly queue an ineligible task.
> + */
> + if (!sched_feat(NEXT_BUDDY) ||

This seems wrong: it would mean wakeup preemption gets killed the
moment you disable NEXT_BUDDY. That can't be right.

> + (wake_flags & WF_FORK) ||
> + (pse->sched_delayed)) {
> + return PREEMPT_WAKEUP_NONE;
> + }
> +
> + /* Reschedule if waker is no longer eligible. */
> + if (!entity_eligible(cfs_rq, se))
> + return PREEMPT_WAKEUP_RESCHED;

That comment isn't accurate unless you add: && in_task(). That is, if
an interrupt is doing the wakeup, the waker has nothing to do with
current.
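
One way to read that, as a standalone userspace sketch rather than kernel
code: struct wakeup_model and waker_ineligible() are hypothetical names,
and the two fields only stand in for in_task() and
entity_eligible(cfs_rq, se). Whether the condition or just the comment
should change is left open here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the real check would read. */
struct wakeup_model {
	bool in_task;		/* models in_task(): wakeup issued from task context */
	bool curr_eligible;	/* models entity_eligible(cfs_rq, se) for current */
};

/*
 * "Waker is no longer eligible" only describes current when current
 * is actually the waker, i.e. the wakeup did not come from an interrupt.
 */
static inline bool waker_ineligible(const struct wakeup_model *m)
{
	return m->in_task && !m->curr_eligible;
}
```

With in_task false the helper never reports an ineligible waker, whatever
the eligibility of current is.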

> + /*
> + * Keep existing buddy if the deadline is sooner than pse.
> + * The downside is that the older buddy may be cache cold
> + * but that is unpredictable whereas an earlier deadline
> + * is absolute.
> + */
> + if (cfs_rq->next && entity_before(cfs_rq->next, pse))
> + return PREEMPT_WAKEUP_NONE;

But if previously we set next and didn't preempt, we should try again,
maybe it has more success now. That is, should this not be _NEXT?

> +
> + set_next_buddy(pse);
> +
> + /*
> + * WF_SYNC|WF_TTWU indicates the waker expects to sleep but it is not
> + * strictly enforced because the hint is either misunderstood or
> + * multiple tasks must be woken up.
> + */
> + pse_before = entity_before(pse, se);
> + if (wake_flags & WF_SYNC) {
> + u64 delta = rq_clock_task(rq) - se->exec_start;
> + u64 threshold = sysctl_sched_migration_cost;
> +
> + /*
> + * WF_SYNC without WF_TTWU is not expected so warn if it
> + * happens even though it is likely harmless.
> + */
> + WARN_ON_ONCE(!(wake_flags | WF_TTWU));

s/|/&/ ?
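
To spell that out with a standalone userspace sketch: the flag values
below are placeholders for illustration (the kernel defines the real ones
in its scheduler headers), and warn_as_posted() / warn_as_suggested() are
hypothetical helper names, not part of the patch.

```c
#include <stdbool.h>

/* Placeholder values for illustration only. */
#define WF_TTWU	0x08
#define WF_SYNC	0x10

/*
 * Condition as posted: OR-ing in a non-zero constant can never yield
 * zero, so the negation is always false and the warning never fires.
 */
static inline bool warn_as_posted(int wake_flags)
{
	return !(wake_flags | WF_TTWU);
}

/* Condition with s/|/&/ applied: true exactly when WF_TTWU is clear. */
static inline bool warn_as_suggested(int wake_flags)
{
	return !(wake_flags & WF_TTWU);
}
```

For wake_flags == WF_SYNC, the posted form evaluates to false while the
'&' form evaluates to true, which is the case the comment says it wants
to catch.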

> + if ((s64)delta < 0)
> + delta = 0;
> +
> + /*
> + * WF_RQ_SELECTED implies the tasks are stacking on a
> + * CPU when they could run on other CPUs. Reduce the
> + * threshold before preemption is allowed to an
> + * arbitrary lower value as it is more likely (but not
> + * guaranteed) the waker requires the wakee to finish.
> + */
> + if (wake_flags & WF_RQ_SELECTED)
> + threshold >>= 2;
> +
> + /*
> + * As WF_SYNC is not strictly obeyed, allow some runtime for
> + * batch wakeups to be issued.
> + */
> + if (pse_before && delta >= threshold)
> + return PREEMPT_WAKEUP_RESCHED;
> +
> + return PREEMPT_WAKEUP_NONE;
> + }
> +
> + return PREEMPT_WAKEUP_NEXT;
> +}

Add to this that AFAICT your patch ends up doing:

__pick_eevdf(.protect = false) == pse

which unconditionally disables the slice protection feature.