Re: [PATCH] kernel/sched/fair: Fix to not require calculation for the weight nice0

From: Peter Zijlstra

Date: Fri May 29 2026 - 08:04:43 EST

On Fri, May 29, 2026 at 07:37:07AM +0000, Hongyan Xia wrote:
> On 5/29/2026 10:34 AM, Li kunyu wrote:
> > Typically, the default priority for client tasks is nice0, and reducing
> > the conversion of virtual runtime to real time for nice0 tasks can
> > significantly reduce unnecessary computations.
> >
> > Signed-off-by: Li kunyu <likunyu10@xxxxxxx>
> > ---
> > kernel/sched/fair.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 69361c63353a..74d1c77a8bcf 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7033,7 +7033,10 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
> > resched_curr(rq);
> > return;
> > }
> > - delta = (se->load.weight * vdelta) / NICE_0_LOAD;
> > + if (unlikely(se->load.weight != NICE_0_LOAD))
> > + delta = (se->load.weight * vdelta) / NICE_0_LOAD;
> > + else
> > + delta = vdelta;
> >
> > /*
> > * Correct for instantaneous load of other classes.
>
> Given NICE_0_LOAD is a nice power-of-two which compiles down to just a
> bit shift, it seems interesting that you would find the multiplication
> to be 'significant unnecessary computations'. Do you have any data to
> support this?

Notably, a mispredicted branch can be many times more expensive than a
mult on modern deeply pipelined machines. Divisions are a bit of a mixed
bag, but mult is generally dirt cheap.
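
For reference, a userspace sketch of the two forms being compared. This is
not the kernel code: the names are made up, the types are unsigned for
simplicity, and the nice-0 weight is assumed to be 1 << 10 (the kernel's
actual value depends on the configuration). It only shows that both forms
give the same result, so the branch buys nothing but a skipped multiply
and shift.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration only: a power-of-two nice-0 weight. */
#define SKETCH_NICE_0_LOAD (UINT64_C(1) << 10)

/*
 * Existing form: one multiply, then a divide by a power-of-two constant,
 * which a compiler can turn into a shift for unsigned operands.
 */
static uint64_t delta_branchless(uint64_t weight, uint64_t vdelta)
{
	return (weight * vdelta) / SKETCH_NICE_0_LOAD;
}

/*
 * Form proposed in the patch: skip the arithmetic for nice-0 weight,
 * at the cost of a compare and a conditional branch.
 */
static uint64_t delta_branched(uint64_t weight, uint64_t vdelta)
{
	if (weight != SKETCH_NICE_0_LOAD)
		return (weight * vdelta) / SKETCH_NICE_0_LOAD;
	return vdelta;
}

/* Both forms must agree for nice-0 and non-nice-0 weights alike. */
static void delta_forms_self_check(void)
{
	assert(delta_branchless(SKETCH_NICE_0_LOAD, 5000) == 5000);
	assert(delta_branched(SKETCH_NICE_0_LOAD, 5000) == 5000);
	assert(delta_branchless(2048, 5000) == delta_branched(2048, 5000));
	assert(delta_branchless(512, 5000) == delta_branched(512, 5000));
}
```

Whether the branched form is ever faster would have to be measured; the
sketch only establishes that it is not needed for correctness.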

According to Gemini we have something like so:

+-------------------+----------------------------+-------------------------+------------------------+
| ARCHITECTURE      | BRANCH MISPREDICT PENALTY  | 64-BIT INTEGER MULTIPLY | 64-BIT INTEGER DIVIDE  |
| (Modern Cores)    | (Clock Cycles)             | (Latency / Throughput)  | (Latency / Throughput) |
+-------------------+----------------------------+-------------------------+------------------------+
| Apple M-Series    | 16 to 20 cycles            | 3 to 4 cycles           | 7 to 9 cycles          |
| (M1 through M5)   |                            | 0.5 cycle thr. (2/clk)  | 2 cycles throughput    |
+-------------------+----------------------------+-------------------------+------------------------+
| Intel Core        | 14 to 15 cycles *          | 3 cycles                | 18 to 25 cycles        |
| (Panther / Arrow) |                            | 1 cycle thr. (1/clk)    | 10 to 15 cycles thr.   |
+-------------------+----------------------------+-------------------------+------------------------+
| AMD Zen           | 17 to 20 cycles            | 3 cycles                | 12 to 14 cycles        |
| (Zen 4 / Zen 5)   |                            | 1 cycle thr. (1/clk)    | 3 to 4 cycles thr.     |
+-------------------+----------------------------+-------------------------+------------------------+
So the branch in calc_delta_fair() might still be justified, esp. if it
is predicted well. But like Hongyan noted, not in this case, where the
branch would only save a multiply and a shift.
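
To make the contrast concrete, a simplified sketch of the kind of
conversion where such a branch can pay off: scaling a delta by
NICE_0_LOAD / weight, where the divisor is a runtime value rather than a
constant. The names and the 1 << 10 weight are assumptions for
illustration, and the plain division stands in for the kernel's real
helper, which is more involved.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration only: a power-of-two nice-0 weight. */
#define SKETCH_NICE_0_LOAD (UINT64_C(1) << 10)

/*
 * Scale a delta by NICE_0_LOAD / weight. The divisor is only known at
 * run time, so the compiler cannot reduce the division to a shift.
 * Skipping it for the common nice-0 weight avoids a real divide.
 */
static uint64_t sketch_scale_by_inverse_weight(uint64_t delta, uint64_t weight)
{
	if (weight != SKETCH_NICE_0_LOAD)
		delta = (delta * SKETCH_NICE_0_LOAD) / weight;
	return delta;
}

/* Nice-0 weight leaves the delta untouched; other weights rescale it. */
static void sketch_scale_self_check(void)
{
	assert(sketch_scale_by_inverse_weight(3000, SKETCH_NICE_0_LOAD) == 3000);
	assert(sketch_scale_by_inverse_weight(3000, 2048) == 1500);
	assert(sketch_scale_by_inverse_weight(3000, 512) == 6000);
}
```

Here the skipped work is a division by a variable, which the table above
prices well above a multiply, so a well-predicted branch can come out
ahead. In the patched line the skipped work is a multiply and a shift.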