Re: [PATCH 17/24] sched/fair: Implement delayed dequeue

From: Mike Galbraith
Date: Sat Nov 02 2024 - 00:33:51 EST


On Fri, 2024-11-01 at 16:07 -0400, Phil Auld wrote:


> Thanks for jumping in.  My jargon decoder ring seems to be failing me
> so I'm not completely sure what you are saying below :)
>
> "buddies" you mean tasks that waking each other up and sleeping.
> And one runs for longer than the other, right?

Yeah, buddies are related waker/wakee tasks, 1:1, 1:N or M:N, excluding
tasks that just happen to be sitting on a CPU where, say, a timer
fires, an IRQ leads to a wakeup of lord knows what, lock wakeups etc
etc. I think Peter coined the term buddy to mean exactly that (less
typing), and it stuck.
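
FWIW, the purest form of a 1:1 buddy pair is the classic pipe
ping-pong. A minimal sketch, purely illustrative and not from the
patch set:

/* pingpong.c - two tasks waking each other through a pipe pair,
 * the pure 1:1 buddy case.  Illustrative sketch only. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int ping[2], pong[2];
	char c = 'x';

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return 1;
	}

	if (fork() == 0) {
		/* wakee: sleep until woken, wake the waker, repeat */
		close(ping[1]);
		close(pong[0]);
		while (read(ping[0], &c, 1) == 1)
			write(pong[1], &c, 1);
		_exit(0);
	}

	close(ping[0]);
	close(pong[1]);
	for (int i = 0; i < 1000000; i++) {
		/* waker: wake the wakee, then sleep until it answers */
		write(ping[1], &c, 1);
		read(pong[0], &c, 1);
	}
	return 0;
}

Each side is alternately waker and wakee, and neither makes progress
while its buddy's wakeup sits behind something else.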

> > 1 tbench buddy pair scheduled cross core.
> >
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
> > 13770 root      20   0   21424   1920   1792 S 60.13 0.012   0:33.81 3 tbench
> > 13771 root      20   0    4720    896    768 S 46.84 0.006   0:26.10 2 tbench_srv
>  
> > Note 60/47 utilization, now pinned/stacked.
> >
> > 6.1.114-cfs
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
> >  4407 root      20   0   21424   1980   1772 R 50.00 0.012   0:29.20 3 tbench
> >  4408 root      20   0    4720    124      0 R 50.00 0.001   0:28.76 3 tbench_srv
>
> What is the difference between these first two?  The first is on
> separate cores so they don't interfere with each other? And the second is
> pinned to the same core?

Yeah, see the 'P' (last-used CPU) column. Given CPU headroom, a tbench
pair can consume ~107%; they're not fully synchronous.. none of this
would be relevant here/now if they were :)

> > Note what happens to the lighter tbench_srv. Consuming red-hot L2 data,
> > it can utilize a full 50%, but it must first preempt its wide bottom buddy.
> >
>
> We've got "light" and "wide" here which is a bit mixed metaphorically
> :)

Wide, skinny, feather-weight or lard-ball, they all work for me.

> So here CFS is letting the wakee preempt the waker and providing pretty
> equal fairness. And hot L2 caching is masking the asymmetry.

No, it's way simpler: preemption slices through the only thing it can
slice through, the post-wakeup concurrent bits.. which otherwise sit
directly in the communication stream as a lump of latency in a
latency-bound operation.
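
Roughly, in scheduler-speak (toy model below; the struct fields and
function are illustrative assumptions of mine, not kernel source): the
wakeup path makes a binary preempt-or-not call, and every "not" parks
the reply behind whatever current still has in flight.

/* Toy EEVDF-flavoured wakeup preemption check.  Field and function
 * names are illustrative assumptions, not the kernel's. */
struct entity {
	long long lag;		/* service owed (+) or over-consumed (-) */
	long long vdeadline;	/* virtual deadline of current slice */
};

static int wakee_preempts(const struct entity *curr,
			  const struct entity *wakee)
{
	/* An ineligible wakee (it over-consumed before sleeping)
	 * waits; delay_dequeue changes how that debt is carried
	 * across a sleep, hence its fingers in this decision. */
	if (wakee->lag < 0)
		return 0;

	/* Otherwise the earlier virtual deadline wins the CPU.  A
	 * "no" here means the reply sits in the runqueue as pure
	 * added turnaround in a latency-bound round trip. */
	return wakee->vdeadline < curr->vdeadline;
}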

>
> With wakeup preemption off it doesn't help in my case. I was thinking
> maybe the preemption was preventing some batching of IO completions
> or initiations. But that was wrong, it seems.

Dunno.

> Does it also possibly make wakeup migration less likely and thus increase
> stacking?

The buddy being preempted certainly won't be wakeup migrated, because
it won't sleep. Two very sleepy tasks when bandwidth constrained become
one 100% hog and one 99.99% hog when CPU constrained.
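
The shape of the thing (stand-in types below, not kernel source):
placement is only computed on a sleep-to-runnable transition, so a
buddy that is preempted rather than sleeping never gets another
placement decision.

/* Stand-in types; the point is the control flow, not the names. */
struct task {
	int sleeping;	/* blocked, vs. merely preempted */
	int cpu;	/* current placement */
};

static int pick_wakeup_cpu(struct task *p)
{
	return p->cpu;	/* placeholder for the idle-CPU search */
}

static void wake_task(struct task *p)
{
	/* A preempted buddy is still runnable: it never enters the
	 * wakeup path, so no new placement is ever considered and
	 * the stacked pair stays stacked. */
	if (!p->sleeping)
		return;

	p->cpu = pick_wakeup_cpu(p);	/* only sleepers get here */
	p->sleeping = 0;
	/* ... enqueue on p->cpu, maybe preempt its current task ... */
}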

> > Bottom line, box full of 1:1 buddies pairing up and stacking in L2.
> >
> > tbench 8
> > 6.1.114-cfs      3674.37 MB/sec
> > 6.1.114-eevdf    3505.25 MB/sec -delay_dequeue
> >                  3701.66 MB/sec +delay_dequeue
> >
> > For tbench, preemption = shorter turnaround = higher throughput.
>
> So here you have a benchmark that gets a ~5% boost from
> delay_dequeue.
>
> But I've got one that gets a 20% penalty, so I'm not exactly sure what
> to make of that. Clearly FIO does not have the same pattern as tbench.

There are basically two options in sched-land: shave fastpath cycles,
or some variant of robbing Peter to pay Paul ;-)

-Mike