Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

From: Mike Galbraith
Date: Sat Apr 20 2024 - 01:58:16 EST


(removes apparently busted bytedance.com address and retries xmit)

Greetings!

With this version, the better CPU distribution (for tbench synchronous
net-blasting) closed the CFS vs EEVDF throughput deficit. I verified
this both by rolling the previous version forward and by back-porting
to 6.1, where I've got both CFS and EEVDF to re-compare, now with both
versions of the dequeue delay patch.

As usual, there will be winners and losers, but (modulo dead buglet) it
looks kinda promising to me.

Distribution of single pinned buddy pair measured in master:
DELAY_DEQUEUE
---------------------------------------------------------------------------------------------
Task           | Runtime ms   | Switches | Avg delay ms  | Max delay ms  | Sum delay ms     |
---------------------------------------------------------------------------------------------
tbench:(2)     |  6277.099 ms |  1597104 | avg: 0.003 ms | max: 0.129 ms | sum: 4334.723 ms |
tbench_srv:(2) |  5724.971 ms |  1682629 | avg: 0.001 ms | max: 0.083 ms | sum: 2076.616 ms |
---------------------------------------------------------------------------------------------
TOTAL:         | 12021.128 ms |  3280275 |               |      1.729 ms |      6425.483 ms |
---------------------------------------------------------------------------------------------
client/server CPU distribution ~52%/48%

NO_DELAY_DEQUEUE
---------------------------------------------------------------------------------------------
Task           | Runtime ms   | Switches | Avg delay ms  | Max delay ms  | Sum delay ms     |
---------------------------------------------------------------------------------------------
tbench:(2)     |  6724.774 ms |  1546761 | avg: 0.002 ms | max: 0.409 ms | sum: 2443.549 ms |
tbench_srv:(2) |  5275.329 ms |  1571688 | avg: 0.002 ms | max: 0.086 ms | sum: 2734.151 ms |
---------------------------------------------------------------------------------------------
TOTAL:         | 12019.641 ms |  3119000 |               |   9996.367 ms |     15187.833 ms |
---------------------------------------------------------------------------------------------
client/server CPU distribution ~56%/44%
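
For anyone wanting to re-derive the distribution figures: a minimal
sketch, assuming the client/server split is simply each side's share of
the pair's combined Runtime (that reading is an assumption, the formula
is not spelled out here). The names RUNTIME_MS and cpu_split are made up
for illustration; the numbers are copied from the Runtime columns above.

```python
# Runtime (ms) of the pinned pair, copied from the Runtime columns above.
RUNTIME_MS = {
    "DELAY_DEQUEUE": {"tbench": 6277.099, "tbench_srv": 5724.971},
    "NO_DELAY_DEQUEUE": {"tbench": 6724.774, "tbench_srv": 5275.329},
}


def cpu_split(client_ms, server_ms):
    """Return (client %, server %) of the pair's combined runtime, rounded.

    Assumption: "CPU distribution" means share of client + server runtime,
    not share of the TOTAL row (which includes other tasks).
    """
    total = client_ms + server_ms
    return round(100 * client_ms / total), round(100 * server_ms / total)


for feature, rt in RUNTIME_MS.items():
    client, server = cpu_split(rt["tbench"], rt["tbench_srv"])
    print(f"{feature}: client/server ~{client}%/{server}%")
```

Under that assumption the sketch gives 52/48 and 56/44, matching the
figures quoted under each table.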

Note the switches and delay sums. For tbench, they translate directly
to throughput. The other shoe lands with async CPU hog net-blasters;
for those, scheduler cycles tend to be wasted cycles.

-Mike