Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

From: Ankur Arora
Date: Sat Jun 01 2024 - 07:48:23 EST



Shrikanth Hegde <sshegde@xxxxxxxxxxxxx> writes:

> On 5/28/24 6:04 AM, Ankur Arora wrote:
>> Hi,
>>
>> This series adds a new scheduling model PREEMPT_AUTO, which like
>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>> on explicit preemption points for the voluntary models.
>>
>> The series is based on Thomas' original proposal which he outlined
>> in [1], [2] and in his PoC [3].
>>
>> v2 mostly reworks v1, with one of the main changes having less
>> noisy need-resched-lazy related interfaces.
>> More details in the changelog below.
>>
>
> Hi Ankur. Thanks for the series.
>
> nit: had to manually patch 11,12,13 since it didnt apply cleanly on
> tip/master and tip/sched/core. Mostly due some word differences in the change.
>
> tip/master was at:
> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
> Merge: 5d145493a139 47ff30cc1be7
> Author: Ingo Molnar <mingo@xxxxxxxxxx>
> Date: Tue May 28 12:44:26 2024 +0200
>
> Merge branch into tip/master: 'x86/percpu'
>
>
>
>> The v1 of the series is at [4] and the RFC at [5].
>>
>> Design
>> ==
>>
>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
>> PREEMPT_COUNT). This means that the scheduler can always safely
>> preempt. (This is identical to CONFIG_PREEMPT.)
>>
>> Having that, the next step is to make the rescheduling policy dependent
>> on the chosen scheduling model. Currently, the scheduler uses a single
>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
>> reschedule is needed.
>> PREEMPT_AUTO extends this by adding an additional need-resched bit
>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
>> scheduler to express two kinds of rescheduling intent: schedule at
>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
>> rescheduling while allowing the task on the runqueue to run to
>> timeslice completion (TIF_NEED_RESCHED_LAZY).
>>
>> The scheduler decides which need-resched bits are chosen based on
>> the preemption model in use:
>>
>> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>>
>> none never always [*]
>> voluntary higher sched class other tasks [*]
>> full always never
>>
>> [*] some details elided.
>>
>> The last part of the puzzle is, when does preemption happen, or
>> alternately stated, when are the need-resched bits checked:
>>
>> exit-to-user ret-to-kernel preempt_count()
>>
>> NEED_RESCHED_LAZY Y N N
>> NEED_RESCHED Y Y Y
>>
>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
>> none/voluntary preemption policies are in effect. And eager semantics
>> under full preemption.
>>
>> In addition, since this is driven purely by the scheduler (not
>> depending on cond_resched() placement and the like), there is enough
>> flexibility in the scheduler to cope with edge cases -- ex. a kernel
>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
>> simply upgrading to a full NEED_RESCHED which can use more coercive
>> instruments like resched IPI to induce a context-switch.
>>
>> Performance
>> ==
>> The performance in the basic tests (perf bench sched messaging, kernbench,
>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
>> (See patches
>> "sched: support preempt=none under PREEMPT_AUTO"
>> "sched: support preempt=full under PREEMPT_AUTO"
>> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>>
>> For a macro test, a colleague in Oracle's Exadata team tried two
>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
>> backported.)
>>
>> In both tests the data was cached on remote nodes (cells), and the
>> database nodes (compute) served client queries, with clients being
>> local in the first test and remote in the second.
>>
>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>>
>>
>> PREEMPT_VOLUNTARY PREEMPT_AUTO
>> (preempt=voluntary)
>> ============================== =============================
>> clients throughput cpu-usage throughput cpu-usage Gain
>> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
>> ------- ---------- ----------------- ---------- ----------------- -------
>>
>>
>> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
>> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
>> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>>
>>
>> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
>> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
>> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
>> 90/10 RW ratio)
>>
>>
>> (Both sets of tests have a fair amount of NW traffic since the query
>> tables etc are cached on the cells. Additionally, the first set,
>> given the local clients, stress the scheduler a bit more than the
>> second.)
>>
>> The comparative performance for both the tests is fairly close,
>> more or less within a margin of error.
>>
>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>>
>> "
>> a) Base kernel (6.7),
>> b) v1, PREEMPT_AUTO, preempt=voluntary
>> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
>> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>>
>> Workloads I tested and their %gain,
>> case b case c case d
>> NAS +2.7% +1.9% +2.1%
>> Hashjoin, +0.0% +0.0% +0.0%
>> Graph500, -6.0% +0.0% +0.0%
>> XSBench +1.7% +0.0% +1.2%
>>
>> (Note about the Graph500 numbers at [8].)
>>
>> Did kernbench etc test from Mel's mmtests suite also. Did not notice
>> much difference.
>> "
>>
>> One case where there is a significant performance drop is on powerpc,
>> seen running hackbench on a 320 core system (a test on a smaller system is
>> fine.) In theory there's no reason for this to only happen on powerpc
>> since most of the code is common, but I haven't been able to reproduce
>> it on x86 so far.
>>
>> All in all, I think the tests above show that this scheduling model has legs.
>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>> different enough from the current none/voluntary models that there
>> likely are workloads where performance would be subpar. That needs more
>> extensive testing to figure out the weak points.
>>
>>
>>
> Did test it again on PowerPC. Unfortunately numbers shows there is regression
> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the
> smaller system too to confirm. For now I have done the comparison for the hackbench
> where highest regression was seen in v1.
>
> perf stat collected for 20 iterations show higher context switch and higher migrations.
> Could it be that LAZY bit is causing more context switches? or could it be something
> else? Could it be that more exit-to-user happens in PowerPC? will continue to debug.

Thanks for trying it out.

As you point out, context-switches and migrations are signficantly higher.

Definitely unexpected. I ran the same test on an x86 box
(Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.

6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )

6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )

Clearly there's something different going on powerpc. I'm travelling
right now, but will dig deeper into this once I get back.

Meanwhile can you check if the increased context-switches are voluntary or
involuntary (or what the division is)?


Thanks
Ankur

> Meanwhile, will do more test with other micro-benchmarks and post the results.
>
>
> More details below.
> CONFIG_HZ = 100
> ./hackbench -pipe 60 process 100000 loops
>
> ====================================================================================
> On the larger system. (40 Cores, 320CPUS)
> ====================================================================================
> 6.10-rc1 +preempt_auto
> preempt=none preempt=none
> 20 iterations avg value
> hackbench pipe(60) 26.403 32.368 ( -31.1%)
>
> ++++++++++++++++++
> baseline 6.10-rc1:
> ++++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
> 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
> 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
> 1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
> 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
> 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )
>
> 26.403 +- 0.166 seconds time elapsed ( +- 0.63% )
>
> ++++++++++++
> preempt auto
> ++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
> 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
> 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
> 1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
>
> 32.368 +- 0.200 seconds time elapsed ( +- 0.62% )
>
>
> ============================================================================================
> Smaller system ( 12Cores, 96CPUS)
> ============================================================================================
> 6.10-rc1 +preempt_auto
> preempt=none preempt=none
> 20 iterations avg value
> hackbench pipe(60) 55.930 65.75 ( -17.6%)
>
> ++++++++++++++++++
> baseline 6.10-rc1:
> ++++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
> 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
> 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
> 1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
> 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
> 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )
>
> 55.930 +- 0.509 seconds time elapsed ( +- 0.91% )
>
>
> +++++++++++++++++
> v2_preempt_auto
> +++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
> 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
> 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
> 1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
> 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
> 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )
>
> 65.75 +- 1.06 seconds time elapsed ( +- 1.61% )

So, the context-switches are meaningfully higher.

--
ankur