Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling
From: Shrikanth Hegde
Date: Tue Apr 23 2024 - 11:24:11 EST
On 2/13/24 11:25 AM, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> An earlier RFC version is at [4].
>
Hi Ankur/Thomas,
Thank you for this series and the previous ones.
This is a very interesting patch series, and the discussions around it are
even more interesting. I have been going through them to understand the different bits.
I tried this series on PowerPC by defining TIF_NEED_RESCHED_LAZY, similar to x86. The change is below.
I kept the preemption model at none under PREEMPT_AUTO.
I am running into a soft lockup on a large system (40 cores, SMT8) and seeing close to a 100%
hackbench regression in the pipe tests on a small system (12 cores, SMT8). More details follow the patch.
Are these the only arch bits that need to be defined? Am I missing something very
basic here? I will try to debug this further. Any input is welcome.
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/thread_info.h | 4 +++-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..11e7008f5dd3 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -268,6 +268,7 @@ config PPC
select HAVE_PERF_EVENTS_NMI if PPC64
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+ select HAVE_PREEMPT_AUTO
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE
select HAVE_RSEQ
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 15c5691dd218..c28780443b3b 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -117,11 +117,13 @@ void arch_setup_new_exec(void);
#endif
#define TIF_POLLING_NRFLAG 19 /* true if poll_idle() is polling TIF_NEED_RESCHED */
#define TIF_32BIT 20 /* 32 bit binary */
+#define TIF_NEED_RESCHED_LAZY 21 /* Lazy rescheduling */
/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_SIGPENDING (1<<TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
#define _TIF_NOTIFY_SIGNAL (1<<TIF_NOTIFY_SIGNAL)
#define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)
#define _TIF_32BIT (1<<TIF_32BIT)
@@ -144,7 +146,7 @@ void arch_setup_new_exec(void);
#define _TIF_USER_WORK_MASK (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
_TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_RESTORE_TM | _TIF_PATCH_PENDING | \
- _TIF_NOTIFY_SIGNAL)
+ _TIF_NOTIFY_SIGNAL | _TIF_NEED_RESCHED_LAZY)
#define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR)
/* Bits in local_flags */
---------------- Smaller system ---------------------------------
NUMA:
NUMA node(s): 5
NUMA node2 CPU(s): 0-7
NUMA node3 CPU(s): 8-31
NUMA node5 CPU(s): 32-39
NUMA node6 CPU(s): 40-47
NUMA node7 CPU(s): 48-95
Hackbench (10 iterations, 10000 loops)
                               6.9    +preempt_auto (=none)  (% change)
Process        10 groups  :   3.00,    3.07 (  -2.33)
Process        20 groups  :   5.47,    5.81 (  -6.22)
Process        30 groups  :   7.78,    8.52 (  -9.51)
Process        40 groups  :  10.16,   11.28 ( -11.02)
Process        50 groups  :  12.37,   13.90 ( -12.37)
Process        60 groups  :  14.58,   16.68 ( -14.40)
Thread         10 groups  :   3.24,    3.28 (  -1.23)
Thread         20 groups  :   5.93,    6.16 (  -3.88)
Process(Pipe)  10 groups  :   1.94,    2.96 ( -52.58)
Process(Pipe)  20 groups  :   2.91,    5.44 ( -86.94)
Process(Pipe)  30 groups  :   4.23,    7.83 ( -85.11)
Process(Pipe)  40 groups  :   5.35,   10.61 ( -98.32)
Process(Pipe)  50 groups  :   6.64,   13.18 ( -98.49)
Process(Pipe)  60 groups  :   7.88,   16.69 (-111.80)
Thread(Pipe)   10 groups  :   1.92,    3.02 ( -57.29)
Thread(Pipe)   20 groups  :   3.25,    5.36 ( -64.92)
------------------- Large systems -------------------------
NUMA:
NUMA node(s): 4
NUMA node2 CPU(s): 0-31
NUMA node3 CPU(s): 32-127
NUMA node6 CPU(s): 128-223
NUMA node7 CPU(s): 224-319
watchdog: BUG: soft lockup - CPU#278 stuck for 26s! [hackbench:7137]
Modules linked in: bonding tls rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink pseries_rng vmx_crypto drm drm_panel_orientation_quirks xfs libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 278 PID: 7137 Comm: hackbench Kdump: loaded Tainted: G L 6.9.0-rc1+ #42
Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1050.00 (NM1050_052) hv:phyp pSeries
NIP: c000000000037fbc LR: c000000000038324 CTR: c0000000001a8548
REGS: c0000003de72fbb8 TRAP: 0900 Tainted: G L (6.9.0-rc1+)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28002222 XER: 20040000
CFAR: 0000000000000000 IRQMASK: 0
GPR00: c000000000038324 c0000003de72fb90 c000000001973e00 c0000003de72fb88
GPR04: 0000000000240080 0000000000000007 0010000000000000 c000000002220090
GPR08: 4000000000000002 0000000000000049 c0000003f1dcff00 0000000000002000
GPR12: c0000000001a8548 c000001fff72d080 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000002002000
GPR24: 0000000000000001 0000000000000000 0000000002802000 0000000000000002
GPR28: 0000000000000003 fcffffffffffffff fcffffffffffffff c0000003f1dcff00
NIP [c000000000037fbc] __replay_soft_interrupts+0x3c/0x154
LR [c000000000038324] arch_local_irq_restore.part.0+0x1cc/0x214
Call Trace:
[c0000003de72fb90] [c000000000038020] __replay_soft_interrupts+0xa0/0x154 (unreliable)
[c0000003de72fd40] [c000000000038324] arch_local_irq_restore.part.0+0x1cc/0x214
[c0000003de72fd90] [c000000000030268] interrupt_exit_user_prepare_main+0x19c/0x274
[c0000003de72fe00] [c0000000000304e0] syscall_exit_prepare+0x1a0/0x1c8
[c0000003de72fe50] [c00000000000cee8] system_call_vectored_common+0x168/0x2ec
+mpe, nick