Re: [RFC PATCH v2 0/7] Defer throttle when task exits to user

From: K Prateek Nayak
Date: Tue Apr 15 2025 - 06:35:35 EST


On 4/15/2025 2:15 PM, K Prateek Nayak wrote:
(+ Sebastian)

Hello Jan,

On 4/15/2025 11:39 AM, Jan Kiszka wrote:
Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h

To improve the reproducibility, I pinned the two tasks to the same CPU
as the bandwidth timer and I could hit this consistently within a few
minutes at the most.

patience. As we have to succeed with at least 3 race conditions in a
row, that is still not bad... But maybe someone has an idea how to
increase probabilities further.

Looking at run.sh, there are only fair tasks with one of them being run
with cfs bandwidth constraints. Are you saying something goes wrong on
PREEMPT_RT as a result of using bandwidth control on fair tasks?

Yes, exactly. Also our in-field workload that triggers (most likely)
this issue is not using RT tasks itself. Only kernel threads are RT here.


What exactly is the symptom you are observing? Does one of the assert()
trip during the run? Do you see a stall logged on dmesg? Can you provide
more information on what to expect in this 2min - 2hr window?

I've just lost my traces from yesterday ("you have 0 minutes to find a
power adapter"), but I got nice RCU stall warnings in the VM, including
backtraces from the involved tasks (minus the read-lock holder IIRC).
Maybe Florian can drop one of his dumps.

So I ran your reproducer on a 2vCPU VM running v6.15-rc1 PREEMPT_RT
and I saw:

    rcu: INFO: rcu_preempt self-detected stall on CPU
    rcu:     0-...!: (15000 ticks this GP) idle=8a74/0/0x1 softirq=0/0 fqs=0
    rcu:     (t=15001 jiffies g=12713 q=24 ncpus=2)
    rcu: rcu_preempt kthread timer wakeup didn't happen for 15000 jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
    rcu:     Possible timer handling issue on cpu=0 timer-softirq=17688
    rcu: rcu_preempt kthread starved for 15001 jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
    rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
    rcu: RCU grace-period kthread stack dump:
    task:rcu_preempt     state:I stack:0     pid:17    tgid:17    ppid:2      task_flags:0x208040 flags:0x00004000
    Call Trace:
     <TASK>
     __schedule+0x401/0x15a0
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? lock_timer_base+0x77/0xb0
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? __pfx_rcu_gp_kthread+0x10/0x10
     schedule+0x27/0xd0
     schedule_timeout+0x76/0x100
     ? __pfx_process_timeout+0x10/0x10
     rcu_gp_fqs_loop+0x10a/0x4b0
     rcu_gp_kthread+0xd3/0x160
     kthread+0xff/0x210
     ? rt_spin_lock+0x3c/0xc0
     ? __pfx_kthread+0x10/0x10
     ret_from_fork+0x34/0x50
     ? __pfx_kthread+0x10/0x10
     ret_from_fork_asm+0x1a/0x30
     </TASK>
    CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-rc1-test-dirty #746 PREEMPT_{RT,(full)}
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    RIP: 0010:pv_native_safe_halt+0xf/0x20
    Code: 22 df e9 1f 08 e5 fe 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 85 96 15 00 fb f4 <e9> f7 07 e5 fe 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
    RSP: 0018:ffffffff95803e50 EFLAGS: 00000216
    RAX: ffff8e2d61534000 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000081f8a6c
    RBP: ffffffff9581d280 R08: 0000000000000000 R09: ffff8e2cf7d32301
    R10: ffff8e2be11ae5c8 R11: 0000000000000001 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000147b0
    FS:  0000000000000000(0000) GS:ffff8e2d61534000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055e77c3a5128 CR3: 000000010ff78003 CR4: 0000000000770ef0
    PKRU: 55555554
    Call Trace:
     <TASK>
     default_idle+0x9/0x20
     default_idle_call+0x30/0x100
     do_idle+0x20f/0x250
     ? do_idle+0xb/0x250
     cpu_startup_entry+0x29/0x30
     rest_init+0xde/0x100
     start_kernel+0x733/0xb20
     ? copy_bootdata+0x9/0xb0
     x86_64_start_reservations+0x18/0x30
     x86_64_start_kernel+0xba/0x110
     common_startup_64+0x13e/0x141
     </TASK>

Is this in line with what you are seeing?

These are the backtrace for timer and the individual
epoll-stall threads:

[ 539.155042] task:ktimers/1 state:D stack:0 pid:31 tgid:31 ppid:2 task_flags:0x4208040 flags:0x00004000
[ 539.155047] Call Trace:
[ 539.155049] <TASK>
[ 539.155051] __schedule+0x401/0x15a0
[ 539.155055] ? srso_alias_return_thunk+0x5/0xfbef5
[ 539.155059] ? propagate_entity_cfs_rq+0x115/0x290
[ 539.155063] ? srso_alias_return_thunk+0x5/0xfbef5
[ 539.155067] ? srso_alias_return_thunk+0x5/0xfbef5
[ 539.155070] ? rt_mutex_setprio+0x1c2/0x480
[ 539.155075] schedule_rtlock+0x1e/0x40
[ 539.155078] rtlock_slowlock_locked+0x20e/0xc60
[ 539.155088] ? srso_alias_return_thunk+0x5/0xfbef5
[ 539.155093] rt_read_lock+0x8f/0x190
[ 539.155099] ep_poll_callback+0x37/0x2b0
[ 539.155105] __wake_up_common+0x78/0xa0
[ 539.155110] timerfd_tmrproc+0x43/0x60
[ 539.155114] ? __pfx_timerfd_tmrproc+0x10/0x10
[ 539.155116] __hrtimer_run_queues+0xfd/0x2e0
[ 539.155124] hrtimer_run_softirq+0x9d/0xf0
[ 539.155128] handle_softirqs.constprop.0+0xc1/0x2a0
[ 539.155134] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 539.155139] run_ktimerd+0x3e/0x80
[ 539.155142] smpboot_thread_fn+0xf3/0x220
[ 539.155147] kthread+0xff/0x210
[ 539.155151] ? rt_spin_lock+0x3c/0xc0
[ 539.155155] ? __pfx_kthread+0x10/0x10
[ 539.155159] ret_from_fork+0x34/0x50
[ 539.155165] ? __pfx_kthread+0x10/0x10
[ 539.155168] ret_from_fork_asm+0x1a/0x30
[ 539.155176] </TASK>

[ 557.323846] task:epoll-stall state:D stack:0 pid:885 tgid:885 ppid:1 task_flags:0x400000 flags:0x00004002
[ 557.323848] Call Trace:
[ 557.323849] <TASK>
[ 557.323851] __schedule+0x401/0x15a0
[ 557.323853] ? rt_write_lock+0x108/0x260
[ 557.323858] schedule_rtlock+0x1e/0x40
[ 557.323860] rt_write_lock+0xaa/0x260
[ 557.323864] do_epoll_wait+0x21f/0x4a0
[ 557.323869] ? __pfx_ep_autoremove_wake_function+0x10/0x10
[ 557.323872] __x64_sys_epoll_wait+0x63/0x100
[ 557.323876] do_syscall_64+0x6f/0x120
[ 557.323879] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323881] ? ksys_read+0x6b/0xe0
[ 557.323883] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323885] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 557.323887] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323889] ? do_syscall_64+0x7b/0x120
[ 557.323890] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323892] ? ep_send_events+0x26d/0x2b0
[ 557.323896] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323898] ? do_epoll_wait+0x17e/0x4a0
[ 557.323900] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323902] ? __rseq_handle_notify_resume+0xa7/0x500
[ 557.323905] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323907] ? aa_file_perm+0x123/0x4e0
[ 557.323911] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323912] ? get_nohz_timer_target+0x2a/0x180
[ 557.323914] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323916] ? _copy_to_iter+0xa3/0x630
[ 557.323920] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323922] ? timerqueue_add+0x6a/0xc0
[ 557.323924] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323926] ? hrtimer_start_range_ns+0x2e7/0x4a0
[ 557.323931] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323932] ? timerfd_read_iter+0x141/0x2b0
[ 557.323934] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323936] ? security_file_permission+0x123/0x140
[ 557.323940] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323942] ? vfs_read+0x264/0x340
[ 557.323946] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323948] ? ksys_read+0x6b/0xe0
[ 557.323950] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323951] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 557.323953] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323955] ? do_syscall_64+0x7b/0x120
[ 557.323957] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323958] ? syscall_exit_to_user_mode+0x168/0x1a0
[ 557.323960] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323962] ? do_syscall_64+0x7b/0x120
[ 557.323963] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323965] ? do_syscall_64+0x7b/0x120
[ 557.323967] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 557.323968] RIP: 0033:0x7f0371fd3dea
[ 557.323970] RSP: 002b:00007ffd1062cd68 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
[ 557.323971] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0371fd3dea
[ 557.323972] RDX: 0000000000000001 RSI: 00007ffd1062cda4 RDI: 0000000000000005
[ 557.323973] RBP: 00007ffd1062ce00 R08: 0000000000000000 R09: 000055ea1d8f12a0
[ 557.323974] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007ffd1062cf18
[ 557.323975] R13: 000055ea03253249 R14: 000055ea03255d80 R15: 00007f0372121040
[ 557.323979] </TASK>

[ 557.431402] task:epoll-stall-wri state:R running task stack:0 pid:887 tgid:887 ppid:1 task_flags:0x400100 flags:0x00000002
[ 557.431405] Call Trace:
[ 557.431406] <TASK>
[ 557.431408] __schedule+0x401/0x15a0
[ 557.431410] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431412] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431414] ? psi_group_change+0x212/0x460
[ 557.431417] ? pick_eevdf+0x71/0x180
[ 557.431419] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431421] ? update_curr+0x8d/0x240
[ 557.431425] preempt_schedule+0x41/0x60
[ 557.431427] preempt_schedule_thunk+0x16/0x30
[ 557.431431] try_to_wake_up+0x2f6/0x6e0
[ 557.431433] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431437] ep_autoremove_wake_function+0x12/0x40
[ 557.431439] __wake_up_common+0x78/0xa0
[ 557.431443] __wake_up_sync+0x34/0x50
[ 557.431445] ep_poll_callback+0x13e/0x2b0
[ 557.431448] ? aa_file_perm+0x123/0x4e0
[ 557.431451] __wake_up_common+0x78/0xa0
[ 557.431454] __wake_up_sync_key+0x38/0x50
[ 557.431456] anon_pipe_write+0x43b/0x6d0
[ 557.431461] fifo_pipe_write+0x13/0xe0
[ 557.431463] vfs_write+0x374/0x420
[ 557.431468] ksys_write+0xc9/0xe0
[ 557.431471] do_syscall_64+0x6f/0x120
[ 557.431473] ? current_time+0x30/0x130
[ 557.431476] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431479] ? vfs_write+0x1bd/0x420
[ 557.431481] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431483] ? vfs_write+0x1bd/0x420
[ 557.431487] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431489] ? ksys_write+0xc9/0xe0
[ 557.431491] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431492] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431494] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 557.431496] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431498] ? do_syscall_64+0x7b/0x120
[ 557.431501] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431502] ? ksys_write+0xc9/0xe0
[ 557.431504] ? do_syscall_64+0x7b/0x120
[ 557.431506] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431508] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 557.431510] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431511] ? do_syscall_64+0x7b/0x120
[ 557.431513] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 557.431515] RIP: 0033:0x7f0ef191c887
[ 557.431516] RSP: 002b:00007ffc15f50948 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 557.431517] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0ef191c887
[ 557.431518] RDX: 0000000000000001 RSI: 0000561a16a0203d RDI: 0000000000000003
[ 557.431519] RBP: 00007ffc15f50970 R08: 0000000000000065 R09: 0000561a45f0e2a0
[ 557.431520] R10: 0000000000000077 R11: 0000000000000246 R12: 00007ffc15f50a88
[ 557.431521] R13: 0000561a16a011a9 R14: 0000561a16a03da8 R15: 00007f0ef1a7b040
[ 557.431525] </TASK>




Additionally, do you have RT throttling enabled in your setup? Can long
running RT tasks starve fair tasks on your setup?

RT throttling is enabled (default settings) but was not kicking in - why
should it in that scenario? The only RT thread, ktimers, ran into the
held lock and stopped.

Jan



--
Thanks and Regards,
Prateek