Re: [BUG almost bisected] Splat in dequeue_rt_stack() and build error

From: Tomas Glozar
Date: Mon Dec 16 2024 - 09:38:44 EST


On Sun, Dec 15, 2024 at 19:41, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> And the fix for the TREE03 too-short grace periods is finally in, at
> least in prototype form:
>
> https://lore.kernel.org/all/da5065c4-79ba-431f-9d7e-1ca314394443@paulmck-laptop/
>
> Or this commit on -rcu:
>
> 22bee20913a1 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")
>
> This passes more than 30 hours of 400 concurrent instances of rcutorture's
> TREE03 scenario, with modifications that brought the bug reproduction
> rate up to 50 per hour. I therefore have strong reason to believe that
> this fix is a real fix.
>
> With this fix in place, a 20-hour run of 400 concurrent instances
> of rcutorture's TREE03 scenario resulted in 50 instances of the
> enqueue_dl_entity() splat pair. One (untrimmed) instance of this pair
> of splats is shown below.
>
> You guys did reproduce this some time back, so unless you tell me
> otherwise, I will assume that you have this in hand. I would of course
> be quite happy to help, especially with adding carefully chosen debug
> (heisenbug and all that) or testing of alleged fixes.
>

The same splat was recently reported on LKML [1], and a patchset fixing
a few bugs around double-enqueueing of the deadline server was sent and
merged into tip/sched/urgent [2]. I'm currently re-running TREE03 with
those patches to see whether they also fix this issue.
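
As a toy illustration of what "double enqueue" means here, and of the
kind of flag-style guard that can prevent it: the names below are made
up and this is plain user-space C, not the kernel code, so it may well
not match what the merged patchset actually does.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the deadline server's scheduling entity. */
struct toy_dl_server {
	bool on_rq;		/* queued on the toy "dl rbtree"? */
	bool server_active;	/* hypothetical "already started" guard */
};

static void toy_enqueue(struct toy_dl_server *s)
{
	if (s->on_rq) {
		printf("WARN: double enqueue\n");  /* the splat case */
		return;
	}
	s->on_rq = true;
}

static void toy_server_start(struct toy_dl_server *s)
{
	if (s->server_active)	/* guard: a second start is a no-op */
		return;
	s->server_active = true;
	toy_enqueue(s);
}

static void toy_server_stop(struct toy_dl_server *s)
{
	if (!s->server_active)
		return;
	s->server_active = false;
	s->on_rq = false;	/* dequeue */
}

int main(void)
{
	struct toy_dl_server s = { 0 };

	toy_server_start(&s);	/* queues the server */
	toy_server_start(&s);	/* guarded: does not queue it again */
	toy_server_stop(&s);	/* dequeues and clears the guard */
	return 0;
}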

Also, last week I came up with some more extensive tracing, which
showed dl_server_update and dl_server_start being called back to back,
apparently within the same invocation of enqueue_task_fair (see the
trace below). I'm currently looking into whether the mechanism shown
by the trace is addressed by that patchset.

--------------------------

rcu_tort-148 1dN.3. 20531758076us : dl_server_stop <-dequeue_entities
rcu_tort-148 1dN.2. 20531758076us : dl_server_queue: cpu=1 level=2 enqueue=0
rcu_tort-148 1dN.3. 20531758078us : <stack trace>
=> trace_event_raw_event_dl_server_queue
=> dl_server_stop
=> dequeue_entities
=> dequeue_task_fair
=> __schedule
=> schedule
=> schedule_hrtimeout_range_clock
=> torture_hrtimeout_us
=> rcu_torture_writer
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
rcu_tort-148 1dN.3. 20531758095us : dl_server_update <-update_curr
rcu_tort-148 1dN.3. 20531758097us : dl_server_update <-update_curr
rcu_tort-148 1dN.2. 20531758101us : dl_server_queue: cpu=1 level=2 enqueue=1
rcu_tort-148 1dN.3. 20531758103us : <stack trace>
rcu_tort-148 1dN.2. 20531758104us : dl_server_queue: cpu=1 level=1 enqueue=1
rcu_tort-148 1dN.3. 20531758106us : <stack trace>
rcu_tort-148 1dN.2. 20531758106us : dl_server_queue: cpu=1 level=0 enqueue=1
rcu_tort-148 1dN.3. 20531758108us : <stack trace>
=> trace_event_raw_event_dl_server_queue
=> rb_insert_color
=> enqueue_dl_entity
=> update_curr_dl_se
=> update_curr
=> enqueue_task_fair
=> enqueue_task
=> activate_task
=> attach_task
=> sched_balance_rq
=> sched_balance_newidle.constprop.0
=> pick_next_task_fair
=> __schedule
=> schedule
=> schedule_hrtimeout_range_clock
=> torture_hrtimeout_us
=> rcu_torture_writer
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
rcu_tort-148 1dN.3. 20531758110us : dl_server_start <-enqueue_task_fair
rcu_tort-148 1dN.2. 20531758110us : dl_server_queue: cpu=1 level=2 enqueue=1
rcu_tort-148 1dN.3. 20531760934us : <stack trace>
=> trace_event_raw_event_dl_server_queue
=> enqueue_dl_entity
=> dl_server_start
=> enqueue_task_fair
=> enqueue_task
=> activate_task
=> attach_task
=> sched_balance_rq
=> sched_balance_newidle.constprop.0
=> pick_next_task_fair
=> __schedule
=> schedule
=> schedule_hrtimeout_range_clock
=> torture_hrtimeout_us
=> rcu_torture_writer
=> kthread
=> ret_from_fork
=> ret_from_fork_asm
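
Boiled down, the ordering in the trace above looks roughly like the
toy user-space model below (made-up names, not the actual kernel
code): the update path queues the server's entity as a side effect,
and dl_server_start then queues it again within the same
enqueue_task_fair.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the ordering in the trace above; not kernel code. */
struct toy_dl_server {
	bool on_rq;	/* queued on the toy "dl rbtree"? */
};

static void toy_enqueue_dl_entity(struct toy_dl_server *s, const char *caller)
{
	if (s->on_rq) {
		printf("WARN: double enqueue, second queue via %s\n", caller);
		return;
	}
	s->on_rq = true;
	printf("queued via %s\n", caller);
}

/* models update_curr() -> dl_server_update() queueing the server */
static void toy_dl_server_update(struct toy_dl_server *s)
{
	toy_enqueue_dl_entity(s, "dl_server_update");
}

/* models dl_server_start() called later in the same enqueue_task_fair() */
static void toy_dl_server_start(struct toy_dl_server *s)
{
	toy_enqueue_dl_entity(s, "dl_server_start");
}

int main(void)
{
	/* dl_server_stop() ran on the dequeue path, so we start dequeued */
	struct toy_dl_server s = { .on_rq = false };

	toy_dl_server_update(&s);	/* first queue, from the update path */
	toy_dl_server_start(&s);	/* second queue attempt: the splat   */
	return 0;
}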

[1] - https://lore.kernel.org/lkml/571b2045-320d-4ac2-95db-1423d0277613@xxxxxxx/
[2] - https://lore.kernel.org/lkml/20241213032244.877029-1-vineeth@xxxxxxxxxxxxxxx/

> Just let me know!
>
> Thanx, Paul

Tomas