Re: [BUG almost bisected] Splat in dequeue_rt_stack() and build error
From: Paul E. McKenney
Date: Mon Dec 16 2024 - 14:36:34 EST
On Mon, Dec 16, 2024 at 03:38:20PM +0100, Tomas Glozar wrote:
> ne 15. 12. 2024 v 19:41 odesílatel Paul E. McKenney <paulmck@xxxxxxxxxx> napsal:
> >
> > And the fix for the TREE03 too-short grace periods is finally in, at
> > least in prototype form:
> >
> > https://lore.kernel.org/all/da5065c4-79ba-431f-9d7e-1ca314394443@paulmck-laptop/
> >
> > Or this commit on -rcu:
> >
> > 22bee20913a1 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")
> >
> > This passes more than 30 hours of 400 concurrent instances of rcutorture's
> > TREE03 scenario, with modifications that brought the bug reproduction
> > rate up to 50 per hour. I therefore have strong reason to believe that
> > this fix is a real fix.
> >
> > With this fix in place, a 20-hour run of 400 concurrent instances
> > of rcutorture's TREE03 scenario resulted in 50 instances of the
> > enqueue_dl_entity() splat pair. One (untrimmed) instance of this pair
> > of splats is shown below.
> >
> > You guys did reproduce this some time back, so unless you tell me
> > otherwise, I will assume that you have this in hand. I would of course
> > be quite happy to help, especially with adding carefully chosen debug
> > (heisenbug and all that) or testing of alleged fixes.
> >
>
> The same splat was recently reported to LKML [1] and a patchset was
> sent and merged into tip/sched/urgent that fixes a few bugs around
> double-enqueue of the deadline server [2]. I'm currently re-running
> TREE03 with those patches, hoping they will also fix this issue.
Thank you very much!
An initial four-hour test of 400 instances of an enhanced TREE03 ran
error-free. I would have expected about 10 errors, so this gives me
99.9+% confidence that the patches improved things at least a little
bit and 99% confidence that these patches reduced the error rate by at
least a factor of two.
I am starting an overnight run. If that completes without error, this
will provide 99% confidence that these patches reduced the error rate
by at least an order of magnitude.
> Also, last week I came up with some more extensive tracing, which
> showed dl_server_update and dl_server_start happening right after each
> other, apparently during the same run of enqueue_task_fair (see
> below). I'm currently looking into that to figure out whether the
> mechanism shown by the trace is fixed by the patchset.
And for whatever it is worth, judging by my console output, I am seeing
something similar.
Thanx, Paul
> --------------------------
>
> rcu_tort-148 1dN.3. 20531758076us : dl_server_stop <-dequeue_entities
> rcu_tort-148 1dN.2. 20531758076us : dl_server_queue: cpu=1
> level=2 enqueue=0
> rcu_tort-148 1dN.3. 20531758078us : <stack trace>
> => trace_event_raw_event_dl_server_queue
> => dl_server_stop
> => dequeue_entities
> => dequeue_task_fair
> => __schedule
> => schedule
> => schedule_hrtimeout_range_clock
> => torture_hrtimeout_us
> => rcu_torture_writer
> => kthread
> => ret_from_fork
> => ret_from_fork_asm
> rcu_tort-148 1dN.3. 20531758095us : dl_server_update <-update_curr
> rcu_tort-148 1dN.3. 20531758097us : dl_server_update <-update_curr
> rcu_tort-148 1dN.2. 20531758101us : dl_server_queue: cpu=1
> level=2 enqueue=1
> rcu_tort-148 1dN.3. 20531758103us : <stack trace>
> rcu_tort-148 1dN.2. 20531758104us : dl_server_queue: cpu=1
> level=1 enqueue=1
> rcu_tort-148 1dN.3. 20531758106us : <stack trace>
> rcu_tort-148 1dN.2. 20531758106us : dl_server_queue: cpu=1
> level=0 enqueue=1
> rcu_tort-148 1dN.3. 20531758108us : <stack trace>
> => trace_event_raw_event_dl_server_queue
> => rb_insert_color
> => enqueue_dl_entity
> => update_curr_dl_se
> => update_curr
> => enqueue_task_fair
> => enqueue_task
> => activate_task
> => attach_task
> => sched_balance_rq
> => sched_balance_newidle.constprop.0
> => pick_next_task_fair
> => __schedule
> => schedule
> => schedule_hrtimeout_range_clock
> => torture_hrtimeout_us
> => rcu_torture_writer
> => kthread
> => ret_from_fork
> => ret_from_fork_asm
> rcu_tort-148 1dN.3. 20531758110us : dl_server_start <-enqueue_task_fair
> rcu_tort-148 1dN.2. 20531758110us : dl_server_queue: cpu=1
> level=2 enqueue=1
> rcu_tort-148 1dN.3. 20531760934us : <stack trace>
> => trace_event_raw_event_dl_server_queue
> => enqueue_dl_entity
> => dl_server_start
> => enqueue_task_fair
> => enqueue_task
> => activate_task
> => attach_task
> => sched_balance_rq
> => sched_balance_newidle.constprop.0
> => pick_next_task_fair
> => __schedule
> => schedule
> => schedule_hrtimeout_range_clock
> => torture_hrtimeout_us
> => rcu_torture_writer
> => kthread
> => ret_from_fork
> => ret_from_fork_asm
>
> [1] - https://lore.kernel.org/lkml/571b2045-320d-4ac2-95db-1423d0277613@xxxxxxx/
> [2] - https://lore.kernel.org/lkml/20241213032244.877029-1-vineeth@xxxxxxxxxxxxxxx/
>
> > Just let me know!
> >
> > Thanx, Paul
>
> Tomas
>