Re: scheduler problems in -next (was: Re: [PATCH 6.4 000/227] 6.4.7-rc1 review)

From: Guenter Roeck
Date: Wed Aug 02 2023 - 11:45:14 EST


On 8/2/23 08:05, Paul E. McKenney wrote:
> On Wed, Aug 02, 2023 at 02:57:56PM +0100, Roy Hopkins wrote:
>> On Tue, 2023-08-01 at 12:11 -0700, Paul E. McKenney wrote:
>>> On Tue, Aug 01, 2023 at 10:32:45AM -0700, Guenter Roeck wrote:
>>>
>>> Please see below for my preferred fix.  Does this work for you guys?
>>>
>>> Back to figuring out why recent kernels occasionally blow up all
>>> rcutorture guest OSes...
>>>
>>>                                                         Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
>>> index 7294be62727b..2d5b8385c357 100644
>>> --- a/kernel/rcu/tasks.h
>>> +++ b/kernel/rcu/tasks.h
>>> @@ -570,10 +570,12 @@ static void rcu_tasks_one_gp(struct rcu_tasks *rtp, bool midboot)
>>>         if (unlikely(midboot)) {
>>>                 needgpcb = 0x2;
>>>         } else {
>>> +               mutex_unlock(&rtp->tasks_gp_mutex);
>>>                 set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
>>>                 rcuwait_wait_event(&rtp->cbs_wait,
>>>                                    (needgpcb = rcu_tasks_need_gpcb(rtp)),
>>>                                    TASK_IDLE);
>>> +               mutex_lock(&rtp->tasks_gp_mutex);
>>>         }
>>>         if (needgpcb & 0x2) {
>>
>> Your preferred fix looks good to me.
>>
>> With the original code I can quite easily reproduce the problem on my
>> system every 10 reboots or so. With your fix in place the problem no
>> longer occurs.
>
> Very good, thank you! May I add your Tested-by?


FWIW, I am still working on it. So far I get

[ 8.191589] KTAP version 1
[ 8.191769] # Subtest: kunit_executor_test
[ 8.191972] # module: kunit
[ 8.192012] 1..8
[ 8.197643] ok 1 parse_filter_test
[ 8.201851] ok 2 filter_suites_test
[ 8.206713] ok 3 filter_suites_test_glob_test
[ 8.211806] ok 4 filter_suites_to_empty_test
[ 8.214077] kunit executor: filter operation not found: speed>slow, module!=example
[ 8.217933] # parse_filter_attr_test: ASSERTION FAILED at lib/kunit/executor_test.c:126
[ 8.217933] Expected err == 0, but
[ 8.217933] err == -22 (0xffffffffffffffea)
[ 8.217933]
[ 8.217933] failed to parse filter '(efault)'
[ 8.221266] not ok 5 parse_filter_attr_test
[ 8.224224] kunit executor: filter operation not found: speed>slow
[ 8.225837] # filter_attr_test: ASSERTION FAILED at lib/kunit/executor_test.c:165
[ 8.225837] Expected err == 0, but
[ 8.225837] err == -22 (0xffffffffffffffea)
[ 8.228850] not ok 6 filter_attr_test
[ 8.230942] kunit executor: filter operation not found: module!=dummy
[ 8.232167] # filter_attr_empty_test: ASSERTION FAILED at lib/kunit/executor_test.c:190
[ 8.232167] Expected err == 0, but
[ 8.232167] err == -22 (0xffffffffffffffea)
[ 8.235317] not ok 7 filter_attr_empty_test
[ 8.237065] kunit executor: filter operation not found: speed>slow
[ 8.238796] # filter_attr_skip_test: ASSERTION FAILED at lib/kunit/executor_test.c:209
[ 8.238796] Expected err == 0, but
[ 8.238796] err == -22 (0xffffffffffffffea)
[ 8.241897] not ok 8 filter_attr_skip_test
[ 8.241947] # kunit_executor_test: pass:4 fail:4 skip:0 total:8
[ 8.242144] # Totals: pass:4 fail:4 skip:0 total:8

and it looks like the console no longer works. Most likely this is some other problem
that was introduced while tests were broken. It will take me some time to track that down.

Guenter