Re: [PATCH linux-next][RFC] powerpc: fix HOTPLUG error in rcutorture

From: Zhouyi Zhou
Date: Mon Oct 10 2022 - 21:59:46 EST


Thanks Michael for reviewing my patch

On Mon, Oct 10, 2022 at 7:21 PM Michael Ellerman <mpe@xxxxxxxxxxxxxx> wrote:
>
> Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> writes:
> > I think we should avoid torture offline the cpu who do tick timer
> > when nohz full is running.
>
> Can you tell us what the bug you're fixing is?
>
> Did you see a crash/oops/hang etc? Or are you just proposing this as
> something that would be a good idea?
Sorry for the trouble and inconvenience that I bring to the community.
I haven't made myself clear in my patch.
The ins and outs are as follows:
1) cd linux-next
2) ./tools/testing/selftests/rcutorture/bin/torture.sh
after 19 hours ;-)
3) tail ./tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture/results-scftorture/NOPREEMPT/console.log

[ 121.449268][ T57] scftorture: scf_invoked_count VER: 2415215
resched: 697463 single: 619512/619760 single_ofl: 255751/256554
single_rpc: 620692 single_rpc_ofl: 0 many: 155476/154658 all:
77282/76988 onoff: 3/3:5/6 18,25:9,28 63:93 (HZ=100) ste: 0 stnmie: 0
stnmoe: 0 staf: 0
[ 121.454485][ T57] scftorture: --- End of test: LOCK_HOTPLUG:
verbose=1 holdoff=10 longwait=0 nthreads=4 onoff_holdoff=30
onoff_interval=1000 shutdown_secs=1 stat_interval=15 stutter=5
use_cpus_read_lock=0, weight_resched=-1, weight_single=-1,
weight_single_rpc=-1, weight_single_wait=-1, weight_many=-1,
weight_many_wait=-1, weight_all=-1, weight_all_wait=-1
[ 121.469305][ T57] reboot: Power down

I see "End of test: LOCK_HOTPLUG", which means the function
torture_offline in kernel torture.c failed to bring down the cpu.
4) Then I chase the reason down to tick_nohz_cpu_down:
if (tick_nohz_full_running && tick_do_timer_cpu == cpu)
return -EBUSY;
5) I create above patch
>
> > Tested on PPC VM of Open Source Lab of Oregon State University.
> > The test results show that after the fix, the success rate of
> > rcutorture is improved.
> > After:
> > Successes: 40 Failures: 9
> > Before:
> > Successes: 38 Failures: 11
> >
> > I examined the console.log and Make.out files one by one, no new
> > compile error or test error is introduced by above fix.
> >
> > Signed-off-by: Zhouyi Zhou <zhouzhouyi@xxxxxxxxx>
> > ---
> > Dear PPC developers
> >
> > I found this bug when trying to do rcutorture tests in ppc VM of
> > Open Source Lab of Oregon State University:
> >
> > ubuntu@ubuntu:~/linux-next/tools/testing/selftests/rcutorture/res/2022.09.30-01.06.22-torture$ find . -name "console.log.diags"|xargs grep HOTPLUG
> > ./results-scftorture/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT
> > ./results-rcutorture/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03
> > ./results-rcutorture/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04
> > ./results-scftorture-kasan/NOPREEMPT/console.log.diags:WARNING: HOTPLUG FAILURES NOPREEMPT
> > ./results-rcutorture-kasan/TASKS03/console.log.diags:WARNING: HOTPLUG FAILURES TASKS03
> > ./results-rcutorture-kasan/TREE04/console.log.diags:WARNING: HOTPLUG FAILURES TREE04
> >
> > I tried to fix this bug.
> >
> > Thanks for your patience and guidance ;-)
> >
> > Thanks
> > Zhouyi
> > --
> > arch/powerpc/kernel/sysfs.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> > index ef9a61718940..be9c0e45337e 100644
> > --- a/arch/powerpc/kernel/sysfs.c
> > +++ b/arch/powerpc/kernel/sysfs.c
> > @@ -4,6 +4,7 @@
> > #include <linux/smp.h>
> > #include <linux/percpu.h>
> > #include <linux/init.h>
> > +#include <linux/tick.h>
> > #include <linux/sched.h>
> > #include <linux/export.h>
> > #include <linux/nodemask.h>
> > @@ -21,6 +22,7 @@
> > #include <asm/firmware.h>
> > #include <asm/idle.h>
> > #include <asm/svm.h>
> > +#include "../../../kernel/time/tick-internal.h"
>
> Needing to include this internal header is a sign that we are using the
> wrong API or otherwise using time keeping internals we shouldn't be.
Yes, when I do this, I guess there is something wrong in my patch.
>
> > #include "cacheinfo.h"
> > #include "setup.h"
> > @@ -1151,7 +1153,11 @@ static int __init topology_init(void)
> > * CPU. For instance, the boot cpu might never be valid
> > * for hotplugging.
> > */
> > - if (smp_ops && smp_ops->cpu_offline_self)
> > + if (smp_ops && smp_ops->cpu_offline_self
> > +#ifdef CONFIG_NO_HZ_FULL
> > + && !(tick_nohz_full_running && tick_do_timer_cpu == cpu)
> > +#endif
> > + )
>
> I can't see any other arches doing anything like this. I don't think
> it's the arches responsibility.
Agree!

X86 seems to disable CPU0's hotplug by default, while
tick_do_timer_cpu has a default value 0.

42 #ifdef CONFIG_BOOTPARAM_HOTPLUG_CPU0
43 static int cpu0_hotpluggable = 1;
44 #else
45 static int cpu0_hotpluggable;
46 static int __init enable_cpu0_hotplug(char *str)
47 {
48 cpu0_hotpluggable = 1;
49 return 1;
50 }
51
52 __setup("cpu0_hotplug", enable_cpu0_hotplug);
53 #endif

I need more time to make clear the relationship of X86's
cpu0_hotpluggable and tick_do_timer_cpu, but
I also intend to think it's time keeping the mechanism's responsibility.


>
> If the time keeping core needs a CPU to stay online to run the timer
> then it needs to organise that itself IMHO :)

Um, I am going to submit a patch to time keeping community sometime
next month ;-)

Thanks again
Cheers
Zhouyi
>
> cheers
>
> > c->hotpluggable = 1;
> > #endif
> >
> > --
> > 2.25.1