Re: [PATCH 0/5] cpufreq: cppc: Fix suspend/resume specific races with FIE code
From: Viresh Kumar
Date: Wed Jun 16 2021 - 00:57:15 EST
On 15-06-21, 08:17, Qian Cai wrote:
> On 6/15/2021 3:50 AM, Viresh Kumar wrote:
> > This is a strange place to get the issue from. And this is a new
> > issue.
>
> Well, it was still the same exercise with CPU online/offline.
>
> >
> >> [ 488.151939][ T670] kthread+0x3ac/0x460
> >> [ 488.155854][ T670] ret_from_fork+0x10/0x18
> >> [ 488.160120][ T670] Code: 911e8000 aa1303e1 910a0000 941b595b (d4210000)
> >> [ 488.166901][ T670] ---[ end trace e637e2d38b2cc087 ]---
> >> [ 488.172206][ T670] Kernel panic - not syncing: Oops - BUG: Fatal exception
> >> [ 488.179182][ T670] SMP: stopping secondary CPUs
> >> [ 489.209347][ T670] SMP: failed to stop secondary CPUs 0-1,10-11,16-17,31
> >> [ 489.216128][ T][ T670] Memoryn ]---
> >
> > Can you give details on what exactly you did to get this?
> > A normal boot or something more?
>
> Basically, it has the cpufreq driver as CPPC and the governor as
> schedutil. Running a few workloads to get CPU scaling up and down.
> Later, try to offline all CPUs until the last one and then online
> all CPUs.
Hmm, okay.
So I basically have a very similar setup with 8 cores (one policy
per CPU). The only difference is that I don't end up reading the
performance counters; everything else remains the same. So I should
see issues now just like you, if there are any.
Since the insmod/rmmod setup is a bit different, this is what I tried
today for around an hour with CONFIG_DEBUG_LIST and RCU debugging
options.
while true; do
    for i in $(seq 1 7); do
        echo 0 > /sys/devices/system/cpu/cpu$i/online
    done
    for i in $(seq 1 7); do
        echo 1 > /sys/devices/system/cpu/cpu$i/online
    done
done
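The same loop can be written without hard-coding the CPU numbers. This is only a sketch and not what was run for the test above: CPU_ROOT, set_all_cpus and hotplug_cycles are made-up names, and cpu0 is skipped on the assumption that one CPU has to stay online.

```shell
#!/bin/sh
# Sketch only: CPU_ROOT and both function names are illustrative.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Write $1 (0 = offline, 1 = online) to every cpuN/online file except cpu0's.
# cpu0 is left alone so that at least one CPU always stays online (assumption).
set_all_cpus() {
    for f in "$CPU_ROOT"/cpu[0-9]*/online; do
        [ -e "$f" ] || continue
        case "$f" in
            */cpu0/online) continue ;;
        esac
        echo "$1" > "$f" || return 1
    done
}

# Run $1 offline/online cycles (default: 1) and stop at the first failed write.
hotplug_cycles() {
    n="${1:-1}"
    while [ "$n" -gt 0 ]; do
        set_all_cpus 0 && set_all_cpus 1 || return 1
        n=$((n - 1))
    done
}
```

Run as root, for example with "hotplug_cycles 100". The bounded count is a deliberate choice so that the run ends and the kernel log can be inspected afterwards.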
I don't see any crashes, oopses or warnings with the latest changes.
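One way to check the kernel log after such a run is to filter it for the usual crash markers. This is a rough sketch: the function name scan_kernel_log and the pattern list are illustrative choices, not an exhaustive set of what the debugging options can print.

```shell
#!/bin/sh
# Sketch only: the function name and pattern list are illustrative.
# Reads kernel log text on stdin and prints the lines that look like a
# panic, oops, BUG, warning or list-debugging report.
# Exit status is non-zero when nothing matches (grep semantics).
scan_kernel_log() {
    grep -E 'Kernel panic|Oops|BUG:|WARNING:|list_(add|del) corruption'
}
```

For example, "dmesg | scan_kernel_log || echo clean" prints the suspicious lines, or "clean" if there are none.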
> I am hesitant to try this at the moment because this all feels like
> shooting in the dark.
I understand your point and you aren't completely wrong here. It
wasn't completely in the dark, but since I am unable to reproduce the
issue at my end, I asked for help.
FWIW, I think one possible cause of the kthread corruption could have
been the race in the topology-related code. I already fixed that in
my tree yesterday.
> Ideally, you will be able to get access to one
> of those arm64 servers (Huawei, Ampere, TX2, FJ etc) eventually and
> really try the same exercises yourself with those debugging options
> like list debugging and KASAN on. That way you could fix things much
> more efficiently.
Yeah, I thought this work was over, and I am not normally a user of
it. I had to enable it for ARM servers, and I took the help of my
colleagues (Vincent Guittot and Ionela) to test it.
I have also asked Vincent to give it a try again.
> I could share the .config with you once you are there. Last
> but not least, once you have narrowed down the issues better, I'd
> hope to see someone else familiar with the code review those
> patches first (feel free to Cc me once you are ready to post)
> before I rerun the whole thing again. That way we don't waste
> each other's time going back and forth chasing shadows.
I did send the patches for review, and this last issue (the one you
reported) was a different race altogether, so I asked for testing
without reviews.
Anyway, I am quite sure my tests have covered such issues now. I will
send out patches again soon.
Thanks Qian.
--
viresh