Re: PROBLEM: Only one CPU active on Ultra 60 since ~4.8 (regression)
From: Linus Torvalds
Date: Thu Mar 28 2024 - 16:09:35 EST
On Thu, 28 Mar 2024 at 12:36, Linux regression tracking (Thorsten
Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:
>
> [CCing Linus, in case I say something to his disliking]
>
> On 22.03.24 05:57, Nick Bowler wrote:
> >
> > Just a friendly reminder that this issue still happens on Linux 6.8 and
> > reverting commit 9b2f753ec237 as indicated below is still sufficient to
> > resolve the problem.
>
> FWIW, that commit 9b2f753ec23710 ("sparc64: Fix cpu_possible_mask if
> nr_cpus is set") is from v4.8. Reverting it after all that time might
> easily lead to even bigger trouble.
I'm definitely not reverting a patch from almost a decade ago as a regression.
If it took that long to find, it can't be that critical of a regression.
So yes, let's treat it as a regular bug. And let's bring in Andreas to
the discussion too (although presumably he has seen it on the
sparclinux mailing list).
Andreas, if not, here's the link to lore for the beginning of the thread:
https://lore.kernel.org/all/CADyTPEwt=ZNams+1bpMB1F9w_vUdPsGCt92DBQxxq_VtaLoTdw@xxxxxxxxxxxxxx/
And from a quick look I do think that commit is buggy, and yes, the
fix probably is just to revert it.
As the original report makes clear, that commit 9b2f753ec23710 is
clearly confused about the difference between "number of CPUs" and
"index of CPU numbers".
When that smp_fill_in_cpu_possible_map() does
    int possible_cpus = num_possible_cpus();
and then uses that to fill in &__cpu_possible_mask, that's completely
nonsensical. Because we literally have
    #define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
    #define num_possible_cpus() cpumask_weight(cpu_possible_mask)
so it's reading cpu_possible_mask to figure out how many cpus it might
have, and then using that number to set possibly *different* bits in
the same bitmap that is just used to judge what the max number is.
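To make that concrete, here is a minimal userspace sketch, not the kernel code. It assumes the buggy logic amounts to "count the set bits, then set bits 0..count-1 and clear everything else". The names mask_t, mask_weight() and fill_from_count() are made up for illustration, and a machine whose two CPUs have IDs 0 and 2 is assumed as the example input.

```c
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit i set means CPU index i is possible. */
typedef uint64_t mask_t;

/* Rough analogue of cpumask_weight(): count the set bits in the mask. */
static int mask_weight(mask_t m)
{
    int n = 0;

    for (; m; m &= m - 1)   /* clear the lowest set bit each iteration */
        n++;
    return n;
}

/*
 * Assumed shape of the confused logic: treat the *number* of possible
 * CPUs as if it were the highest CPU *index* plus one, and rebuild the
 * mask as bits 0..count-1.
 */
static mask_t fill_from_count(mask_t possible)
{
    int count = mask_weight(possible);
    mask_t out = 0;

    for (int i = 0; i < count; i++)
        out |= (mask_t)1 << i;
    return out;
}
```

With the input mask 0x5 (CPUs 0 and 2), mask_weight() gives 2 and fill_from_count() returns 0x3: a CPU 1 that does not exist is marked possible, and the real CPU 2 is dropped. When the CPU IDs are contiguous from 0 (mask 0x3), the result equals the input, which would explain why the problem only shows up on machines with sparse CPU numbering.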
So I do think a revert is called for, but I'm not going to treat this
as a regression, I'm going to just treat it as "sparc bug" and hope
that the sparc people try to figure out why that crazy code was
written.
And maybe it made more sense back a decade ago than it does now.
Andreas?
Linus