Re: [PATCH v8 15/19] arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system

From: Qais Yousef
Date: Fri Jun 04 2021 - 08:14:37 EST


On 06/04/21 10:49, Mark Rutland wrote:
> On Thu, Jun 03, 2021 at 06:40:57PM +0100, Will Deacon wrote:
> > On Thu, Jun 03, 2021 at 01:58:56PM +0100, Mark Rutland wrote:
> > > On Wed, Jun 02, 2021 at 05:47:15PM +0100, Will Deacon wrote:
> > > > If we want to support 32-bit applications, then when we identify a CPU
> > > > with mismatched 32-bit EL0 support we must ensure that we will always
> > > > have an active 32-bit CPU available to us from then on. This is important
> > > > for the scheduler, because is_cpu_allowed() will be constrained to 32-bit
> > > > CPUs for compat tasks and forced migration due to a hotplug event will
> > > > hang if no 32-bit CPUs are available.
> > > >
> > > > On detecting a mismatch, prevent offlining of either the mismatching CPU
> > > > if it is 32-bit capable, or find the first active 32-bit capable CPU
> > > > otherwise.
> > > >
> > > > Reviewed-by: Catalin Marinas <catalin.marinas@xxxxxxx>
> > > > Signed-off-by: Will Deacon <will@xxxxxxxxxx>
> > > > ---
> > > > arch/arm64/kernel/cpufeature.c | 20 +++++++++++++++++++-
> > > > 1 file changed, 19 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> > > > index 4194a47de62d..b31d7a1eaed6 100644
> > > > --- a/arch/arm64/kernel/cpufeature.c
> > > > +++ b/arch/arm64/kernel/cpufeature.c
> > > > @@ -2877,15 +2877,33 @@ void __init setup_cpu_features(void)
> > > >
> > > > static int enable_mismatched_32bit_el0(unsigned int cpu)
> > > > {
> > > > + static int lucky_winner = -1;
> > >
> > > This is cute, but could we please give it a meaningful name, e.g.
> > > `pinned_cpu` ?
> >
> > I really don't see the problem, nor why it's "cute".
> >
> > Tell you what, I'll add a comment instead:
> >
> > /*
> > * The first 32-bit-capable CPU we detected and so can no longer
> > * be offlined by userspace. -1 indicates we haven't yet onlined
> > * a 32-bit-capable CPU.
> > */
>
> Thanks for the comment; that's helpful.
>
> However, my concern here is that when we inevitably have to discuss this
> with others in future, "lucky winner" is jarring (and also unclear to
> those where English is not their native language). For clarity, it would
> be really nice to use a term like "cpu", "chosen_cpu", "pinned_cpu",
> etc.
>
> However, you're the maintainer; choose what you think is appropriate.
>
> > > > struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
> > > > bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
> > > >
> > > > if (cpu_32bit) {
> > > > cpumask_set_cpu(cpu, cpu_32bit_el0_mask);
> > > > static_branch_enable_cpuslocked(&arm64_mismatched_32bit_el0);
> > > > - setup_elf_hwcaps(compat_elf_hwcaps);
> > > > }
> > > >
> > > > + if (cpumask_test_cpu(0, cpu_32bit_el0_mask) == cpu_32bit)
> > > > + return 0;
> > > > +
> > > > + if (lucky_winner >= 0)
> > > > + return 0;
> > > > +
> > > > + /*
> > > > + * We've detected a mismatch. We need to keep one of our CPUs with
> > > > + * 32-bit EL0 online so that is_cpu_allowed() doesn't end up rejecting
> > > > + * every CPU in the system for a 32-bit task.
> > > > + */
> > > > + lucky_winner = cpu_32bit ? cpu : cpumask_any_and(cpu_32bit_el0_mask,
> > > > + cpu_active_mask);
> > > > + get_cpu_device(lucky_winner)->offline_disabled = true;
> > > > + setup_elf_hwcaps(compat_elf_hwcaps);
> > > > + pr_info("Asymmetric 32-bit EL0 support detected on CPU %u; CPU hot-unplug disabled on CPU %u\n",
> > > > + cpu, lucky_winner);
> > > > return 0;
> > > > }
> > >
> > > I guess this is going to play havoc with kexec and hibernate. :/
> >
> > The kernel can still offline the CPUs (see the whole freezer mess that I
> > linked to in the cover letter). What specific havoc are you thinking of?
>
> Ah. If this is just inhibiting userspace-driven offlining, that sounds
> fine.
>
> For kexec, I was concerned that either this would inhibit kexec, or
> smp_shutdown_nonboot_cpus() would fail to offline the pinned CPU, and
> that'd trigger a BUG(), which would be unfortunate.
>
> For hibernate, the equivalent is freeze_secondary_cpus(), which I guess
> is dealt with by the freezer bits you mention.

get_cpu_device()->offline_disabled only blocks offline requests that go
through device_offline(). kexec, hibernate and suspend/resume use
cpu_online/offline() directly, so they won't be affected by it. I have sent
patches that make cpu_online/offline() 'private', i.e. not used or exported
outside the cpu subsystem, apart from the odd support function for arch code.
All other users go through device_offline() or add/remove_cpu() now.
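
To make the distinction concrete, here is a rough userspace sketch of the two
paths. It is a toy model, not kernel code: struct model_cpu_dev and the
model_*() helpers are names made up for illustration, and the -EPERM return
is my assumption about what the device_offline() path reports when the flag
is set.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a CPU's struct device; only the fields needed here. */
struct model_cpu_dev {
	bool online;
	bool offline_disabled;
};

/*
 * Models the userspace-driven path (e.g. a write to the sysfs "online"
 * file): the request is refused when offline_disabled is set.
 */
static int model_device_offline(struct model_cpu_dev *dev)
{
	assert(dev != NULL);
	if (dev->offline_disabled)
		return -EPERM;	/* assumed error code, see note above */
	dev->online = false;
	return 0;
}

/*
 * Models the in-kernel path used by kexec, hibernate and suspend: the
 * flag is never consulted, so the pinned CPU can still be taken down.
 */
static int model_direct_offline(struct model_cpu_dev *dev)
{
	assert(dev != NULL);
	dev->online = false;
	return 0;
}
```

With a pinned CPU (online and offline_disabled both true),
model_device_offline() fails and leaves it online, while
model_direct_offline() succeeds. That is the behaviour kexec and friends
rely on.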

I tested kexec, suspend-to-disk and suspend-to-RAM with my older, similar
implementation, so we should be good.

Cheers

--
Qais Yousef