Re: [PATCH 0/9] perf: Avoid explicit cpumask var allocation from stack

From: Mark Rutland
Date: Tue Apr 02 2024 - 10:48:36 EST


On Tue, Apr 02, 2024 at 09:40:08PM +0800, Dawei Li wrote:
> Hi Mark,
>
> Thanks for the quick review.
>
> On Tue, Apr 02, 2024 at 12:12:50PM +0100, Mark Rutland wrote:
> > On Tue, Apr 02, 2024 at 06:56:01PM +0800, Dawei Li wrote:
> > > Hi,
> > >
> > > This series tries to eliminate direct cpumask var allocation from the
> > > stack for the perf subsystem.
> > >
> > > Direct/explicit allocation of a cpumask on the stack can be dangerous,
> > > since it can lead to stack overflow on systems with a big NR_CPUS or
> > > CONFIG_CPUMASK_OFFSTACK=y.
> > >
> > > For arm64, it's more urgent since commit 3fbd56f0e7c1 ("ARM64: Dynamically
> > > allocate cpumasks and increase supported CPUs to 512").
> > >
> > > It's sort of a pattern that almost every cpumask var in the perf
> > > subsystem occurs in a cpuhp teardown callback. In that case, if dynamic
> > > allocation fails (which is unlikely), we choose to return 0 rather than
> > > -ENOMEM to the caller, because @teardown is not supposed to fail, and
> > > if it does, the system crashes:
> >
> > .. but we've left the system in an incorrect state, so that makes no sense.
> >
> > As I commented on the first patch, NAK to dynamically allocating cpumasks in
> > the CPUHP callbacks. Please allocate the necessary cpumasks up-front when we
> > probe the PMU. At that time we can handle an allocation failure by cleaning up
> > and failing to probe the PMU, and then the CPUHP callbacks don't need to
> > allocate memory to offline a CPU...
>
> Agreed that dynamic allocation in the callbacks leads to an inconsistent
> system state.
>
> My original alternative plan was simple but ugly: just make the cpumask
> var _static_ and add an extra static lock to protect it.
>
> The only difference between the solution above and your proposal is
> static vs. dynamic allocation. CPUHP's teardown callback is supposed to
> run in the targeted CPU's cpuhp thread in most cases, and that's racy.
> Even if the cpumask var is wrapped in a dynamically allocated struct
> xxx_pmu, it's still shared between different threads/contexts and needs
> proper protection.

I was under the impression that the cpuhp callbacks were *strictly* serialised.
If that's not the case, the existing offlining callbacks are horrendously
broken.

Are you *certain* these can race?

Regardless, adding additional locking here is not ok.
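
To be concrete about the alternative: allocate at probe time, and fail
the probe if that allocation fails. A rough, untested sketch of the
shape I mean (the "example_pmu" structure, "scratch_mask" field, and
probe function are invented for illustration, not taken from any of
these drivers):

| #include <linux/cpumask.h>
| #include <linux/platform_device.h>
|
| struct example_pmu {
| 	/* ... */
| 	cpumask_var_t scratch_mask;	/* scratch space for the cpuhp callbacks */
| };
|
| static int example_pmu_probe(struct platform_device *pdev)
| {
| 	struct example_pmu *pmu;
|
| 	pmu = devm_kzalloc(&pdev->dev, sizeof(*pmu), GFP_KERNEL);
| 	if (!pmu)
| 		return -ENOMEM;
|
| 	/*
| 	 * With CONFIG_CPUMASK_OFFSTACK=y this actually allocates; if it
| 	 * fails we simply fail the probe, so the hotplug callbacks never
| 	 * need to allocate (or handle allocation failure) at all.
| 	 */
| 	if (!zalloc_cpumask_var(&pmu->scratch_mask, GFP_KERNEL))
| 		return -ENOMEM;
|
| 	/* ... register the PMU and the cpuhp state as usual ... */
| 	return 0;
| }

The mask would need a matching free_cpumask_var() on the remove and
error paths (not shown), but the offlining path itself stays
allocation-free.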

> As simple as this (_untested_):
>
> diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c
> index 7ef9c7e4836b..fa89c3db4d7d 100644
> --- a/drivers/perf/arm-cmn.c
> +++ b/drivers/perf/arm-cmn.c
> @@ -1950,18 +1950,24 @@ static int arm_cmn_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_no
>  	struct arm_cmn *cmn;
>  	unsigned int target;
>  	int node;
> -	cpumask_t mask;
> +	static cpumask_t mask;
> +	static DEFINE_SPINLOCK(cpumask_lock);
>
>  	cmn = hlist_entry_safe(cpuhp_node, struct arm_cmn, cpuhp_node);
>  	if (cpu != cmn->cpu)
>  		return 0;
>
> +	spin_lock(&cpumask_lock);
> +
>  	node = dev_to_node(cmn->dev);
>  	if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
>  	    cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
>  		target = cpumask_any(&mask);
>  	else
>  		target = cpumask_any_but(cpu_online_mask, cpu);
> +
> +	spin_unlock(&cpumask_lock);
> +
>  	if (target < nr_cpu_ids)
>  		arm_cmn_migrate(cmn, target);
>  	return 0;

Looking at this case, the only reason we need the mask is that it made the
logic a little easier to write. All we really want is to choose some CPU in
the intersection of two masks, ignoring a specific CPU, and there was no
helper function to do that.

We can add a new helper to do that for us, which would avoid the redundant
work of manipulating the entire mask, and it would make the existing code
simpler. I had a series a few years back to add cpumask_any_and_but():

https://lore.kernel.org/lkml/1486381132-5610-1-git-send-email-mark.rutland@xxxxxxx/

.. and that's easy to use here, e.g.

| diff --git a/drivers/perf/arm-cmn.c b/drivers/perf/arm-cmn.c
| index 7ef9c7e4836b7..c6bbd387ccf8b 100644
| --- a/drivers/perf/arm-cmn.c
| +++ b/drivers/perf/arm-cmn.c
| @@ -1950,17 +1950,15 @@ static int arm_cmn_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_no
|  	struct arm_cmn *cmn;
|  	unsigned int target;
|  	int node;
| -	cpumask_t mask;
|
|  	cmn = hlist_entry_safe(cpuhp_node, struct arm_cmn, cpuhp_node);
|  	if (cpu != cmn->cpu)
|  		return 0;
|
|  	node = dev_to_node(cmn->dev);
| -	if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
| -	    cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
| -		target = cpumask_any(&mask);
| -	else
| +	target = cpumask_any_and_but(cpu_online_mask, cpumask_of_node(node),
| +				     cpu);
| +	if (target >= nr_cpu_ids)
|  		target = cpumask_any_but(cpu_online_mask, cpu);
|  	if (target < nr_cpu_ids)
|  		arm_cmn_migrate(cmn, target);

It doesn't trivially rebase since the cpumask code has changed a fair amount,
but I've managed to do that locally, and I can send that out as a
seven-years-late v2 if it's useful.
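
For reference, the helper itself is roughly the below (modulo whatever
the rebase changed), modelled on the existing cpumask_any_but(); it
avoids materialising an intermediate mask entirely:

| /*
|  * Return an arbitrary CPU in (*mask1 & *mask2), excluding @cpu.
|  * Returns >= nr_cpu_ids if no such CPU exists.
|  */
| static inline
| unsigned int cpumask_any_and_but(const struct cpumask *mask1,
| 				 const struct cpumask *mask2,
| 				 unsigned int cpu)
| {
| 	unsigned int i;
|
| 	cpumask_check(cpu);
| 	i = cpumask_first_and(mask1, mask2);
| 	if (i != cpu)
| 		return i;
|
| 	return cpumask_next_and(cpu, mask1, mask2);
| }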

From a quick scan, it looks like that'd handle all cases in this series. Are
there any patterns in this series for which that would not be sufficient?

Mark.

>
> And yes, static allocation is evil :)
>
>
> Thanks,
>
> Dawei
>
> >
> > Also, for the titles it'd be better to say something like "avoid placing
> > cpumasks on the stack", because "explicit cpumask var allocation" sounds like
> > the use of alloc_cpumask_var().
>
> Sounds great! I will update it.
>
> >
> > Mark.
> >
> > >
> > > static int cpuhp_issue_call(int cpu, enum cpuhp_state state, bool bringup,
> > > 			    struct hlist_node *node)
> > > {
> > > 	struct cpuhp_step *sp = cpuhp_get_step(state);
> > > 	int ret;
> > >
> > > 	/*
> > > 	 * If there's nothing to do, we done.
> > > 	 * Relies on the union for multi_instance.
> > > 	 */
> > > 	if (cpuhp_step_empty(bringup, sp))
> > > 		return 0;
> > > 	/*
> > > 	 * The non AP bound callbacks can fail on bringup. On teardown
> > > 	 * e.g. module removal we crash for now.
> > > 	 */
> > > #ifdef CONFIG_SMP
> > > 	if (cpuhp_is_ap_state(state))
> > > 		ret = cpuhp_invoke_ap_callback(cpu, state, bringup, node);
> > > 	else
> > > 		ret = cpuhp_invoke_callback(cpu, state, bringup, node, NULL);
> > > #else
> > > 	ret = cpuhp_invoke_callback(cpu, state, bringup, node, NULL);
> > > #endif
> > > 	BUG_ON(ret && !bringup);
> > > 	return ret;
> > > }
> > >
> > > Dawei Li (9):
> > > perf/alibaba_uncore_drw: Avoid explicit cpumask var allocation from
> > > stack
> > > perf/arm-cmn: Avoid explicit cpumask var allocation from stack
> > > perf/arm_cspmu: Avoid explicit cpumask var allocation from stack
> > > perf/arm_dsu: Avoid explicit cpumask var allocation from stack
> > > perf/dwc_pcie: Avoid explicit cpumask var allocation from stack
> > > perf/hisi_pcie: Avoid explicit cpumask var allocation from stack
> > > perf/hisi_uncore: Avoid explicit cpumask var allocation from stack
> > > perf/qcom_l2: Avoid explicit cpumask var allocation from stack
> > > perf/thunder_x2: Avoid explicit cpumask var allocation from stack
> > >
> > > drivers/perf/alibaba_uncore_drw_pmu.c | 13 +++++++++----
> > > drivers/perf/arm-cmn.c | 13 +++++++++----
> > > drivers/perf/arm_cspmu/arm_cspmu.c | 13 +++++++++----
> > > drivers/perf/arm_dsu_pmu.c | 18 +++++++++++++-----
> > > drivers/perf/dwc_pcie_pmu.c | 17 +++++++++++------
> > > drivers/perf/hisilicon/hisi_pcie_pmu.c | 15 ++++++++++-----
> > > drivers/perf/hisilicon/hisi_uncore_pmu.c | 13 +++++++++----
> > > drivers/perf/qcom_l2_pmu.c | 15 ++++++++++-----
> > > drivers/perf/thunderx2_pmu.c | 20 ++++++++++++--------
> > > 9 files changed, 92 insertions(+), 45 deletions(-)
> > >
> > >
> > > Thanks,
> > >
> > > Dawei
> > >
> > > --
> > > 2.27.0
> > >
> >