Re: [PATCH v6 1/2] drivers: perf: Add LLC-TAD perf counter support
From: Will Deacon
Date: Tue Oct 26 2021 - 05:44:07 EST
On Mon, Oct 18, 2021 at 09:00:56PM +0530, Bhaskara Budiredla wrote:
> This driver adds support for Last-level cache tag-and-data unit
> (LLC-TAD) PMU that is featured in some of the Marvell's CN10K
> infrastructure silicons.
>
> The LLC is divided into 2N slices distributed across N Mesh tiles
> in a single-socket configuration. The driver always configures the
> same counter for all of the TADs. The user would end up effectively
> reserving one of eight counters in every TAD to look across all TADs.
> The occurrences of events are aggregated and presented to the user
> at the end of an application run. The driver does not provide a way
> for the user to partition TADs so that different TADs are used for
> different applications.
Is that something you will want to do in the future? If you go with your
current approach of exposing a single "tad" unit to userspace, then you
won't be able to change that.
For the L3 PMUs (including on TX2). we expose per-node PMUs so why shouldn't
we do something similar here and expose each TAD region separately? Even if
userspace drives them all together, it gives you more flexibility in the
future if you _do_ want to be partition them up.
> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
> index 5260b116c7da..2db5418d5b0a 100644
> --- a/drivers/perf/Makefile
> +++ b/drivers/perf/Makefile
> @@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
> obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
> obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
> obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
> +obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
> diff --git a/drivers/perf/marvell_cn10k_tad_pmu.c b/drivers/perf/marvell_cn10k_tad_pmu.c
> new file mode 100644
> index 000000000000..aebb1a0028dc
> --- /dev/null
> +++ b/drivers/perf/marvell_cn10k_tad_pmu.c
> @@ -0,0 +1,429 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Marvell CN10K LLC-TAD perf driver
> + *
> + * Copyright (C) 2021 Marvell
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#define pr_fmt(fmt) "tad_pmu: " fmt
> +
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/of_address.h>
> +#include <linux/of_device.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +
> +#define TAD_PFC_OFFSET 0x0
> +#define TAD_PFC(counter) (TAD_PFC_OFFSET | (counter << 3))
> +#define TAD_PRF_OFFSET 0x100
> +#define TAD_PRF(counter) (TAD_PRF_OFFSET | (counter << 3))
> +#define TAD_PRF_CNTSEL_MASK 0xFF
> +#define TAD_MAX_COUNTERS 8
> +
> +#define to_tad_pmu(p) (container_of(p, struct tad_pmu, pmu))
> +
> +struct tad_region {
> + void __iomem *base;
> +};
> +
> +struct tad_pmu {
> + struct pmu pmu;
> + struct tad_region *regions;
> + u32 region_cnt;
> + unsigned int cpu;
> + struct hlist_node node;
> + struct perf_event *events[TAD_MAX_COUNTERS];
> + DECLARE_BITMAP(counters_map, TAD_MAX_COUNTERS);
> +};
> +
> +static int tad_pmu_cpuhp_state;
> +
> +static void tad_pmu_event_counter_read(struct perf_event *event)
> +{
> + struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu);
> + struct hw_perf_event *hwc = &event->hw;
> + u32 counter_idx = hwc->idx;
> + u64 delta, prev, new;
> + int i;
> +
> + do {
> + prev = local64_read(&hwc->prev_count);
> + for (i = 0, new = 0; i < tad_pmu->region_cnt; i++)
> + new += readq(tad_pmu->regions[i].base +
> + TAD_PFC(counter_idx));
> + } while (local64_cmpxchg(&hwc->prev_count, prev, new) != prev);
If we expose each TAD individually, then this won't matter, but I'd be
inclined to move the counter summation outside of the cmpxchg() loop
given that readq (why not _relaxed?) is probably quite slow.
> +
> + delta = (new - prev) & GENMASK_ULL(63, 0);
This mask doesn't do anything.
> + local64_add(delta, &event->count);
> +}
> +
> +static void tad_pmu_event_counter_stop(struct perf_event *event, int flags)
> +{
> + struct tad_pmu *tad_pmu = to_tad_pmu(event->pmu);
> + struct hw_perf_event *hwc = &event->hw;
> + u32 counter_idx = hwc->idx;
> + int i;
> +
> + /* TAD()_PFC() stop counting on the write
> + * which sets TAD()_PRF()[CNTSEL] == 0
> + */
> + for (i = 0; i < tad_pmu->region_cnt; i++)
> + writeq_relaxed(0, tad_pmu->regions[i].base +
> + TAD_PRF(counter_idx));
Please use braces around a multi-line conditional statement.
Will