Re: [PATCH v6 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver
From: Ganapatrao Kulkarni
Date: Tue Jul 17 2018 - 01:21:23 EST
Hi Mark,
On Fri, Jul 13, 2018 at 6:28 PM, Ganapatrao Kulkarni <gklkml16@xxxxxxxxx> wrote:
> thanks Mark for the comments.
>
> On Thu, Jul 12, 2018 at 10:26 PM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>> Hi Ganapat,
>>
>> On Thu, Jun 21, 2018 at 12:03:38PM +0530, Ganapatrao Kulkarni wrote:
>>> This patch adds a perf driver for the PMU UNCORE devices DDR4 Memory
>>> Controller(DMC) and Level 3 Cache(L3C).
>>>
>>> ThunderX2 has 8 independent DMC PMUs to capture performance events
>>> corresponding to 8 channels of DDR4 Memory Controller and 16 independent
>>> L3C PMUs to capture events corresponding to 16 tiles of L3 cache.
>>> Each PMU supports up to 4 counters. All counters lack overflow interrupt
>>> and are sampled periodically.
>>>
>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@xxxxxxxxxx>
>>> ---
>>> drivers/perf/Kconfig | 8 +
>>> drivers/perf/Makefile | 1 +
>>> drivers/perf/thunderx2_pmu.c | 949 +++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/cpuhotplug.h | 1 +
>>> 4 files changed, 959 insertions(+)
>>> create mode 100644 drivers/perf/thunderx2_pmu.c
>>>
>>> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
>>> index 08ebaf7..ecedb9e 100644
>>> --- a/drivers/perf/Kconfig
>>> +++ b/drivers/perf/Kconfig
>>> @@ -87,6 +87,14 @@ config QCOM_L3_PMU
>>> Adds the L3 cache PMU into the perf events subsystem for
>>> monitoring L3 cache events.
>>>
>>> +config THUNDERX2_PMU
>>> + bool "Cavium ThunderX2 SoC PMU UNCORE"
>>> + depends on ARCH_THUNDER2 && ARM64 && ACPI
>>
>> Surely this depends on NUMA for the node id?
> Yes, it does; I will update.
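With the NUMA dependency folded in, the Kconfig entry might look like this (a sketch only; the final dependency list and help text are still to be confirmed for the next version):

```
config THUNDERX2_PMU
	bool "Cavium ThunderX2 SoC PMU UNCORE"
	depends on ARCH_THUNDER2 && ARM64 && ACPI && NUMA
	help
	  Provides support for the ThunderX2 SoC uncore PMU devices
	  (DMC and L3C) in the perf events subsystem, for monitoring
	  memory controller and L3 cache events.
```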
>
>>
>> [...]
>>
>>> +/*
>>> + * pmu on each socket has 2 uncore devices(dmc and l3),
>>> + * each uncore device has up to 16 channels, each channel can sample
>>> + * events independently with counters up to 4.
>>> + */
>>> +struct thunderx2_pmu_uncore_channel {
>>> + struct pmu pmu;
>>> + struct hlist_node node;
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + int channel;
>>> + int cpu;
>>> + DECLARE_BITMAP(active_counters, UNCORE_MAX_COUNTERS);
>>> + struct perf_event *events[UNCORE_MAX_COUNTERS];
>>> + struct hrtimer hrtimer;
>>> + /* to sync counter alloc/release */
>>> + raw_spinlock_t lock;
>>> +};
>>
>> This lock should not be necessary. The pmu::{add,del} callbacks are
>> strictly serialised w.r.t. one another per CPU because the core perf
>> code holds ctx->lock whenever it calls either.
>>
>> Previously, you mentioned you'd seen scheduling while atomic when this
>> lock was removed. Could you please provide a backtrace?
>
> Yes, I have seen it when I try to kick off more events simultaneously
> than there are h/w counters available (4 here).
> I will try with v6 and send the trace ASAP.
I no longer have the setup on which I saw this issue. I have tried to
reproduce it on my current setup and cannot see it as before. I will
remove these locks if I don't see any issue while preparing the next
version. Sorry for the noise!
>
>>
>> [...]
>>
>> It would be worth a comment here explaining which resources the channel
>> PMUs share, and why we need this thunderx2_pmu_uncore_dev. IIUC, it's
>> just that they share a register windows switched by FW?
>>
>>> +struct thunderx2_pmu_uncore_dev {
>>> + char *name;
>>> + struct device *dev;
>>> + enum thunderx2_uncore_type type;
>>> + void __iomem *base;
>>> + int node;
>>> + u32 max_counters;
>>> + u32 max_channels;
>>> + u32 max_events;
>>> + u64 hrtimer_interval;
>>> + /* this lock synchronizes across channels */
>>> + raw_spinlock_t lock;
>>> + const struct attribute_group **attr_groups;
>>> + void (*init_cntr_base)(struct perf_event *event,
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev);
>>> + void (*select_channel)(struct perf_event *event);
>>> + void (*stop_event)(struct perf_event *event);
>>> + void (*start_event)(struct perf_event *event, int flags);
>>> +};
>>
>> As a general thing, the really long structure/enum/macro names make
>> things really hard to read.
>>
>> Could we s/THUNDERX2/TX2/ and s/thunderx2/tx2/ for identifiers please?
>
> thanks, will do.
>>
>> Similarly:
>>
>> * s/thunderx2_pmu_uncore_channel/tx2_pmu/
>>
>> * s/thunderx2_pmu_uncore_dev/tx2_pmu_group/
>>
>> ... and consistently name those "pmu" and "pmu_group" respectively --
>> that makes the relationship between the two much clearer for someone who
>> is not intimately familiar with the hardware.
>
> These (uncore) PMU counters only make sense to someone who is
> familiar with the hardware.
> IMHO, the current names are aligned to the hardware design (channels,
> tiles, counters, etc.),
> which is explained in PATCH 1/2.
>
> I will try to simplify the long names.
>
>>
>> That makes things far more legible, e.g.
>>
>>> +static inline struct thunderx2_pmu_uncore_channel *
>>> +pmu_to_thunderx2_pmu_uncore(struct pmu *pmu)
>>> +{
>>> + return container_of(pmu, struct thunderx2_pmu_uncore_channel, pmu);
>>> +}
>>
>> ... becomes:
>>
>> static struct tx2_pmu *pmu_to_tx2_pmu(struct pmu *pmu)
>> {
>> return container_of(pmu, struct tx2_pmu, pmu);
>> }
>>
>> [...]
>>
>>> +/*
>>> + * sysfs cpumask attributes
>>> + */
>>> +static ssize_t cpumask_show(struct device *dev,
>>> + struct device_attribute *attr, char *buf)
>>> +{
>>> + struct cpumask cpu_mask;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
>>> + pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev));
>>> +
>>> + cpumask_clear(&cpu_mask);
>>> + cpumask_set_cpu(pmu_uncore->cpu, &cpu_mask);
>>> + return cpumap_print_to_pagebuf(true, buf, &cpu_mask);
>>> +}
>>> +static DEVICE_ATTR_RO(cpumask);
>>
>> This can be simplified to:
>>
>> static ssize_t cpumask_show(struct device *dev, struct device_attribute
>> *attr, char *buf)
>> {
>> struct thunderx2_pmu_uncore_channel *pmu_uncore;
>> pmu_uncore = pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev));
>>
>> return cpumap_print_to_pagebuf(true, buf, cpumask_of(pmu_uncore->cpu));
>> }
>
> thanks, will do
>>
>> [...]
>>
>>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
>>> +{
>>> + int counter;
>>> +
>>> + raw_spin_lock(&pmu_uncore->lock);
>>> + counter = find_first_zero_bit(pmu_uncore->active_counters,
>>> + pmu_uncore->uncore_dev->max_counters);
>>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
>>> + raw_spin_unlock(&pmu_uncore->lock);
>>> + return -ENOSPC;
>>> + }
>>> + set_bit(counter, pmu_uncore->active_counters);
>>> + raw_spin_unlock(&pmu_uncore->lock);
>>> + return counter;
>>> +}
>>> +
>>> +static void free_counter(
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore, int counter)
>>> +{
>>> + raw_spin_lock(&pmu_uncore->lock);
>>> + clear_bit(counter, pmu_uncore->active_counters);
>>> + raw_spin_unlock(&pmu_uncore->lock);
>>> +}
>>
>> As above, I still don't believe that these locks are necessary, unless
>> we have a bug elsewhere.
>
> I will try to debug and provide you with more info.
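As a sanity check of Mark's point, here is a small userspace model of alloc_counter()/free_counter() with the raw_spinlock dropped; since perf core already serialises pmu::add/del per PMU via ctx->lock, plain bit operations should suffice (the names mirror the driver, but this is standalone illustration code, not the patch itself):

```c
#include <assert.h>

#define UNCORE_MAX_COUNTERS 4

/* Userspace stand-in for the per-channel active_counters bitmap. */
static unsigned long active_counters;

/* Find and claim the first free counter; no lock needed because the
 * callers (pmu::add/del) are serialised per PMU by perf core. */
static int alloc_counter(void)
{
	int counter;

	for (counter = 0; counter < UNCORE_MAX_COUNTERS; counter++) {
		if (!(active_counters & (1UL << counter))) {
			active_counters |= 1UL << counter;
			return counter;
		}
	}
	return -1;	/* the driver returns -ENOSPC here */
}

static void free_counter(int counter)
{
	active_counters &= ~(1UL << counter);
}
```

Allocating four counters succeeds, a fifth fails, and a freed slot is handed out again, matching the driver's intended behaviour.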
>
>>
>> [...]
>>
>>> +/*
>>> + * DMC and L3 counter interface is muxed across all channels.
>>> + * hence we need to select the channel before accessing counter
>>> + * data/control registers.
>>> + *
>>> + * L3 Tile and DMC channel selection is through SMC call
>>> + * SMC call arguments,
>>> + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id)
>>> + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel)
>>> + * x2 = Node id
>>> + * x3 = DMC(1)/L3C(0)
>>> + * x4 = channel Id
>>> + */
>>
>> Please document which ID space the node id is in, e.g.
>>
>> x2 = FW Node ID (matching SRAT/SLIT IDs)
>>
>> ... so that it's clear that this isn't a Linux logical node id, even if
>> those happen to be the same today.
>
> sure.
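For context, the channel-select helper behind that comment boils down to a single vendor SMC, roughly as sketched below (a kernel-style sketch for discussion, using arm_smccc_smc() as the driver does, but not the exact patch code):

```
/*
 * Sketch: select the DMC channel / L3C tile whose counters the muxed
 * register window should expose.  Argument layout as documented above.
 */
static void uncore_select_channel(struct perf_event *event)
{
	struct arm_smccc_res res;
	struct thunderx2_pmu_uncore_channel *pmu_uncore =
		pmu_to_thunderx2_pmu_uncore(event->pmu);

	arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
		      pmu_uncore->uncore_dev->node,	/* FW node id */
		      pmu_uncore->uncore_dev->type == PMU_TYPE_DMC,
		      pmu_uncore->channel, 0, 0, 0, &res);
}
```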
>>
>> [...]
>>
>>> +static void thunderx2_uncore_start(struct perf_event *event, int flags)
>>> +{
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + unsigned long irqflags;
>>> +
>>> + hwc->state = 0;
>>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> + uncore_dev = pmu_uncore->uncore_dev;
>>> +
>>> + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
>>> + uncore_dev->select_channel(event);
>>> + uncore_dev->start_event(event, flags);
>>> + raw_spin_unlock_irqrestore(&uncore_dev->lock, irqflags);
>>> + perf_event_update_userpage(event);
>>> +
>>> + if (!find_last_bit(pmu_uncore->active_counters,
>>> + pmu_uncore->uncore_dev->max_counters)) {
>>
>> This would be clearer using !bitmap_empty().
>
> thanks.
>>
>>> + hrtimer_start(&pmu_uncore->hrtimer,
>>> + ns_to_ktime(uncore_dev->hrtimer_interval),
>>> + HRTIMER_MODE_REL_PINNED);
>>> + }
>>> +}
>>
>> [...]
>>
>>> +static enum hrtimer_restart thunderx2_uncore_hrtimer_callback(
>>> + struct hrtimer *hrt)
>>
>> s/hrt/timer/ please
>
> sure, thanks.
>>
>>> +{
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>> + unsigned long irqflags;
>>> + int idx;
>>> + bool restart_timer = false;
>>> +
>>> + pmu_uncore = container_of(hrt, struct thunderx2_pmu_uncore_channel,
>>> + hrtimer);
>>> +
>>> + raw_spin_lock_irqsave(&pmu_uncore->uncore_dev->lock, irqflags);
>>> + for_each_set_bit(idx, pmu_uncore->active_counters,
>>> + pmu_uncore->uncore_dev->max_counters) {
>>> + struct perf_event *event = pmu_uncore->events[idx];
>>> +
>>> + thunderx2_uncore_update(event);
>>> + restart_timer = true;
>>> + }
>>> + raw_spin_unlock_irqrestore(&pmu_uncore->uncore_dev->lock, irqflags);
>>> +
>>> + if (restart_timer)
>>> + hrtimer_forward_now(hrt,
>>> + ns_to_ktime(
>>> + pmu_uncore->uncore_dev->hrtimer_interval));
>>> +
>>> + return restart_timer ? HRTIMER_RESTART : HRTIMER_NORESTART;
>>> +}
>>
>> You don't need to take the lock at all if there are no active events,
>> and we can avoid all the conditional logic that way. e.g. assuming the
>> renames I requested above:
>>
>> static enum hrtimer_restart tx2_hrtimer_callback(struct hrtimer *timer)
>> {
>>         struct tx2_pmu *pmu;
>>         struct tx2_pmu_group *pmu_group;
>>         unsigned long irqflags;
>>         int max_counters;
>>         int idx;
>>
>>         pmu = container_of(timer, struct tx2_pmu, hrtimer);
>>         pmu_group = pmu->pmu_group;
>>         max_counters = pmu_group->max_counters;
>>
>>         if (bitmap_empty(pmu->active_counters, max_counters))
>>                 return HRTIMER_NORESTART;
>>
>>         raw_spin_lock_irqsave(&pmu_group->lock, irqflags);
>>         for_each_set_bit(idx, pmu->active_counters, max_counters) {
>>                 struct perf_event *event = pmu->events[idx];
>>
>>                 tx2_uncore_update(event);
>>         }
>>         raw_spin_unlock_irqrestore(&pmu_group->lock, irqflags);
>>
>>         hrtimer_forward_now(timer, ns_to_ktime(pmu_group->hrtimer_interval));
>>
>>         return HRTIMER_RESTART;
>> }
>
> Thanks, I will adopt this function.
>>
>> [...]
>>
>>> + switch (uncore_dev->type) {
>>> + case PMU_TYPE_L3C:
>>> + uncore_dev->max_counters = UNCORE_MAX_COUNTERS;
>>> + uncore_dev->max_channels = UNCORE_L3_MAX_TILES;
>>> + uncore_dev->max_events = L3_EVENT_MAX;
>>> + uncore_dev->hrtimer_interval = UNCORE_HRTIMER_INTERVAL;
>>> + uncore_dev->attr_groups = l3c_pmu_attr_groups;
>>> + uncore_dev->name = devm_kasprintf(dev, GFP_KERNEL,
>>> + "uncore_l3c_%d", uncore_dev->node);
>>> + uncore_dev->init_cntr_base = init_cntr_base_l3c;
>>> + uncore_dev->start_event = uncore_start_event_l3c;
>>> + uncore_dev->stop_event = uncore_stop_event_l3c;
>>> + uncore_dev->select_channel = uncore_select_channel;
>>> + break;
>>> + case PMU_TYPE_DMC:
>>> + uncore_dev->max_counters = UNCORE_MAX_COUNTERS;
>>> + uncore_dev->max_channels = UNCORE_DMC_MAX_CHANNELS;
>>> + uncore_dev->max_events = DMC_EVENT_MAX;
>>> + uncore_dev->hrtimer_interval = UNCORE_HRTIMER_INTERVAL;
>>> + uncore_dev->attr_groups = dmc_pmu_attr_groups;
>>> + uncore_dev->name = devm_kasprintf(dev, GFP_KERNEL,
>>> + "uncore_dmc_%d", uncore_dev->node);
>>> + uncore_dev->init_cntr_base = init_cntr_base_dmc;
>>> + uncore_dev->start_event = uncore_start_event_dmc;
>>> + uncore_dev->stop_event = uncore_stop_event_dmc;
>>> + uncore_dev->select_channel = uncore_select_channel;
>>> + break;
>>
>> We should probably s/uncore/tx2/, or s/uncore/thunderx2/ to namespace
>> this.
>
> OK, I would prefer tx2.
>>
>>> +static int thunderx2_uncore_pmu_offline_cpu(unsigned int cpu,
>>> + struct hlist_node *node)
>>> +{
>>> + int new_cpu;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>> +
>>> + pmu_uncore = hlist_entry_safe(node,
>>> + struct thunderx2_pmu_uncore_channel, node);
>>> + if (cpu != pmu_uncore->cpu)
>>> + return 0;
>>> +
>>> + new_cpu = cpumask_any_and(
>>> + cpumask_of_node(pmu_uncore->uncore_dev->node),
>>> + cpu_online_mask);
>>> + if (new_cpu >= nr_cpu_ids)
>>> + return 0;
>>> +
>>> + pmu_uncore->cpu = new_cpu;
>>> + perf_pmu_migrate_context(&pmu_uncore->pmu, cpu, new_cpu);
>>> + return 0;
>>> +}
>>
>> We'll also need a onlining callback. Consider if all CPUs in a NUMA node
>> were offlined, then we tried to online an arbitrary CPU from that node.
>
> sure, will add.
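For the record, the onlining callback would look roughly like this (a sketch for discussion; the version I send may differ):

```
/*
 * Sketch: if all CPUs of the PMU's node went offline, the PMU has no
 * valid owning CPU; adopt the first CPU of that node to come online.
 */
static int thunderx2_uncore_pmu_online_cpu(unsigned int cpu,
		struct hlist_node *node)
{
	struct thunderx2_pmu_uncore_channel *pmu_uncore;

	pmu_uncore = hlist_entry_safe(node,
			struct thunderx2_pmu_uncore_channel, node);

	/* Nothing to do if the PMU still has an online owning CPU. */
	if (cpu_online(pmu_uncore->cpu))
		return 0;

	if (cpu_to_node(cpu) == pmu_uncore->uncore_dev->node)
		pmu_uncore->cpu = cpu;

	return 0;
}
```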
>>
>> Thanks,
>> Mark.
>
> thanks
> Ganapat
thanks
Ganapat