Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver
From: Ganapatrao Kulkarni
Date: Tue May 15 2018 - 06:33:28 EST
Hi Mark,
On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni <gklkml16@xxxxxxxxx> wrote:
> Hi Mark,
>
> On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>> Hi,
>>
>> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>>> +
>>> +/* L3c and DMC has 16 and 8 channels per socket respectively.
>>> + * Each Channel supports UNCORE PMU device and consists of
>>> + * 4 independent programmable counters. Counters are 32 bit
>>> + * and does not support overflow interrupt, they needs to be
>>> + * sampled before overflow(i.e, at every 2 seconds).
>>> + */
>>> +
>>> +#define UNCORE_MAX_COUNTERS 4
>>> +#define UNCORE_L3_MAX_TILES 16
>>> +#define UNCORE_DMC_MAX_CHANNELS 8
>>> +
>>> +#define UNCORE_HRTIMER_INTERVAL (2 * NSEC_PER_SEC)
>>
>> How was a period of two seconds chosen?
>
> It has been suggested from hw team to sample before 3 to 4 seconds.
>
>>
>> What's the maximum clock speed for the L3C and DMC?
>
> L3C at 1.5GHz and DMC at 1.2GHz
>>
>> Given that all channels compete for access to the muxed register
>> interface, I suspect we need to try more often than once every 2
>> seconds...
>
> 2 seconds seems to be sufficient. So far testing looks good.
>
>>
>> [...]
>>
>>> +struct active_timer {
>>> + struct perf_event *event;
>>> + struct hrtimer hrtimer;
>>> +};
>>> +
>>> +/*
>>> + * pmu on each socket has 2 uncore devices(dmc and l3),
>>> + * each uncore device has up to 16 channels, each channel can sample
>>> + * events independently with counters up to 4.
>>> + *
>>> + * struct thunderx2_pmu_uncore_channel created per channel.
>>> + * struct thunderx2_pmu_uncore_dev per uncore device.
>>> + */
>>> +struct thunderx2_pmu_uncore_channel {
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + struct pmu pmu;
>>
>> Can we put the pmu first in the struct, please?
>
> ok
>>
>>> + int counter;
>>
>> AFAICT, this counter field is never used.
>
> hmm ok, will remove.
>>
>>> + int channel;
>>> + DECLARE_BITMAP(counter_mask, UNCORE_MAX_COUNTERS);
>>> + struct active_timer *active_timers;
>>
>> You should only need a single timer per channel, rather than one per
>> event.
>>
>> I think you can get rid of the active_timer structure, and have:
>>
>> struct perf_event *events[UNCORE_MAX_COUNTERS];
>> struct hrtimer timer;
>>
>
> thanks, will change as suggested.
>
>>> + /* to sync counter alloc/release */
>>> + raw_spinlock_t lock;
>>> +};
>>> +
>>> +struct thunderx2_pmu_uncore_dev {
>>> + char *name;
>>> + struct device *dev;
>>> + enum thunderx2_uncore_type type;
>>> + unsigned long base;
>>
>> This should be:
>>
>> void __iomem *base;
>
> ok
>>
>> [...]
>>
>>> +static ssize_t cpumask_show(struct device *dev,
>>> + struct device_attribute *attr, char *buf)
>>> +{
>>> + struct cpumask cpu_mask;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
>>> + pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev));
>>> +
>>> + /* Pick first online cpu from the node */
>>> + cpumask_clear(&cpu_mask);
>>> + cpumask_set_cpu(cpumask_first(
>>> + cpumask_of_node(pmu_uncore->uncore_dev->node)),
>>> + &cpu_mask);
>>> +
>>> + return cpumap_print_to_pagebuf(true, buf, &cpu_mask);
>>> +}
>>
>> AFAICT cpumask_of_node() returns a mask that can include offline CPUs.
>>
>> Regardless, I don't think that you can keep track of the management CPU
>> this way. Please keep track of the CPU the PMU should be managed by,
>> either with a cpumask or int embedded within the PMU structure.
>
> thanks, will add hotplug callbacks.
>>
>> At hotplug time, you'll need to update the management CPU, calling
>> perf_pmu_migrate_context() when it is offlined.
>
> ok
>>
>> [...]
>>
>>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
>>> +{
>>> + int counter;
>>> +
>>> + raw_spin_lock(&pmu_uncore->lock);
>>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
>>> + pmu_uncore->uncore_dev->max_counters);
>>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
>>> + raw_spin_unlock(&pmu_uncore->lock);
>>> + return -ENOSPC;
>>> + }
>>> + set_bit(counter, pmu_uncore->counter_mask);
>>> + raw_spin_unlock(&pmu_uncore->lock);
>>> + return counter;
>>> +}
>>> +
>>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
>>> + int counter)
>>> +{
>>> + raw_spin_lock(&pmu_uncore->lock);
>>> + clear_bit(counter, pmu_uncore->counter_mask);
>>> + raw_spin_unlock(&pmu_uncore->lock);
>>> +}
>>
>> I don't believe that locking is required in either of these, as the perf
>> core serializes pmu::add() and pmu::del(), where these get called.
without this locking, i am seeing "BUG: scheduling while atomic" when
i run perf with more events together than the maximum counters
supported
>
> thanks, it seems to be not required.
>>
>>> +
>>> +/*
>>> + * DMC and L3 counter interface is muxed across all channels.
>>> + * hence we need to select the channel before accessing counter
>>> + * data/control registers.
>>
>> Are there separate interfaces for all-dmc-channels and all-l3c-channels?
>
> separate interface for DMC and L3c.
> channels within DMC/L3C are muxed.
>
>>
>> ... or is a single interface used by all-dmc-and-l3c-channels?
>>
>>> + *
>>> + * L3 Tile and DMC channel selection is through SMC call
>>> + * SMC call arguments,
>>> + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id)
>>> + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel)
>>> + * x2 = Node id
>>
>> How do we map Linux node IDs to the firmware's view of node IDs?
>>
>> I don't believe the two are necessarily the same -- Linux's node IDs are
>> a Linux-specific construct.
>
> both are same, it is numa node id from ACPI/firmware.
>
>>
>> It would be much nicer if we could pass something based on the MPIDR,
>> which is a known HW construct, or if this implicitly affected the
>> current node.
>
> IMO, node id is sufficient.
>
>>
>> It would be vastly more sane for this to not be muxed at all. :/
>
> i am helpless due to crappy hw design!
>
>>
>>> + * x3 = DMC(1)/L3C(0)
>>> + * x4 = channel Id
>>> + */
>>> +static void uncore_select_channel(struct perf_event *event)
>>> +{
>>> + struct arm_smccc_res res;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
>>> + pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> +
>>> + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
>>> + pmu_uncore->uncore_dev->node,
>>> + pmu_uncore->uncore_dev->type,
>>> + pmu_uncore->channel, 0, 0, 0, &res);
>>> +
>>> +}
>>
>> Why aren't we checking the return value of the SMC call?
>
> thanks, i will add.
>
>>
>> The muxing and SMC sound quite scary. :/
>
> i too agree!, however i have no other option since the muxing register
> is a secure register.
>>
>>> +
>>> +static void uncore_start_event_l3c(struct perf_event *event, int flags)
>>> +{
>>> + u32 val;
>>> + struct hw_perf_event *hwc = &event->hw;
>>> +
>>> + /* event id encoded in bits [07:03] */
>>> + val = GET_EVENTID(event) << 3;
>>> + reg_writel(val, hwc->config_base);
>>> +
>>> + if (flags & PERF_EF_RELOAD) {
>>> + u64 prev_raw_count =
>>> + local64_read(&event->hw.prev_count);
>>> + reg_writel(prev_raw_count, hwc->event_base);
>>> + }
>>> + local64_set(&event->hw.prev_count,
>>> + reg_readl(hwc->event_base));
>>
>> It would be simpler to ignore PERF_EF_RELOAD, and always reprogram the
>> prev_count and HW to zero.
>
> ok, will change
>>
>>> +}
>>> +
>>> +static void uncore_start_event_dmc(struct perf_event *event, int flags)
>>> +{
>>> + u32 val, event_shift = 8;
>>> + struct hw_perf_event *hwc = &event->hw;
>>> +
>>> + /* enable and start counters and read current value in prev_count */
>>> + val = reg_readl(hwc->config_base);
>>> +
>>> + /* 8 bits for each counter,
>>> + * bits [05:01] of a counter to set event type.
>>> + */
>>> + reg_writel((val & ~(0x1f << (((GET_COUNTERID(event)) *
>>> + event_shift) + 1))) |
>>> + (GET_EVENTID(event) <<
>>> + (((GET_COUNTERID(event)) * event_shift) + 1)),
>>> + hwc->config_base);
>>
>> This would be far more legible if val were constructed before the
>> reg_writel(), especially if there were a helper for the event field
>> shifting, e.g.
>>
>> #define DMC_EVENT_CFG(idx, val) ((val) << (((idx) * 8) + 1))
>>
>> int id = GET_COUNTERID(event);
>> int event = GET_EVENTID(event);
>>
>> val = reg_readl(hwc->config_base);
>> val &= ~DMC_EVENT_CFG(id, 0x1f);
>> val |= DMCC_EVENT_CFG(id, event);
>> reg_writel(val, hwc->config_base));
>>
>> What are bits 7:6 and 1 used for?
>
> not used/reserved bits.
>
>>
>>> +
>>> + if (flags & PERF_EF_RELOAD) {
>>> + u64 prev_raw_count =
>>> + local64_read(&event->hw.prev_count);
>>> + reg_writel(prev_raw_count, hwc->event_base);
>>> + }
>>> + local64_set(&event->hw.prev_count,
>>> + reg_readl(hwc->event_base));
>>
>>
>> As with the L3C events, it would be simpler to always reprogram the
>> prev_count and HW to zero.
>
> ok
>>
>>> +}
>>> +
>>> +static void uncore_stop_event_l3c(struct perf_event *event)
>>> +{
>>> + reg_writel(0, event->hw.config_base);
>>> +}
>>> +
>>> +static void uncore_stop_event_dmc(struct perf_event *event)
>>> +{
>>> + u32 val, event_shift = 8;
>>> + struct hw_perf_event *hwc = &event->hw;
>>> +
>>> + val = reg_readl(hwc->config_base);
>>> + reg_writel((val & ~(0xff << ((GET_COUNTERID(event)) * event_shift))),
>>> + hwc->config_base);
>>
>>
>> This could be simplified with the helper proposed above.
>
> ok
>>
>>> +}
>>> +
>>> +static void init_cntr_base_l3c(struct perf_event *event,
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev)
>>> +{
>>> +
>>> + struct hw_perf_event *hwc = &event->hw;
>>> +
>>> + /* counter ctrl/data reg offset at 8 */
>>
>> Offset 8, or stride 8?
>
> stride 8
>>
>> What does the register layout look like?
>>
>>> + hwc->config_base = uncore_dev->base
>>> + + L3C_COUNTER_CTL + (8 * GET_COUNTERID(event));
>>> + hwc->event_base = uncore_dev->base
>>> + + L3C_COUNTER_DATA + (8 * GET_COUNTERID(event));
>>> +}
>>> +
>>> +static void init_cntr_base_dmc(struct perf_event *event,
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev)
>>> +{
>>> +
>>> + struct hw_perf_event *hwc = &event->hw;
>>> +
>>> + hwc->config_base = uncore_dev->base
>>> + + DMC_COUNTER_CTL;
>>> + /* counter data reg offset at 0xc */
>>
>> A stride of 0xc seems unusual.
>>
>> What does the register layout look like?
>
> the offsets are at 0x640, 0x64c, 0x658, 0x664,
>>
>>> + hwc->event_base = uncore_dev->base
>>> + + DMC_COUNTER_DATA + (0xc * GET_COUNTERID(event));
>>> +}
>>> +
>>> +static void thunderx2_uncore_update(struct perf_event *event)
>>> +{
>>> + s64 prev, new = 0;
>>> + u64 delta;
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>> + enum thunderx2_uncore_type type;
>>> +
>>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> + type = pmu_uncore->uncore_dev->type;
>>
>> AFAICT this variable is not used below.
>
> thanks.
>>
>>> +
>>> + if (pmu_uncore->uncore_dev->select_channel)
>>> + pmu_uncore->uncore_dev->select_channel(event);
>>
>> This should always be non-NULL, right?
>
> yes, unwanted check at the moment, will remove.
>
>>
>> [...]
>>
>>> +static bool thunderx2_uncore_validate_event_group(struct perf_event *event)
>>> +{
>>> + struct pmu *pmu = event->pmu;
>>> + struct perf_event *leader = event->group_leader;
>>> + struct perf_event *sibling;
>>> + int counters = 0;
>>> +
>>> + if (leader->pmu != event->pmu && !is_software_event(leader))
>>> + return false;
>>> +
>>> + counters++;
>>
>> I don't think this is right when event != leader and the leader is a SW
>> event. In that case, the leader doesn't take a HW counter.
>
> I think this check to avoid the grouping of multiple hw PMUs ,
> however it is allowed to group sw events along with hw PMUs.
>
>>
>>> +
>>> + for_each_sibling_event(sibling, event->group_leader) {
>>> + if (is_software_event(sibling))
>>> + continue;
>>> + if (sibling->pmu != pmu)
>>> + return false;
>>> + counters++;
>>> + }
>>> +
>>> + /*
>>> + * If the group requires more counters than the HW has,
>>> + * it cannot ever be scheduled.
>>> + */
>>> + return counters <= UNCORE_MAX_COUNTERS;
>>> +}
>>
>> [...]
>>
>>> +static int thunderx2_uncore_event_init(struct perf_event *event)
>>> +{
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>
>>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> +
>>> + if (!pmu_uncore)
>>> + return -ENODEV;
>>
>> This cannot happen, given pmu_to_thunderx2_pmu_uncore() is a wrapper
>> around container_of().
>
> thanks, unnecessary check!
>>
>>> +
>>> + /* Pick first online cpu from the node */
>>> + event->cpu = cpumask_first(
>>> + cpumask_of_node(pmu_uncore->uncore_dev->node));
>>
>> I don't believe this is safe. You must keep track of which CPU is
>> managing the PMU, with hotplug callbacks.
>
> ok, will add callbacks.
>>
>> [...]
>>
>>> +static void thunderx2_uncore_start(struct perf_event *event, int flags)
>>> +{
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + unsigned long irqflags;
>>> + struct active_timer *active_timer;
>>> +
>>> + hwc->state = 0;
>>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> + uncore_dev = pmu_uncore->uncore_dev;
>>> +
>>> + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
>>> +
>>> + if (uncore_dev->select_channel)
>>> + uncore_dev->select_channel(event);
>>> + uncore_dev->start_event(event, flags);
>>> + raw_spin_unlock_irqrestore(&uncore_dev->lock, irqflags);
>>> +
>>> + perf_event_update_userpage(event);
>>> + active_timer = &pmu_uncore->active_timers[GET_COUNTERID(event)];
>>> + active_timer->event = event;
>>> +
>>> + hrtimer_start(&active_timer->hrtimer,
>>> + ns_to_ktime(uncore_dev->hrtimer_interval),
>>> + HRTIMER_MODE_REL_PINNED);
>>> +}
>>
>> Please use a single hrtimer, and update *all* of the events when it
>> fires.
>
> sure
>>
>> I *think* that can be done in the pmu::pmu_enable() and
>> pmu::pmu_disable() callbacks.
>
> ok
>>
>> Are there control bits to enable/disable all counters, or can that only
>> be done through the event configuration registers?
>
> only through config register.
>>
>>> +static void thunderx2_uncore_stop(struct perf_event *event, int flags)
>>> +{
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + unsigned long irqflags;
>>> +
>>> + if (hwc->state & PERF_HES_UPTODATE)
>>> + return;
>>> +
>>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> + uncore_dev = pmu_uncore->uncore_dev;
>>> +
>>> + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
>>> +
>>> + if (uncore_dev->select_channel)
>>> + uncore_dev->select_channel(event);
>>
>> AFAICT this cannot be NULL.
>
> ok.
>>
>> [...]
>>
>>> +static int thunderx2_pmu_uncore_register(
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore)
>>> +{
>>> + struct device *dev = pmu_uncore->uncore_dev->dev;
>>> + char *name = pmu_uncore->uncore_dev->name;
>>> + int channel = pmu_uncore->channel;
>>> +
>>> + /* Perf event registration */
>>> + pmu_uncore->pmu = (struct pmu) {
>>> + .attr_groups = pmu_uncore->uncore_dev->attr_groups,
>>> + .task_ctx_nr = perf_invalid_context,
>>> + .event_init = thunderx2_uncore_event_init,
>>> + .add = thunderx2_uncore_add,
>>> + .del = thunderx2_uncore_del,
>>> + .start = thunderx2_uncore_start,
>>> + .stop = thunderx2_uncore_stop,
>>> + .read = thunderx2_uncore_read,
>>> + };
>>> +
>>> + pmu_uncore->pmu.name = devm_kasprintf(dev, GFP_KERNEL,
>>> + "%s_%d", name, channel);
>>
>> Does the channel idx take the NUMA node into account?
>
> name already has node id suffix.
>>
>> [...]
>>
>>> +static int thunderx2_pmu_uncore_add(struct thunderx2_pmu_uncore_dev *uncore_dev,
>>> + int channel)
>>> +{
>>
>>> + /* we can run up to (max_counters * max_channels) events
>>> + * simultaneously, allocate hrtimers per channel.
>>> + */
>>> + pmu_uncore->active_timers = devm_kzalloc(uncore_dev->dev,
>>> + sizeof(struct active_timer) * uncore_dev->max_counters,
>>> + GFP_KERNEL);
>>
>> Please just fold a single hrtimer into the thunderx2_pmu_uncore_channel
>> structure, and you can get rid of this allocation...
>
> sure, i will rewrite code to have timer per channel and all active
> counters are updated when timer fires.
>
>>
>>> +
>>> + for (counter = 0; counter < uncore_dev->max_counters; counter++) {
>>> + hrtimer_init(&pmu_uncore->active_timers[counter].hrtimer,
>>> + CLOCK_MONOTONIC,
>>> + HRTIMER_MODE_REL);
>>> + pmu_uncore->active_timers[counter].hrtimer.function =
>>> + thunderx2_uncore_hrtimer_callback;
>>> + }
>>
>> ... and simplify this initialization.
>
> ok
>>
>> [...]
>>
>>> +static struct thunderx2_pmu_uncore_dev *init_pmu_uncore_dev(
>>> + struct device *dev, acpi_handle handle,
>>> + struct acpi_device *adev, u32 type)
>>> +{
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + unsigned long base;
>>
>>> + base = (unsigned long)devm_ioremap_resource(dev, &res);
>>> + if (IS_ERR((void *)base)) {
>>> + dev_err(dev, "PMU type %d: Fail to map resource\n", type);
>>> + return NULL;
>>> + }
>>
>> Please treat this as a void __iomem *base.
>
> ok
>>
>> [...]
>>
>>> +static int thunderx2_uncore_probe(struct platform_device *pdev)
>>> +{
>>> + struct device *dev = &pdev->dev;
>>> + struct arm_smccc_res res;
>>> +
>>> + set_dev_node(dev, acpi_get_node(ACPI_HANDLE(dev)));
>>> +
>>> + /* Make sure firmware supports DMC/L3C set channel smc call */
>>> + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
>>> + dev_to_node(dev), 0, 0, 0, 0, 0, &res);
>>> + if (res.a0) {
>>> + dev_err(dev, "No Firmware support for PMU UNCORE(%d)\n",
>>> + dev_to_node(dev));
>>> + return -ENODEV;
>>> + }
>>
>> Please re-use the uncore_select_channel() wrapper rather than
>> open-coding this.
>
> ok.
>>
>> Which FW supports this interface?
>
> it is through vendor specific ATF runtime services.
>>
>> Thanks,
>> Mark.
>
> thanks,
> Ganapat
thanks
Ganapat