Re: [PATCH v2 06/32] perf/x86/intel/cqm: add per-package RMIDs, data and locks
From: Thomas Gleixner
Date: Wed May 18 2016 - 12:10:20 EST
On Wed, 11 May 2016, David Carrillo-Cisneros wrote:
> +/* Init cqm pkg_data for @cpu 's package. */
> +static int pkg_data_init_cpu(int cpu)
> +{
> + struct pkg_data *pkg_data;
> + struct cpuinfo_x86 *c = &cpu_data(cpu);
> + u16 pkg_id = topology_physical_package_id(cpu);
> +
> + if (cqm_pkgs_data[pkg_id])
> + return 0;
> +
> +
> + pkg_data = kmalloc_node(sizeof(struct pkg_data),
> + GFP_KERNEL, cpu_to_node(cpu));
kzalloc_node() please
> + if (!pkg_data)
> + return -ENOMEM;
> +
> + pkg_data->max_rmid = c->x86_cache_max_rmid;
> +
> + /* Does hardware has more rmids than this driver can handle? */
> + if (WARN_ON(pkg_data->max_rmid >= INVALID_RMID))
> + pkg_data->max_rmid = INVALID_RMID - 1;
> +
> + if (c->x86_cache_occ_scale != cqm_l3_scale) {
> + pr_err("Multiple LLC scale values, disabling\n");
> + kfree(pkg_data);
> + return -EINVAL;
> + }
Please move this check before the allocation.
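I.e. something along these lines (untested sketch):

	if (c->x86_cache_occ_scale != cqm_l3_scale) {
		pr_err("Multiple LLC scale values, disabling\n");
		return -EINVAL;
	}

	pkg_data = kzalloc_node(sizeof(*pkg_data), GFP_KERNEL,
				cpu_to_node(cpu));
	if (!pkg_data)
		return -ENOMEM;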
> + pkg_data->prmids_by_rmid = kmalloc_node(
> + sizeof(struct prmid *) * (1 + pkg_data->max_rmid),
> + GFP_KERNEL, cpu_to_node(cpu));
kzalloc_node()
> +
> + if (!pkg_data) {
> + kfree(pkg_data);
Huch? You alloc pkg_data->prmids_by_rmid and then check pkg_data, which is
guaranteed to be non-NULL here.
> + return -ENOMEM;
> + }
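That wants to be something like (again kzalloc_node(), and check the pointer
which was actually allocated):

	pkg_data->prmids_by_rmid =
		kzalloc_node(sizeof(struct prmid *) * (pkg_data->max_rmid + 1),
			     GFP_KERNEL, cpu_to_node(cpu));
	if (!pkg_data->prmids_by_rmid) {
		kfree(pkg_data);
		return -ENOMEM;
	}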
> +
> + INIT_LIST_HEAD(&pkg_data->free_prmids_pool);
> +
> + mutex_init(&pkg_data->pkg_data_mutex);
> + raw_spin_lock_init(&pkg_data->pkg_data_lock);
> +
> + /* XXX: Chose randomly*/
Please add coherent comments, e.g.:
/*
* We should select the rotation_cpu randomly in the package,
* because ..... Use the current cpu for now.
*/
> + pkg_data->rotation_cpu = cpu;
> +
> + cqm_pkgs_data[pkg_id] = pkg_data;
> + return 0;
> +}
> +
> +static int intel_cqm_setup_pkg_prmid_pools(u16 pkg_id)
> +{
> + int r;
> + unsigned long flags;
> + struct prmid *prmid;
> + struct pkg_data *pkg_data = cqm_pkgs_data[pkg_id];
This variable ordering is really hard to read.
struct pkg_data *pkg_data = cqm_pkgs_data[pkg_id];
struct prmid *prmid;
unsigned long flags;
int r;
Parses faster, but that's my personal taste and the least of my worries with
this patch.
> +
> + if (!__valid_pkg_id(pkg_id))
> + return -EINVAL;
> +
> + for (r = 0; r <= pkg_data->max_rmid; r++) {
> +
> + prmid = kmalloc_node(sizeof(struct prmid), GFP_KERNEL,
> + cpu_to_node(pkg_data->rotation_cpu));
> + if (!prmid)
> + goto fail;
> +
> + atomic64_set(&prmid->last_read_value, 0L);
> + atomic64_set(&prmid->last_read_time, 0L);
> + INIT_LIST_HEAD(&prmid->pool_entry);
> + prmid->rmid = r;
> +
> + /* Lock needed if called during CPU hotplug. */
> + raw_spin_lock_irqsave_nested(
> + &pkg_data->pkg_data_lock, flags, pkg_id);
> + pkg_data->prmids_by_rmid[r] = prmid;
So this adds the prmid to the array, but it does not link it into the free
pool list. So the fail path is completely bogus.
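If the free pool is supposed to hold all allocated prmids, then the loop body
needs something like this (untested, modulo whatever special treatment RMID 0
requires):

	raw_spin_lock_irqsave_nested(&pkg_data->pkg_data_lock, flags, pkg_id);
	pkg_data->prmids_by_rmid[r] = prmid;
	list_add_tail(&prmid->pool_entry, &pkg_data->free_prmids_pool);
	raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);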
> + /* RMID 0 is special and makes the root of rmid hierarchy. */
This comment does not make any sense here.
> + raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
> + }
> + return 0;
> +fail:
> + while (!list_empty(&pkg_data->free_prmids_pool)) {
> + prmid = list_first_entry(&pkg_data->free_prmids_pool,
> + struct prmid, pool_entry);
> + list_del(&prmid->pool_entry);
> + kfree(pkg_data->prmids_by_rmid[prmid->rmid]);
> + kfree(prmid);
> + }
> + return -ENOMEM;
Please split out the fail path into a separate function so it can be reused
for error handling and module removal.
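Something like this helper (untested sketch; it assumes all allocated prmids
are reachable via the free pool once the above is fixed):

static void intel_cqm_free_pkg_prmid_pool(struct pkg_data *pkg_data)
{
	struct prmid *prmid, *tmp;

	list_for_each_entry_safe(prmid, tmp, &pkg_data->free_prmids_pool,
				 pool_entry) {
		list_del(&prmid->pool_entry);
		pkg_data->prmids_by_rmid[prmid->rmid] = NULL;
		kfree(prmid);
	}
}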
> +/*
> + * Determine if @a and @b measure the same set of tasks.
> + *
> + * If @a and @b measure the same set of tasks then we want to share a
> + * single RMID.
> + */
> +static bool __match_event(struct perf_event *a, struct perf_event *b)
Can you please split out the event support and just keep the infrastructure
setup in this patch for review purposes? We can fold stuff together once we
are done with this.
> +static inline void cqm_pick_event_reader(int cpu)
> +{
> + u16 pkg_id = topology_physical_package_id(cpu);
> + /* XXX: lock, check if rotation cpu is online, maybe */
So I assume 'XXX' means: FIXME or such.
1) If you have no idea whether you need the lock then you better make your
mind up before posting.
2) @cpu is online as this is called from CPU_STARTING resp. from the
init code.
So please fix this stuff and/or remove crap comments like the above.
> + /*
> + * Pick a reader if there isn't one already.
> + */
> + if (cqm_pkgs_data[pkg_id]->rotation_cpu != -1)
> + cqm_pkgs_data[pkg_id]->rotation_cpu = cpu;
This does not do what the comment says. If rotation_cpu is -1, i.e. not
assigned, then it stays that way, and if it is assigned it gets overwritten
unconditionally.
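If the intent is to install a reader only when none is assigned yet, the
condition needs to be inverted:

	if (cqm_pkgs_data[pkg_id]->rotation_cpu == -1)
		cqm_pkgs_data[pkg_id]->rotation_cpu = cpu;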
> +}
> +
> +static void intel_cqm_cpu_starting(unsigned int cpu)
> +{
> + struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
> + struct cpuinfo_x86 *c = &cpu_data(cpu);
> + u16 pkg_id = topology_physical_package_id(cpu);
> +
> + state->rmid = 0;
> + state->closid = 0;
That's a pointless exercise. If the cpu was never in use then this is 0. If
the cpu goes down, then this had better be cleaned up before it vanishes.
> +
> + /* XXX: lock */
Sigh.
> + /* XXX: Make sure this case is handled when hotplug happens. */
> + WARN_ON(c->x86_cache_max_rmid != cqm_pkgs_data[pkg_id]->max_rmid);
> + WARN_ON(c->x86_cache_occ_scale != cqm_l3_scale);
If your code depends on the consistency then you better fix this now. No point
in emitting a warning with a well known stack trace and then letting the code
fall over somewhere else.
The proper thing to do here is to do the check and mark the package unusable
if the check fails. That's all we can do here.
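E.g. (untested sketch; the 'unusable' flag does not exist yet, it would have
to be added to struct pkg_data and honoured everywhere else):

	if (c->x86_cache_max_rmid != cqm_pkgs_data[pkg_id]->max_rmid ||
	    c->x86_cache_occ_scale != cqm_l3_scale) {
		pr_err("CQM: inconsistent RMID count or scale on package %u\n",
		       pkg_id);
		cqm_pkgs_data[pkg_id]->unusable = true;
		return;
	}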
> +}
> +
> +static void intel_cqm_cpu_exit(unsigned int cpu)
> +{
> + /*
> + * Is @cpu a designated cqm reader?
> + */
> + u16 pkg_id = topology_physical_package_id(cpu);
> +
> + if (cqm_pkgs_data[pkg_id]->rotation_cpu != cpu)
> + return;
> + /* XXX: do remove unused packages */
XXX: No. Leave allocated memory around and be done with it. Especially as you
must handle CPU_DOWN_FAILED ....
> + cqm_pkgs_data[pkg_id]->rotation_cpu = cpumask_any_but(
> + topology_core_cpumask(cpu), cpu);
> +}
> +
> +static int intel_cqm_cpu_notifier(struct notifier_block *nb,
> + unsigned long action, void *hcpu)
> +{
> + unsigned int cpu = (unsigned long)hcpu;
> +
> + switch (action & ~CPU_TASKS_FROZEN) {
> + case CPU_DOWN_PREPARE:
> + intel_cqm_cpu_exit(cpu);
> + break;
Misses CPU_DOWN_FAILED handling
> + case CPU_STARTING:
> + pkg_data_init_cpu(cpu);
This definitely never saw any testing. You cannot call pkg_data_init_cpu()
from here. This code is called with interrupts disabled on the upcoming cpu.
The allocation stuff must happen before this in UP_PREPARE and fail properly
if the allocation fails. Otherwise you dereference a NULL pointer in
cqm_pick_event_reader() ...
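Roughly (untested; the exact CPU_DOWN_FAILED handling is a guess, but at the
very least a reader has to be reinstalled on that package):

	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_UP_PREPARE:
		/* May sleep; fail the hotplug operation if allocation fails */
		if (pkg_data_init_cpu(cpu))
			return NOTIFY_BAD;
		break;
	case CPU_STARTING:
		intel_cqm_cpu_starting(cpu);
		cqm_pick_event_reader(cpu);
		break;
	case CPU_DOWN_PREPARE:
		intel_cqm_cpu_exit(cpu);
		break;
	case CPU_DOWN_FAILED:
		cqm_pick_event_reader(cpu);
		break;
	}

	return NOTIFY_OK;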
> + intel_cqm_cpu_starting(cpu);
> + cqm_pick_event_reader(cpu);
> + break;
> + }
> +
> + return NOTIFY_OK;
> +}
> +
> +static const struct x86_cpu_id intel_cqm_match[] = {
> + { .vendor = X86_VENDOR_INTEL, .feature = X86_FEATURE_CQM_OCCUP_LLC },
> + {}
> +};
> +
> +static int __init intel_cqm_init(void)
> +{
> + char *str, scale[20];
> + int i, cpu, ret = 0, min_max_rmid = 0;
> +
> + if (!x86_match_cpu(intel_cqm_match))
> + return -ENODEV;
> +
> + cqm_l3_scale = boot_cpu_data.x86_cache_occ_scale;
> + if (WARN_ON(cqm_l3_scale == 0))
There is no point in using a WARN_ON in a well known call chain.
> + cqm_l3_scale = 1;
> +
> + cqm_pkgs_data = kmalloc(
> + sizeof(struct pkg_data *) * topology_max_packages(),
> + GFP_KERNEL);
> + if (!cqm_pkgs_data)
> + return -ENOMEM;
> +
> + for (i = 0; i < topology_max_packages(); i++)
> + cqm_pkgs_data[i] = NULL;
kzalloc() is your friend.
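I.e.:

	cqm_pkgs_data = kcalloc(topology_max_packages(),
				sizeof(struct pkg_data *), GFP_KERNEL);
	if (!cqm_pkgs_data)
		return -ENOMEM;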
> + /*
> + * It's possible that not all resources support the same number
> + * of RMIDs. Instead of making scheduling much more complicated
> + * (where we have to match a task's RMID to a cpu that supports
> + * that many RMIDs) just find the minimum RMIDs supported across
> + * all cpus.
> + *
> + * Also, check that the scales match on all cpus.
> + */
> + cpu_notifier_register_begin();
> +
> + /* XXX: assert all cpus in pkg have same nr rmids (they should). */
Your commentary is really annoying. This XXX crap is just telling me that this
whole thing is half baked.
> + for_each_online_cpu(cpu) {
> + ret = pkg_data_init_cpu(cpu);
> + if (ret)
> + goto error;
> + }
> +
> + /* Select the minimum of the maximum rmids to use as limit for
> + * threshold. XXX: per-package threshold.
> + */
> + cqm_pkg_id_for_each_online(i) {
> + if (min_max_rmid < cqm_pkgs_data[i]->max_rmid)
> + min_max_rmid = cqm_pkgs_data[i]->max_rmid;
> + intel_cqm_setup_pkg_prmid_pools(i);
So if intel_cqm_setup_pkg_prmid_pools() fails, we happily proceed, right?
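At minimum the return value needs to be checked, e.g.:

	ret = intel_cqm_setup_pkg_prmid_pools(i);
	if (ret)
		goto error;

and the error path then has to undo the pools which were already set up, see
below.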
> + /*
> + * A reasonable upper limit on the max threshold is the number
> + * of lines tagged per RMID if all RMIDs have the same number of
> + * lines tagged in the LLC.
> + *
> + * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
> + */
> + __intel_cqm_max_threshold =
> + boot_cpu_data.x86_cache_size * 1024 / (min_max_rmid + 1);
> +
> + snprintf(scale, sizeof(scale), "%u", cqm_l3_scale);
> + str = kstrdup(scale, GFP_KERNEL);
> + if (!str) {
> + ret = -ENOMEM;
Sure, we have no memory and leak the already allocated stuff. This wants
proper error handling. And by proper error handling I don't mean
error1:
error2:
....
errorN:
style at the end of this function. This wants to be a proper rollback which
can also be used for module unload.
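Something along these lines (untested sketch, reusing a pool-free helper like
the one sketched above; the exact members depend on the final data structures):

static void cqm_cleanup(void)
{
	int i;

	if (!cqm_pkgs_data)
		return;

	for (i = 0; i < topology_max_packages(); i++) {
		if (!cqm_pkgs_data[i])
			continue;
		intel_cqm_free_pkg_prmid_pool(cqm_pkgs_data[i]);
		kfree(cqm_pkgs_data[i]->prmids_by_rmid);
		kfree(cqm_pkgs_data[i]);
	}
	kfree(cqm_pkgs_data);
	cqm_pkgs_data = NULL;
	kfree(event_attr_intel_cqm_llc_scale.event_str);
}

Then both the init error paths and the module exit path can simply call
cqm_cleanup().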
> + goto error;
> + }
> + event_attr_intel_cqm_llc_scale.event_str = str;
> +
> + for_each_online_cpu(i) {
> + intel_cqm_cpu_starting(i);
> + cqm_pick_event_reader(i);
> + }
> +
> + __perf_cpu_notifier(intel_cqm_cpu_notifier);
> +
> + ret = perf_pmu_register(&intel_cqm_pmu, "intel_cqm", -1);
> + if (ret)
> + goto error;
> +
> + cpu_notifier_register_done();
> +
> + pr_info("Intel CQM monitoring enabled with at least %u rmids per package.\n",
> + min_max_rmid + 1);
> +
> + return ret;
> +
> +error:
> + pr_err("Intel CQM perf registration failed: %d\n", ret);
> + cpu_notifier_register_done();
> +
> + return ret;
> +}
> +
> +device_initcall(intel_cqm_init);
Please make this a modular driver from the very beginning.
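I.e. provide an exit path and register via module_init()/module_exit()
(sketch; the cpu notifier registered in intel_cqm_init() has to be torn down
here as well):

static void __exit intel_cqm_exit(void)
{
	/* NOTE: unregister the cpu notifier from intel_cqm_init() here too */
	perf_pmu_unregister(&intel_cqm_pmu);
	cqm_cleanup();
}

module_init(intel_cqm_init);
module_exit(intel_cqm_exit);
MODULE_LICENSE("GPL");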
Thanks,
tglx