Re: [PATCH v3 04/46] perf/x86/intel/cmt: add device initialization and CPU hotplug support

From: Thomas Gleixner
Date: Thu Nov 10 2016 - 10:21:49 EST


On Sat, 29 Oct 2016, David Carrillo-Cisneros wrote:

> +static void free_pkg_data(struct pkg_data *pkg_data)
> +{
> + kfree(pkg_data);
> +}

So this is called from __terminate_pkg_data() which itself is called from
some random other place. Is this a code chasing game or is there a
technical reason why functions which belong together are not grouped
together?

> +/* Init pkg_data for @cpu 's package. */
> +static struct pkg_data *alloc_pkg_data(int cpu)
> +{
> + struct cpuinfo_x86 *c = &cpu_data(cpu);
> + struct pkg_data *pkgd;
> + int numa_node = cpu_to_node(cpu);
> + u16 pkgid = topology_logical_package_id(cpu);

Can you please sort the variables in reverse fir tree order?

u16 pkgid = topology_logical_package_id(cpu);
struct cpuinfo_x86 *c = &cpu_data(cpu);
int numa_node = cpu_to_node(cpu);
struct pkg_data *pkgd;

That's way simpler to parse than this random ordering.

> +
> + if (c->x86_cache_occ_scale != cmt_l3_scale) {

And why would c->x86_cache_occ_scale be intialized already when you really
hotplug a CPU? It cannot be initialized because it is done from
identify_cpu() when the cpu actually starts.

You just never noticed because the driver initialized _AFTER_ all the cpus
are brought up and you never bothered to limit the number of cpus which are
brought up at boot time to a single node and then bring up the other node
_after_ loading the driver.

> + /* 0 scale must have been converted to 1 automatically. */
> + if (c->x86_cache_occ_scale || cmt_l3_scale != 1) {

This check will just explode in your face when you do the above because
c->x86_cache_occ_scale is 0.

So IOW. This is broken and the wrong place to do this.

> + pr_err("Multiple LLC scale values, disabling CMT support.\n");

Interesting. You disable CMT support. That's true for init(). In the real
hotplug case you prevent bringing the cpu up, so the message is misleading
because CMT for the already online cpus is already working and keeps so.

> + return ERR_PTR(-ENXIO);
> + }
> + }
> +
> + pkgd = kzalloc_node(sizeof(*pkgd), GFP_KERNEL, numa_node);
> + if (!pkgd)
> + return ERR_PTR(-ENOMEM);
> +
> + pkgd->max_rmid = c->x86_cache_max_rmid;
> +
> + pkgd->work_cpu = cpu;

This is wrong. This want's to be -1 or something invalid. We now can stop
the hotplug process of a CPU at some random state. So if we stop right
after this callback then this not yet online cpu is set as work cpu and if
we then bring up another cpu in the package then it operates with a stale
work cpu. Please make stuff symmetric. The pre online prep stage is just
there to prepare data and pre initialize it. Anything which is related to
operational state has to be done at the point where things become
operational and undone at the same state when going down.

> + pkgd->pkgid = pkgid;
> +
> + __min_max_rmid = min(__min_max_rmid, pkgd->max_rmid);

What protects against the case where rmids are in use already and this cuts
__min_max_rmid short during hotplug?

> +static int init_pkg_data(int cpu)
> +{
> + struct pkg_data *pkgd;
> + u16 pkgid = topology_logical_package_id(cpu);
> +
> + lockdep_assert_held(&cmt_mutex);
> +
> + /* Verify that this pkgid isn't already initialized. */
> + if (WARN_ON_ONCE(cmt_pkgs_data[pkgid]))
> + return -EPERM;

For one this is a direct dereference of something which claims to be rcu
protected. That's inconsistent.

Further this check is completely pointless. This function is called from
intel_cmt_prep_up() after detecting that there is no package data for this
particular package id. I'm all for defensive programming, but this is just
beyond silly.

Aside of that why are you looking up pkgid in three functions in a row
instead of simply handing it from one to the other?

cmt_prep_up() -> init_pkg_data() -> alloc_pkg_data()

> + pkgd = alloc_pkg_data(cpu);
> + if (IS_ERR(pkgd))
> + return PTR_ERR(pkgd);
> +
> + rcu_assign_pointer(cmt_pkgs_data[pkgid], pkgd);
> + synchronize_rcu();

And this synchronize_rcu() is required because of what? This is the first
CPU of a package being brought up and you are holding the cmt_mutex.

I might be missing something, but if so, then this is missing a comment.

> +static int intel_cmt_prep_down(unsigned int cpu)
> +{
> + struct pkg_data *pkgd;
> + u16 pkgid = topology_logical_package_id(cpu);
> +
> + mutex_lock(&cmt_mutex);
> + pkgd = rcu_dereference_protected(cmt_pkgs_data[pkgid],
> + lockdep_is_held(&cmt_mutex));
> + if (pkgd->work_cpu >= nr_cpu_ids) {
> + /* will destroy pkgd */
> + __terminate_pkg_data(pkgd);

Oh right. You free data _BEFORE_ setting the pointer to NULL and
synchronizing RCU. So anything which is merily using rcu_read_lock() can
get a valid reference and fiddle with freed data. Well done.

> + RCU_INIT_POINTER(cmt_pkgs_data[pkgid], NULL);
> + synchronize_rcu();

> +static int __init cmt_start(void)
> +{
> + char *str, scale[20];
> + int err;
> +
> + /* will be modified by init_pkg_data() in intel_cmt_prep_up(). */
> + __min_max_rmid = UINT_MAX;
> + err = cpuhp_setup_state(CPUHP_PERF_X86_CMT_PREP,
> + "PERF_X86_CMT_PREP",
> + intel_cmt_prep_up,
> + intel_cmt_prep_down);
> + if (err)
> + return err;
> +
> + err = cpuhp_setup_state(CPUHP_AP_PERF_X86_CMT_ONLINE,
> + "AP_PERF_X86_CMT_ONLINE",
> + intel_cmt_hp_online_enter,
> + intel_cmt_hp_online_exit);
> + if (err)
> + goto rm_prep;
> +
> + snprintf(scale, sizeof(scale), "%u", cmt_l3_scale);
> + str = kstrdup(scale, GFP_KERNEL);

That string is duplicated for memory leak detector testing purposes or
what?

> + if (!str) {
> + err = -ENOMEM;
> + goto rm_online;
> + }
> +
> + return 0;
> +
> +rm_online:
> + cpuhp_remove_state(CPUHP_AP_PERF_X86_CMT_ONLINE);
> +rm_prep:
> + cpuhp_remove_state(CPUHP_PERF_X86_CMT_PREP);
> +
> + return err;
> +}
> +
> +static int __init intel_cmt_init(void)
> +{
> + struct pkg_data *pkgd = NULL;
> + int err = 0;
> +
> + if (!x86_match_cpu(intel_cmt_match)) {
> + err = -ENODEV;
> + goto err_exit;

This is crap. If a CPU does NOT support this then printing the
registration failed error is just confusing. If it's not supported, return
-ENODEV and be done with it.

> + }
> +
> + err = cmt_alloc();
> + if (err)
> + goto err_exit;
> +
> + err = cmt_start();
> + if (err)
> + goto err_dealloc;
> +
> + pr_info("Intel CMT enabled with ");
> + rcu_read_lock();
> + while ((pkgd = cmt_pkgs_data_next_rcu(pkgd))) {
> + pr_cont("%d RMIDs for pkg %d, ",
> + pkgd->max_rmid + 1, pkgd->pkgid);

And this is useful because? Because it's so nice to have random stuff printed.

The only valuable information is that this is enabled with the detected
possible number of rmids (__min_max_rmid or whatever incomprehensible
variable name you came up with).

> + }
> + rcu_read_unlock();
> + pr_cont("and l3 scale of %d KBs.\n", cmt_l3_scale);

i.e this should be:

pr_info("Intel CMT enabled. %d RMIDs, L3 scale %d KBs", ....);

Am I missing something?

If you really want to print out the per package rmdis, then the only reason
to do so is when they atually differ and that can be done where you do that
min() check.

> +
> +device_initcall(intel_cmt_init);

Oh no. This wants to be a module from the very beginning. No point on
forcing this as builtin for no reason.

Thanks,

tglx