Re: [PATCH 07/12] x86/rdt,cqm: Scheduling support update

From: Shivappa Vikas
Date: Tue Jan 17 2017 - 17:29:54 EST

On Tue, 17 Jan 2017, Thomas Gleixner wrote:

On Fri, 6 Jan 2017, Vikas Shivappa wrote:
Introduce a scheduling hook finish_arch_pre_lock_switch which is
called just after the perf sched_in during context switch. This method
handles both CAT and CQM sched_in scenarios.

Sure, we need yet another special hook. What's wrong with
finish_arch_post_lock_switch()?

And again. This wants to be a separate patch to the core code with a proper
justification for that hook. Dammit. Changelogs are supposed to explain WHY
not WHAT. How often do I have to explain that again?

Will fix. Will split this into three:
the sched hook patch (using finish_arch_post_lock_switch), the perf sched_in patch, and the actual write to the MSR.
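For the hook patch, I was thinking of something along these lines (sketch only; where exactly the override lands is still open, other arches do it from a header that kernel/sched/core.c already pulls in, e.g. asm/mmu_context.h):

/*
 * Sketch: reuse the existing core hook instead of adding a new one.
 * kernel/sched/core.c falls back to a no-op unless the arch defines
 * finish_arch_post_lock_switch().
 */
#define finish_arch_post_lock_switch finish_arch_post_lock_switch
static inline void finish_arch_post_lock_switch(void)
{
	intel_rdt_sched_in();
}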


The IA32_PQR_ASSOC MSR is used by both CAT (cache allocation) and CQM, and this
patch integrates the two MSR writes into one. The common sched_in patch
checks if the per-cpu cache has a different RMID or CLOSid than the task
and does the MSR write.

During sched_in the task uses the task RMID if the task is monitored or
else uses the task's cgroup rmid.

And that's relevant for that patch because it explains the existing
behaviour of the RMID, right?

Darn, again you create an unreviewable hodgepodge of changes. The whole
split of the RMID handling into a perf part and the actual RMID update can
be done as a separate patch before switching over to the combined
RMID/CLOSID update mechanism.

+DEFINE_STATIC_KEY_FALSE(cqm_enable_key);
+
/*
* Groups of events that have the same target(s), one RMID per group.
*/
@@ -108,7 +103,7 @@ struct sample {
* Likewise, an rmid value of -1 is used to indicate "no rmid currently
* assigned" and is used as part of the rotation code.
*/
-static inline bool __rmid_valid(u32 rmid)
+bool __rmid_valid(u32 rmid)

And once more this becomes global because there is no user outside of cqm.c.

{
if (!rmid || rmid > cqm_max_rmid)
return false;
@@ -161,7 +156,7 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid, int domain)
*
* We expect to be called with cache_mutex held.
*/
-static u32 __get_rmid(int domain)
+u32 __get_rmid(int domain)

Ditto.

Will fix the unnecessary globals. These should have been removed in this version; all of this should have gone away with the removal of continuous monitoring.


{
struct list_head *cqm_flist;
struct cqm_rmid_entry *entry;
@@ -368,6 +363,23 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
}

+#ifdef CONFIG_CGROUP_PERF
+struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
+{
+ struct cgrp_cqm_info *ccinfo = NULL;
+ struct perf_cgroup *pcgrp;
+
+ pcgrp = perf_cgroup_from_task(tsk, NULL);
+
+ if (!pcgrp)
+ return NULL;
+ else
+ ccinfo = cgrp_to_cqm_info(pcgrp);
+
+ return ccinfo;

What the heck? Either you do:

struct cgrp_cqm_info *ccinfo = NULL;
struct perf_cgroup *pcgrp;

pcgrp = perf_cgroup_from_task(tsk, NULL);
if (pcgrp)
ccinfo = cgrp_to_cqm_info(pcgrp);

return ccinfo;

Will fix.


or

struct perf_cgroup *pcgrp;

pcgrp = perf_cgroup_from_task(tsk, NULL);
if (pcgrp)
return cgrp_to_cqm_info(pcgrp);
return NULL;

But the above combination does not make any sense at all. Hacking at it until
it compiles and works by chance is not really a good engineering principle.

+}
+#endif
+
static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
{
if (rmid != NULL) {
@@ -713,26 +725,27 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
static void intel_cqm_event_start(struct perf_event *event, int mode)
{
struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
- u32 rmid;

if (!(event->hw.cqm_state & PERF_HES_STOPPED))
return;

event->hw.cqm_state &= ~PERF_HES_STOPPED;

- alloc_needed_pkg_rmid(event->hw.cqm_rmid);
-
- rmid = event->hw.cqm_rmid[pkg_id];
- state->rmid = rmid;
- wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
+ if (is_task_event(event)) {
+ alloc_needed_pkg_rmid(event->hw.cqm_rmid);
+ state->next_task_rmid = event->hw.cqm_rmid[pkg_id];

Huch? When is this going to be evaluated? Assume the task is running on a
CPU in NOHZ full mode in user space w/o ever going through schedule. How is
that supposed to activate the event ever? Not, AFAICT.

+ }
}

static void intel_cqm_event_stop(struct perf_event *event, int mode)
{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
if (event->hw.cqm_state & PERF_HES_STOPPED)
return;

event->hw.cqm_state |= PERF_HES_STOPPED;
+ state->next_task_rmid = 0;

Ditto.

}

static int intel_cqm_event_add(struct perf_event *event, int mode)
@@ -1366,6 +1379,8 @@ static int __init intel_cqm_init(void)
if (mbm_enabled)
pr_info("Intel MBM enabled\n");

+ static_branch_enable(&cqm_enable_key);
+
+++ b/arch/x86/include/asm/intel_pqr_common.h
@@ -0,0 +1,38 @@
+#ifndef _ASM_X86_INTEL_PQR_COMMON_H
+#define _ASM_X86_INTEL_PQR_COMMON_H
+
+#ifdef CONFIG_INTEL_RDT
+
+#include <linux/jump_label.h>
+#include <linux/types.h>

Missing newline to separate include blocks

+#include <asm/percpu.h>
+#include <asm/msr.h>
+#include <asm/intel_rdt_common.h>
+
+void __intel_rdt_sched_in(void);
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ * which supports resource control and we enable by mounting the
+ * resctrl file system.
+ * - Caches the per cpu CLOSid values and does the MSR write only
+ * when a task with a different CLOSid is scheduled in.
+ */
+static inline void intel_rdt_sched_in(void)
+{
+ if (static_branch_likely(&rdt_enable_key) ||
+ static_branch_unlikely(&cqm_enable_key)) {

Groan. Why do you need TWO keys in order to make this decision? Just
because the context switch path is not yet bloated enough?

Will fix it to use a common rdt enable key for both. Will also disable this when no event is monitored, which is the most likely case.
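Roughly something like this (sketch only; assuming the single combined key keeps the rdt_enable_key name):

static inline void intel_rdt_sched_in(void)
{
	/* One key covers both allocation (CAT) and monitoring (CQM). */
	if (static_branch_likely(&rdt_enable_key))
		__intel_rdt_sched_in();
}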


+ __intel_rdt_sched_in();
+ }
+}
+
+#else
+
+static inline void intel_rdt_sched_in(void) {}
+
+#endif
+#endif
diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 95ce5c8..3b4a099 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -5,7 +5,6 @@

#include <linux/kernfs.h>
#include <linux/jump_label.h>
-

Darn. Your random whitespace changes are annoying. And this one is actually
wrong. The newline is there on purpose to separate the linux and asm includes
visually.

Will fix.


#include <asm/intel_rdt_common.h>


diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
index e11ed5e..544acaa 100644
--- a/arch/x86/include/asm/intel_rdt_common.h
+++ b/arch/x86/include/asm/intel_rdt_common.h
@@ -18,12 +18,23 @@
*/
struct intel_pqr_state {
u32 rmid;
+ u32 next_task_rmid;
u32 closid;
int rmid_usecnt;
};

DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);

+u32 __get_rmid(int domain);
+bool __rmid_valid(u32 rmid);
+void alloc_needed_pkg_rmid(u32 *cqm_rmid);
+struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk);
+
+extern struct cgrp_cqm_info cqm_rootcginfo;

Yet another completely pointless export.

+DECLARE_STATIC_KEY_FALSE(cqm_enable_key);
+DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
+
/**
* struct cgrp_cqm_info - perf_event cgroup metadata for cqm
* @cont_mon Continuous monitoring flag

diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
new file mode 100644
index 0000000..c3c50cd
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt_common.c
@@ -0,0 +1,98 @@
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/cacheinfo.h>
+#include <linux/cpuhotplug.h>

See above.

+#include <asm/intel-family.h>
+#include <asm/intel_rdt.h>
+
+/*
+ * The cached intel_pqr_state is strictly per CPU and can never be
+ * updated from a remote CPU. Both functions which modify the state
+ * (intel_cqm_event_start and intel_cqm_event_stop) are called with
+ * interrupts disabled, which is sufficient for the protection.

Just copying comments without actually fixing them is lazy at best. There
are other functions which modify that state.

+ */
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+#define pkg_id topology_physical_package_id(smp_processor_id())

How often are you replicating this macro? Get rid of it completely
everywhere instead of spreading this crap all over the place.

Will fix


+#ifdef CONFIG_INTEL_RDT_M
+static inline int get_cgroup_sched_rmid(void)
+{
+#ifdef CONFIG_CGROUP_PERF
+ struct cgrp_cqm_info *ccinfo = NULL;
+
+ ccinfo = cqminfo_from_tsk(current);
+
+ if (!ccinfo)
+ return 0;
+
+ /*
+ * A cgroup is always monitoring for itself or
+ * for an ancestor(default is root).
+ */
+ if (ccinfo->mon_enabled) {
+ alloc_needed_pkg_rmid(ccinfo->rmid);
+ return ccinfo->rmid[pkg_id];
+ } else {
+ alloc_needed_pkg_rmid(ccinfo->mfa->rmid);
+ return ccinfo->mfa->rmid[pkg_id];
+ }
+#endif
+
+ return 0;
+}
+
+static inline int get_sched_in_rmid(void)
+{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+ u32 rmid = 0;
+
+ rmid = state->next_task_rmid;
+
+ return rmid ? rmid : get_cgroup_sched_rmid();

So this effectively splits the perf sched in work into two parts. One is
handling the perf sched in mechanics and the other is doing the eventual
cgroup muck including rmid allocation.

Why on earth can't the perf sched in work update state->rmid right away and
this extra sched in function just handle CLOSID/RMID updates to the MSR ?
Because that would actually make sense, right?

I really can't see how all of this hackery, which goes through loops and
hoops over and over, will make context switches faster. I'm inclined to bet
that the current code, which is halfway sane, is faster even with two
writes.

Yes, will fix this.
It was terrible of me to forget to change this when I removed continuous monitoring. It should indeed make the sched code simpler, which is always better.

Originally this was wanted because perf sched_in calls all the cgroup ancestor events for each event in the hierarchy, but we cannot have two different RMIDs monitored at the same time (in cases where multiple cgroups in the same hierarchy are monitored). So we just introduce the _NO_RECURSION flag for perf to call only the event and not its ancestors.
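With that, the sched hook should reduce to roughly the following (sketch only; last_rmid/last_closid are made-up names for a "last value written to the MSR" cache):

void __intel_rdt_sched_in(void)
{
	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

	/*
	 * The perf start/stop/sched_in paths have already updated
	 * state->rmid and state->closid; only sync the MSR here.
	 */
	if (state->rmid == state->last_rmid &&
	    state->closid == state->last_closid)
		return;

	state->last_rmid = state->rmid;
	state->last_closid = state->closid;
	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
}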


+}
+#endif
+
+/*
+ * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ *
+ * Following considerations are made so that this has minimal impact
+ * on scheduler hot path:
+ * - This will stay as no-op unless we are running on an Intel SKU
+ * which supports resource control and we enable by mounting the
+ * resctrl file system or it supports resource monitoring.
+ * - Caches the per cpu CLOSid/RMID values and does the MSR write only
+ * when a task with a different CLOSid/RMID is scheduled in.
+ */
+void __intel_rdt_sched_in(void)
+{
+ struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+ int closid = 0;
+ u32 rmid = 0;

So rmid is a u32 and closid is an int. Are you rolling dice when choosing
data types?

Will fix.

Thanks,
Vikas


Thanks,

tglx