Re: [PATCH] x86: Intel Cache Allocation Technology support

From: Vikas Shivappa
Date: Fri Nov 21 2014 - 16:25:56 EST



Correcting email address for Matt.

On Wed, 19 Nov 2014, Vikas Shivappa wrote:

What is Cache Allocation Technology ( CAT )
-------------------------------------------

Cache Allocation Technology provides a way for the software (OS/VMM) to
restrict cache allocation to a defined 'subset' of the cache, which may
overlap with other 'subsets'. This feature is used when allocating a
line in the cache, i.e. when pulling new data into the cache. The
hardware is programmed via MSRs.

The different cache subsets are identified by a CLOS identifier (class of
service) and each CLOS has a CBM (cache bit mask). The CBM is a
contiguous set of bits which defines the amount of cache resource
available to each 'subset'.

Why is CAT (cache allocation technology) needed
------------------------------------------------

CAT enables more cache resources to be made available to higher
priority applications, based on guidance from the execution
environment.

The architecture also allows these subsets to be changed dynamically at
runtime, to further optimize the performance of the higher priority
application with minimal degradation to the lower priority applications.
Additionally, resources can be rebalanced for overall system throughput.

This technique may be useful in managing large computer systems with a
large LLC, for example large servers running instances of web servers or
database servers. In such complex systems, these subsets allow more
careful placement of the available cache resources.

This CAT kernel patch provides a basic kernel framework for users to
implement such cache subsets.

Kernel Implementation
---------------------

This patch implements a cgroup subsystem to support cache allocation.
Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping. A CLOS
(class of service) is represented by a CLOSid, which is internal to the
kernel and not exposed to the user. Each cgroup has one CBM and
represents one cache 'subset'.

The cgroup follows the usual cgroup hierarchy; mkdir and adding tasks to
the cgroup never fail. When a child cgroup is created it inherits the
CLOSid and the CBM from its parent. When a user changes the default
CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
in use before. Changing the 'cbm' may fail with -ENOSPC once the
kernel runs out of CLOSids it can support.
Users can create as many cgroups as they want, but the number of distinct
CBMs in use at the same time is limited by the maximum number of CLOSids
(multiple cgroups can share the same CBM).
The kernel maintains a CLOSid <-> CBM mapping which keeps a reference
count of the cgroups using each CLOSid.
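
To make the CLOSid reuse and -ENOSPC behaviour concrete, here is a minimal
user-space sketch of the bookkeeping (illustration only, not the patch's
code; the 4-CLOSid limit and the assign_closid() helper are assumptions
made for the example):

	#include <errno.h>
	#include <stdio.h>

	#define MAX_CLOSIDS 4			/* assumed h/w limit */

	struct closcbm_map {
		unsigned long cbm;
		unsigned int ref;		/* cgroups using this CLOSid */
	};

	static struct closcbm_map ccmap[MAX_CLOSIDS] = {
		{ .cbm = 0xfffff, .ref = 1 },	/* CLOS0: root, all bits set */
	};

	static int assign_closid(unsigned long cbm)
	{
		int i, free = -1;

		for (i = 0; i < MAX_CLOSIDS; i++) {
			if (ccmap[i].ref && ccmap[i].cbm == cbm) {
				ccmap[i].ref++;	/* reuse matching CLOSid */
				return i;
			}
			if (!ccmap[i].ref && free < 0)
				free = i;
		}
		if (free < 0)
			return -ENOSPC;		/* out of CLOSids */
		ccmap[free].cbm = cbm;
		ccmap[free].ref = 1;
		return free;
	}

	int main(void)
	{
		printf("cbm 0xf  -> closid %d\n", assign_closid(0xf));
		printf("cbm 0xf  -> closid %d (reused)\n", assign_closid(0xf));
		printf("cbm 0x3f -> closid %d\n", assign_closid(0x3f));
		return 0;
	}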

The tasks in a cgroup are allowed to fill the portion of the LLC
represented by the cgroup's 'cbm' file.

The root directory has all available bits set in its 'cbm' file by
default.

Assignment of CBM, CLOS
-----------------------

The 'cbm' needs to be a subset of the parent node's 'cbm'. Any
contiguous subset of these bits (with a minimum of 2 bits) may be set to
indicate the cache mapping desired. The 'cbm' between 2 directories can
overlap. The 'cbm' represents the cache 'subset' of the CAT cgroup.
For example, on a system with a maximum cbm of 16 bits, if the directory
has the least significant 4 bits set in its 'cbm' file (meaning the 'cbm'
is just 0xf), it is allocated the right quarter of the last level cache,
which means the tasks belonging to this CAT cgroup can fill only that
quarter of the cache. If it has the most significant 8 bits set, it is
allocated the left half of the cache (8 bits out of 16 represents 50%).

The cache portion defined in the CBM file is available to all tasks
within the cgroup to fill, and these tasks are not allowed to allocate
space in other parts of the cache.
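
As a quick check of the 'cbm' to cache-share relation above, a small
user-space sketch (assuming a hypothetical 16-bit wide CBM; not part of
the patch):

	#include <stdio.h>

	int main(void)
	{
		unsigned long cbm = 0xf;	/* least significant 4 of 16 bits */
		unsigned int max_cbm_len = 16;	/* assumed width for the example */

		/* Each set bit is one equal share of the LLC. */
		printf("cbm 0x%lx covers %d/%u of the LLC\n",
		       cbm, __builtin_popcountl(cbm), max_cbm_len);
		return 0;
	}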

Scheduling and Context Switch
------------------------------

During a context switch, the kernel writes the CLOSid (internally
maintained by the kernel) of the cgroup to which the task belongs into
the CPU's IA32_PQR_ASSOC MSR.
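
A minimal sketch of that per-switch update, mirroring the
cacheqe_sched_in() helper in the patch below (task_closid() here is an
illustrative stand-in for looking up the CLOSid of the task's cgroup):

	static inline void cat_sched_in(struct task_struct *task)
	{
		unsigned int lo, hi;

		/* The active CLOSid lives in the high 32 bits of the MSR. */
		rdmsr(IA32_PQR_ASSOC, lo, hi);
		if (hi != task_closid(task))
			wrmsr(IA32_PQR_ASSOC, lo, task_closid(task));
	}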

Reviewed-by: Matt Flemming <matt.flemming@xxxxxxxxx>
Tested-by: Priya Autee <priya.v.autee@xxxxxxxxx>
Signed-off-by: Vikas Shivappa <vikas.shivappa@xxxxxxxxxxxxxxx>
---
arch/x86/include/asm/cacheqe.h | 144 +++++++++++
arch/x86/include/asm/cpufeature.h | 4 +
arch/x86/include/asm/processor.h | 5 +-
arch/x86/kernel/cpu/Makefile | 5 +
arch/x86/kernel/cpu/cacheqe.c | 487 ++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/common.c | 21 ++
include/linux/cgroup_subsys.h | 5 +
init/Kconfig | 22 ++
kernel/sched/core.c | 4 +-
kernel/sched/sched.h | 24 ++
10 files changed, 718 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/include/asm/cacheqe.h
create mode 100644 arch/x86/kernel/cpu/cacheqe.c

diff --git a/arch/x86/include/asm/cacheqe.h b/arch/x86/include/asm/cacheqe.h
new file mode 100644
index 0000000..91d175e
--- /dev/null
+++ b/arch/x86/include/asm/cacheqe.h
@@ -0,0 +1,144 @@
+#ifndef _CACHEQE_H_
+#define _CACHEQE_H_
+
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <linux/cpumask.h>
+#include <linux/seq_file.h>
+#include <linux/rcupdate.h>
+#include <linux/kernel_stat.h>
+#include <linux/err.h>
+
+#ifdef CONFIG_CGROUP_CACHEQE
+
+#define IA32_PQR_ASSOC 0xc8f
+#define IA32_PQR_MASK(x) ((u64)(x) << 32)
+
+/* maximum possible cbm length */
+#define MAX_CBM_LENGTH 32
+
+#define IA32_CBMMAX_MASK(x) (0xffffffff & ~(((u64)1 << (x)) - 1))
+
+#define IA32_CBM_MASK 0xffffffff
+#define IA32_L3_CBM_BASE 0xc90
+#define CQECBMMSR(x) (IA32_L3_CBM_BASE + x)
+
+#ifdef CONFIG_CACHEQE_DEBUG
+#define CQE_DEBUG(X) do { pr_info X; } while (0)
+#else
+#define CQE_DEBUG(X)
+#endif
+
+extern bool cqe_genable;
+
+struct cacheqe_subsys_info {
+ unsigned long *closmap;
+};
+
+struct cacheqe {
+ struct cgroup_subsys_state css;
+
+ /* class of service for the group*/
+ unsigned int clos;
+ /* corresponding cache bit mask*/
+ unsigned long *cbm;
+
+};
+
+struct closcbm_map {
+ unsigned long cbm;
+ unsigned int ref;
+};
+
+extern struct cacheqe root_cqe_group;
+
+/*
+ * Return cacheqos group corresponding to this container.
+ */
+static inline struct cacheqe *css_cacheqe(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct cacheqe, css) : NULL;
+}
+
+static inline struct cacheqe *parent_cqe(struct cacheqe *cq)
+{
+ return css_cacheqe(cq->css.parent);
+}
+
+/*
+ * Return cacheqe group to which this task belongs.
+ */
+static inline struct cacheqe *task_cacheqe(struct task_struct *task)
+{
+ return css_cacheqe(task_css(task, cacheqe_cgrp_id));
+}
+
+static inline void cacheqe_sched_in(struct task_struct *task)
+{
+ struct cacheqe *cq;
+ unsigned int clos;
+ unsigned int l, h;
+
+ if (!cqe_genable)
+ return;
+
+ rdmsr(IA32_PQR_ASSOC, l, h);
+
+ rcu_read_lock();
+ cq = task_cacheqe(task);
+
+ if (cq == NULL || cq->clos == h) {
+ rcu_read_unlock();
+ return;
+ }
+
+ clos = cq->clos;
+
+ /*
+ * After finding the cacheqe of the task, write the PQR for the proc.
+ * We are assuming the current core is the one it is scheduled on.
+ * In unified scheduling, write the PQR each time.
+ */
+ wrmsr(IA32_PQR_ASSOC, l, clos);
+ rcu_read_unlock();
+
+ CQE_DEBUG(("schedule in clos :0x%x,task cpu:%u, currcpu: %u,pid:%u\n",
+ clos, task_cpu(task), smp_processor_id(), task->pid));
+
+}
+
+static inline void cacheqe_sched_out(struct task_struct *task)
+{
+ unsigned int l, h;
+
+ if (!cqe_genable)
+ return;
+
+ rdmsr(IA32_PQR_ASSOC, l, h);
+
+ if (h == 0)
+ return;
+
+ /*
+ * After finding the cacheqe of the task, write the PQR for the proc.
+ * We are assuming the current core is the one it is scheduled on.
+ * Write zero when scheduling out so that we get a more accurate
+ * cache allocation.
+ */
+
+ wrmsr(IA32_PQR_ASSOC, l, 0);
+
+ CQE_DEBUG(("schedule out done cpu :%u,curr cpu:%u, pid:%u\n",
+ task_cpu(task), smp_processor_id(), task->pid));
+
+}
+
+#else
+static inline void cacheqe_sched_in(struct task_struct *task) {}
+
+static inline void cacheqe_sched_out(struct task_struct *task) {}
+
+#endif
+#endif
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 0bb1335..21290ac 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -221,6 +221,7 @@
#define X86_FEATURE_INVPCID ( 9*32+10) /* Invalidate Processor Context ID */
#define X86_FEATURE_RTM ( 9*32+11) /* Restricted Transactional Memory */
#define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */
+#define X86_FEATURE_CQE (9*32+15) /* Cache QOS Enforcement */
#define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */
#define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
#define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */
@@ -236,6 +237,9 @@
#define X86_FEATURE_XGETBV1 (10*32+ 2) /* XGETBV with ECX = 1 */
#define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS */

+/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 10 */
+#define X86_FEATURE_CQE_L3 (10*32 + 1)
+
/*
* BUG word(s)
*/
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb71ec7..6be953f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -111,8 +111,11 @@ struct cpuinfo_x86 {
int x86_cache_alignment; /* In bytes */
int x86_power;
unsigned long loops_per_jiffy;
+ /* Cache QOS Enforcement values */
+ int x86_cqe_cbmlength;
+ int x86_cqe_closs;
/* cpuid returned max cores value: */
- u16 x86_max_cores;
+ u16 x86_max_cores;
u16 apicid;
u16 initial_apicid;
u16 x86_clflush_size;
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index e27b49d..c2b0a6b 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -8,6 +8,10 @@ CFLAGS_REMOVE_common.o = -pg
CFLAGS_REMOVE_perf_event.o = -pg
endif

+ifdef CONFIG_CACHEQE_DEBUG
+CFLAGS_cacheqe.o := -DDEBUG
+endif
+
# Make sure load_percpu_segment has no stackprotector
nostackp := $(call cc-option, -fno-stack-protector)
CFLAGS_common.o := $(nostackp)
@@ -47,6 +51,7 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
perf_event_intel_uncore_nhmex.o
endif

+obj-$(CONFIG_CGROUP_CACHEQE) += cacheqe.o

obj-$(CONFIG_X86_MCE) += mcheck/
obj-$(CONFIG_MTRR) += mtrr/
diff --git a/arch/x86/kernel/cpu/cacheqe.c b/arch/x86/kernel/cpu/cacheqe.c
new file mode 100644
index 0000000..2ac3d4e
--- /dev/null
+++ b/arch/x86/kernel/cpu/cacheqe.c
@@ -0,0 +1,487 @@
+
+/*
+ * kernel/cacheqe.c
+ *
+ * Processor Cache Allocation code
+ * (Also called cache quality enforcement - cqe)
+ *
+ * Copyright (c) 2014, Intel Corporation.
+ *
+ * 2014-10-15 Written by Vikas Shivappa
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <asm/cacheqe.h>
+
+struct cacheqe root_cqe_group;
+static DEFINE_MUTEX(cqe_group_mutex);
+
+bool cqe_genable;
+
+/* ccmap maintains 1:1 mapping between CLOSid and cbm.*/
+
+static struct closcbm_map *ccmap;
+static struct cacheqe_subsys_info *cqess_info;
+
+char hsw_brandstrs[5][64] = {
+ "Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz",
+ "Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz",
+ "Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz",
+ "Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz",
+ "Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz"
+};
+
+#define cacheqe_for_each_child(child_cq, pos_css, parent_cq) \
+ css_for_each_child((pos_css), \
+ &(parent_cq)->css)
+
+#ifdef CONFIG_CACHEQE_DEBUG
+
+/*DUMP the closid-cbm map.*/
+
+static inline void cbmmap_dump(void)
+{
+
+ int i;
+
+ pr_debug("CBMMAP\n");
+ for (i = 0; i < boot_cpu_data.x86_cqe_closs; i++)
+ pr_debug("cbm: 0x%x,ref: %u\n",
+ (unsigned int)ccmap[i].cbm, ccmap[i].ref);
+
+}
+
+#else
+
+static inline void cbmmap_dump(void) {}
+
+#endif
+
+static inline bool cqe_enabled(struct cpuinfo_x86 *c)
+{
+
+ int i;
+
+ if (cpu_has(c, X86_FEATURE_CQE_L3))
+ return true;
+
+ /*
+ * Hard code the checks and values for HSW SKUs.
+ * Unfortunately, we have to check against these brand name strings only.
+ */
+
+ for (i = 0; i < 5; i++)
+ if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
+ c->x86_cqe_closs = 4;
+ c->x86_cqe_cbmlength = 20;
+ return true;
+ }
+
+ return false;
+
+}
+
+
+static int __init cqe_late_init(void)
+{
+
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ size_t sizeb;
+ int maxid = boot_cpu_data.x86_cqe_closs;
+
+ cqe_genable = false;
+
+ /*
+ * The cqe_genable hint helps decide whether the
+ * kernel has enabled cache allocation.
+ */
+
+ if (!cqe_enabled(c)) {
+
+ root_cqe_group.css.ss->disabled = 1;
+ return -ENODEV;
+
+ } else {
+
+ cqess_info =
+ kzalloc(sizeof(struct cacheqe_subsys_info),
+ GFP_KERNEL);
+
+ if (!cqess_info)
+ return -ENOMEM;
+
+ sizeb = BITS_TO_LONGS(c->x86_cqe_closs) * sizeof(long);
+ cqess_info->closmap =
+ kzalloc(sizeb, GFP_KERNEL);
+
+ if (!cqess_info->closmap) {
+ kfree(cqess_info);
+ return -ENOMEM;
+ }
+
+ sizeb = maxid * sizeof(struct closcbm_map);
+ ccmap = kzalloc(sizeb, GFP_KERNEL);
+
+ if (!ccmap)
+ return -ENOMEM;
+
+ /* Allocate the CLOS for root.*/
+ set_bit(0, cqess_info->closmap);
+ root_cqe_group.clos = 0;
+
+ /*
+ * The cbmlength is expected to be at least 1.
+ * All bits are set for the root cbm.
+ */
+
+ ccmap[root_cqe_group.clos].cbm =
+ (u32)(((u64)1 << c->x86_cqe_cbmlength) - 1);
+ root_cqe_group.cbm = &ccmap[root_cqe_group.clos].cbm;
+ ccmap[root_cqe_group.clos].ref++;
+
+ barrier();
+ cqe_genable = true;
+
+ pr_info("CQE enabled cbmlength is %u\ncqe Closs : %u ",
+ c->x86_cqe_cbmlength, c->x86_cqe_closs);
+
+ }
+
+ return 0;
+
+}
+
+late_initcall(cqe_late_init);
+
+/*
+ * Allocates a new closid from unused list of closids.
+ * Called with the cqe_group_mutex held.
+ */
+
+static int cqe_alloc_closid(struct cacheqe *cq)
+{
+ unsigned int tempid;
+ unsigned int maxid;
+ int err;
+
+ maxid = boot_cpu_data.x86_cqe_closs;
+
+ tempid = find_next_zero_bit(cqess_info->closmap, maxid, 0);
+
+ if (tempid == maxid) {
+ err = -ENOSPC;
+ goto closidallocfail;
+ }
+
+ set_bit(tempid, cqess_info->closmap);
+ ccmap[tempid].ref++;
+ cq->clos = tempid;
+
+ pr_debug("cqe : Allocated a directory.closid:%u\n", cq->clos);
+
+ return 0;
+
+closidallocfail:
+
+ return err;
+
+}
+
+/*
+ * Called with the cqe_group_mutex held.
+ */
+
+static void cqe_free_closid(struct cacheqe *cq)
+{
+
+ pr_debug("cqe :Freeing closid:%u\n", cq->clos);
+
+ ccmap[cq->clos].ref--;
+
+ if (!ccmap[cq->clos].ref)
+ clear_bit(cq->clos, cqess_info->closmap);
+
+ return;
+
+}
+
+/* Create a new cacheqe cgroup.*/
+static struct cgroup_subsys_state *
+cqe_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct cacheqe *parent = css_cacheqe(parent_css);
+ struct cacheqe *cq;
+
+ /* This is the call before the feature is detected */
+ if (!parent) {
+ root_cqe_group.clos = 0;
+ return &root_cqe_group.css;
+ }
+
+ /* To check if cqe is enabled.*/
+ if (!cqe_genable)
+ return ERR_PTR(-ENODEV);
+
+ cq = kzalloc(sizeof(struct cacheqe), GFP_KERNEL);
+ if (!cq)
+ return ERR_PTR(-ENOMEM);
+
+ /*
+ * Child inherits the ClosId and cbm from parent.
+ */
+
+ cq->clos = parent->clos;
+ mutex_lock(&cqe_group_mutex);
+ ccmap[parent->clos].ref++;
+ mutex_unlock(&cqe_group_mutex);
+
+ cq->cbm = parent->cbm;
+
+ pr_debug("cqe : Allocated cgroup closid:%u,ref:%u\n",
+ cq->clos, ccmap[parent->clos].ref);
+
+ return &cq->css;
+
+}
+
+/* Destroy an existing CAT cgroup.*/
+static void cqe_css_free(struct cgroup_subsys_state *css)
+{
+ struct cacheqe *cq = css_cacheqe(css);
+ int len = boot_cpu_data.x86_cqe_cbmlength;
+
+ pr_debug("cqe : In cacheqe_css_free\n");
+
+ mutex_lock(&cqe_group_mutex);
+
+ /* Reset the CBM for the cgroup. Should be all 1s by default! */
+
+ wrmsrl(CQECBMMSR(cq->clos), ((1 << len) - 1));
+ cqe_free_closid(cq);
+ kfree(cq);
+
+ mutex_unlock(&cqe_group_mutex);
+
+}
+
+/*
+ * Called during do_exit() syscall during a task exit.
+ * This assumes that the thread is running on the current
+ * cpu.
+ */
+
+static void cqe_exit(struct cgroup_subsys_state *css,
+ struct cgroup_subsys_state *old_css,
+ struct task_struct *task)
+{
+
+ cacheqe_sched_out(task);
+
+}
+
+static inline bool cbm_minbits(unsigned long var)
+{
+
+ unsigned long i;
+
+ /*Minimum of 2 bits must be set.*/
+
+ i = var & (var - 1);
+ if (!i || !var)
+ return false;
+
+ return true;
+
+}
+
+/*
+ * Tests if only contiguous bits are set.
+ */
+
+static inline bool cbm_iscontiguous(unsigned long var)
+{
+
+ unsigned long i;
+
+ /* Reset the least significant bit.*/
+ i = var & (var - 1);
+
+ /*
+ * We would have a set of non-contiguous bits when
+ * there is at least one zero
+ * between the most significant 1 and least significant 1.
+ * In the below '&' operation,(var <<1) would have zero in
+ * at least 1 bit position in var apart from least
+ * significant bit if it does not have contiguous bits.
+ * Multiple sets of contiguous bits won't succeed in the below
+ * case as well.
+ */
+
+ if (i != (var & (var << 1)))
+ return false;
+
+ return true;
+
+}
+
+static int cqe_cbm_read(struct seq_file *m, void *v)
+{
+ struct cacheqe *cq = css_cacheqe(seq_css(m));
+
+ pr_debug("cqe : In cqe_cqemode_read\n");
+ seq_printf(m, "0x%x\n", (unsigned int)*(cq->cbm));
+
+ return 0;
+
+}
+
+static int validate_cbm(struct cacheqe *cq, unsigned long cbmvalue)
+{
+ struct cacheqe *par, *c;
+ struct cgroup_subsys_state *css;
+
+ if (!cbm_minbits(cbmvalue) || !cbm_iscontiguous(cbmvalue)) {
+ pr_info("CQE error: minimum bits not set or non contiguous mask\n");
+ return -EINVAL;
+ }
+
+ /*
+ * Needs to be a subset of its parent.
+ */
+ par = parent_cqe(cq);
+
+ if (!bitmap_subset(&cbmvalue, par->cbm, MAX_CBM_LENGTH))
+ return -EINVAL;
+
+ rcu_read_lock();
+
+ /*
+ * Each of children should be a subset of the mask.
+ */
+
+ cacheqe_for_each_child(c, css, cq) {
+ c = css_cacheqe(css);
+ if (!bitmap_subset(c->cbm, &cbmvalue, MAX_CBM_LENGTH)) {
+ pr_debug("cqe : Children's cbm not a subset\n");
+ return -EINVAL;
+ }
+ }
+
+ rcu_read_unlock();
+
+ return 0;
+
+}
+
+static bool cbm_search(unsigned long cbm, int *closid)
+{
+
+ int maxid = boot_cpu_data.x86_cqe_closs;
+ unsigned int i;
+
+ for (i = 0; i < maxid; i++)
+ if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
+ *closid = i;
+ return true;
+ }
+
+ return false;
+
+}
+
+static int cqe_cbm_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 cbmvalue)
+{
+ struct cacheqe *cq = css_cacheqe(css);
+ ssize_t err = 0;
+ unsigned long cbm;
+ unsigned int closid;
+
+ pr_debug("cqe : In cqe_cbm_write\n");
+
+ if (!cqe_genable)
+ return -ENODEV;
+
+ if (cq == &root_cqe_group || !cq)
+ return -EPERM;
+
+ /*
+ * Need global mutex as cbm write may allocate the closid.
+ */
+
+ mutex_lock(&cqe_group_mutex);
+ cbm = (cbmvalue & IA32_CBM_MASK);
+
+ if (bitmap_equal(&cbm, cq->cbm, MAX_CBM_LENGTH))
+ goto cbmwriteend;
+
+ err = validate_cbm(cq, cbm);
+ if (err)
+ goto cbmwriteend;
+
+ /*
+ * Need to assign a CLOSid to the cgroup
+ * if it has a new cbm , or reuse.
+ * This takes care to allocate only
+ * the number of CLOSs available.
+ */
+
+ cqe_free_closid(cq);
+
+ if (cbm_search(cbm, &closid)) {
+ cq->clos = closid;
+ ccmap[cq->clos].ref++;
+
+ } else {
+
+ err = cqe_alloc_closid(cq);
+
+ if (err)
+ goto cbmwriteend;
+
+ wrmsrl(CQECBMMSR(cq->clos), cbm);
+
+ }
+
+ /*
+ * Finally store the cbm in cbm map
+ * and store a reference in the cq.
+ */
+
+ ccmap[cq->clos].cbm = cbm;
+ cq->cbm = &ccmap[cq->clos].cbm;
+
+ cbmmap_dump();
+
+cbmwriteend:
+
+ mutex_unlock(&cqe_group_mutex);
+ return err;
+
+}
+
+static struct cftype cqe_files[] = {
+ {
+ .name = "cbm",
+ .seq_show = cqe_cbm_read,
+ .write_u64 = cqe_cbm_write,
+ .mode = 0666,
+ },
+ { } /* terminate */
+};
+
+struct cgroup_subsys cacheqe_cgrp_subsys = {
+ .name = "cacheqe",
+ .css_alloc = cqe_css_alloc,
+ .css_free = cqe_css_free,
+ .exit = cqe_exit,
+ .base_cftypes = cqe_files,
+};
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 4b4f78c..a9b277a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -633,6 +633,27 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
c->x86_capability[9] = ebx;
}

+/* Additional Intel-defined flags: level 0x00000010 */
+ if (c->cpuid_level >= 0x00000010) {
+ u32 eax, ebx, ecx, edx;
+
+ cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
+
+ c->x86_capability[10] = ebx;
+
+ if (cpu_has(c, X86_FEATURE_CQE_L3)) {
+
+ u32 eax, ebx, ecx, edx;
+
+ cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
+
+ c->x86_cqe_closs = (edx & 0xffff) + 1;
+ c->x86_cqe_cbmlength = (eax & 0xf) + 1;
+
+ }
+
+ }
+
/* Extended state features: level 0x0000000d */
if (c->cpuid_level >= 0x0000000d) {
u32 eax, ebx, ecx, edx;
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 98c4f9b..a131c1e 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -53,6 +53,11 @@ SUBSYS(hugetlb)
#if IS_ENABLED(CONFIG_CGROUP_DEBUG)
SUBSYS(debug)
#endif
+
+#if IS_ENABLED(CONFIG_CGROUP_CACHEQE)
+SUBSYS(cacheqe)
+#endif
+
/*
* DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS.
*/
diff --git a/init/Kconfig b/init/Kconfig
index 2081a4d..bec92a4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -968,6 +968,28 @@ config CPUSETS

Say N if unsure.

+config CGROUP_CACHEQE
+ bool "Cache QoS Enforcement cgroup subsystem"
+ depends on X86 || X86_64
+ help
+ This option provides a framework to control the allocation of cache
+ lines when applications fill the cache.
+ It can be used to configure how much cache can be allocated to
+ different PIDs.
+
+ Say N if unsure.
+
+config CACHEQE_DEBUG
+ bool "Cache QoS Enforcement cgroup subsystem debug"
+ depends on X86 || X86_64
+ help
+ This option provides a framework to control the allocation of cache
+ lines when applications fill the cache.
+ It can be used to configure how much cache can be allocated to
+ different PIDs. Enables debug messages.
+
+ Say N if unsure.
+
config PROC_PID_CPUSET
bool "Include legacy /proc/<pid>/cpuset file"
depends on CPUSETS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c..afa2897 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2215,7 +2215,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
prepare_lock_switch(rq, next);
- prepare_arch_switch(next);
+ prepare_arch_switch(prev);
}

/**
@@ -2254,7 +2254,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
*/
prev_state = prev->state;
vtime_task_switch(prev);
- finish_arch_switch(prev);
+ finish_arch_switch(current);
perf_event_task_sched_in(prev, current);
finish_lock_switch(rq, prev);
finish_arch_post_lock_switch();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..79b9ff6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -965,12 +965,36 @@ static inline int task_on_rq_migrating(struct task_struct *p)
return p->on_rq == TASK_ON_RQ_MIGRATING;
}

+#ifdef CONFIG_X86_64
+#ifdef CONFIG_CGROUP_CACHEQE
+
+#include <asm/cacheqe.h>
+
+# define prepare_arch_switch(prev) cacheqe_sched_out(prev)
+# define finish_arch_switch(current) cacheqe_sched_in(current)
+
+#else
+
#ifndef prepare_arch_switch
# define prepare_arch_switch(next) do { } while (0)
#endif
#ifndef finish_arch_switch
# define finish_arch_switch(prev) do { } while (0)
#endif
+
+#endif
+#else
+
+#ifndef prepare_arch_switch
+# define prepare_arch_switch(prev) do { } while (0)
+#endif
+
+#ifndef finish_arch_switch
+# define finish_arch_switch(current) do { } while (0)
+#endif
+
+#endif
+
#ifndef finish_arch_post_lock_switch
# define finish_arch_post_lock_switch() do { } while (0)
#endif
--
1.9.1

