[PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and cgroup usage guide

From: Vikas Shivappa
Date: Thu Aug 06 2015 - 17:56:30 EST


Add a description of Cache Allocation Technology, an overview of the
kernel implementation, and usage of the Cache Allocation cgroup interface.

Cache allocation is a sub-feature of Resource Director Technology (RDT)
or Platform Shared resource control which provides support to control
Platform shared resources like L3 cache.

Cache Allocation Technology provides a way for software (OS/VMM) to
restrict cache allocation to a defined 'subset' of the cache, which may
overlap with other 'subsets'. The restriction applies when a line is
allocated in the cache, i.e. when new data is pulled into the cache.
Tasks are grouped into classes of service (CLOS). The OS uses MSR writes
to indicate the CLOSid of the thread when scheduling it in and to
indicate the cache capacity associated with the CLOSid. Currently cache
allocation is supported for L3 cache.

More information can be found in the Intel SDM June 2015, Volume 3,
section 17.16.

Signed-off-by: Vikas Shivappa <vikas.shivappa@xxxxxxxxxxxxxxx>
---
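Reviewer-style illustration of the size arithmetic described in section
1.4 of rdt.txt (granularity = total cache size / max cbm bits). This is
only a sketch: the helper names cbm_popcount and cbm_size_kb are
hypothetical and are not part of this patch or of any kernel interface.
The cache size and the number of cbm bits are passed in by hand.

```shell
# Hypothetical helpers, not part of the patch.

# Count the bits set in a CBM given as a hex string (e.g. 0xf0).
cbm_popcount() {
    local mask=$(( $1 ))
    local count=0
    while [ "$mask" -ne 0 ]; do
        count=$(( count + (mask & 1) ))
        mask=$(( mask >> 1 ))
    done
    echo "$count"
}

# cbm_size_kb <cbm> <max_cbm_bits> <l3_size_kb>
# Cache share in KB: granularity (size / max bits) times the bits set.
cbm_size_kb() {
    local bits
    bits=$(cbm_popcount "$1")
    echo $(( $3 / $2 * bits ))
}

cbm_size_kb 0xf 16 4096     # prints 1024 (1MB of a 4MB L3)
cbm_size_kb 0xff 16 4096    # prints 2048 (2MB, half of the cache)
```

The two calls reproduce the 0xf and 0xff cases from the documentation text.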
Documentation/cgroups/rdt.txt | 219 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 219 insertions(+)
create mode 100644 Documentation/cgroups/rdt.txt
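
The constraints that rdt.txt places on 'l3_cbm' (contiguous, a subset of
the parent's mask, a minimum number of bits) could be checked in user
space before writing the file. A minimal sketch, assuming a hypothetical
helper named cbm_check that takes the minimum length as a parameter; it
does not read any cgroup file and is not part of this patch.

```shell
# Hypothetical pre-check, not part of the patch.
# cbm_check <cbm> <parent_cbm> <min_bits>; returns 0 if the mask passes.
cbm_check() {
    local cbm=$(( $1 ))
    local parent=$(( $2 ))
    local min_bits=$3
    local low tmp bits=0

    # The mask must be non-empty and a subset of the parent's mask.
    [ "$cbm" -ne 0 ] || return 1
    [ $(( cbm & ~parent )) -eq 0 ] || return 1

    # Contiguity: adding the lowest set bit carries through a single
    # run of 1s, so the sum shares no bit with the original mask.
    low=$(( cbm & -cbm ))
    [ $(( (cbm + low) & cbm )) -eq 0 ] || return 1

    # Minimum length: count the bits that are set.
    tmp=$cbm
    while [ "$tmp" -ne 0 ]; do
        bits=$(( bits + (tmp & 1) ))
        tmp=$(( tmp >> 1 ))
    done
    [ "$bits" -ge "$min_bits" ]
}

cbm_check 0xf0 0xffff 2 && echo "0xf0: ok"
cbm_check 0x5 0xffff 1 || echo "0x5: rejected (not contiguous)"
```

The kernel remains the authority on what is accepted; a check like this
only avoids a failed write in scripts.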

diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
new file mode 100644
index 0000000..1abc930
--- /dev/null
+++ b/Documentation/cgroups/rdt.txt
@@ -0,0 +1,219 @@
+ RDT
+ ---
+
+Copyright (C) 2014 Intel Corporation
+Written by vikas.shivappa@xxxxxxxxxxxxxxx
+(based on contents and format from cpusets.txt)
+
+CONTENTS:
+=========
+
+1. Cache Allocation Technology
+ 1.1 What is RDT and Cache allocation ?
+ 1.2 Why is Cache allocation needed ?
+ 1.3 Cache allocation implementation overview
+ 1.4 Assignment of CBM and CLOS
+ 1.5 Scheduling and Context Switch
+2. Usage examples and syntax
+
+1. Cache Allocation Technology
+==============================
+
+1.1 What is RDT and Cache allocation
+------------------------------------
+
+Cache allocation is a sub-feature of Resource Director Technology (RDT)
+Allocation, or Platform Shared Resource Control, which provides support
+for controlling platform shared resources such as the L3 cache. Currently
+the L3 cache is the only resource supported in RDT. More information can
+be found in the Intel SDM June 2015, Volume 3, section 17.16.
+
+Cache Allocation Technology provides a way for software (OS/VMM) to
+restrict cache allocation to a defined 'subset' of the cache, which may
+overlap with other 'subsets'. The restriction applies when a line is
+allocated in the cache, i.e. when new data is pulled into the cache. The
+hardware is programmed through MSRs.
+
+The different cache subsets are identified by a CLOS (class of service)
+identifier, and each CLOS has a CBM (cache bit mask). The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available to each 'subset'.
+
+1.2 Why is Cache allocation needed
+----------------------------------
+
+The number of cores in today's processors keeps increasing, especially
+in large scale usage models where VMs are used, such as webservers and
+datacenters. More cores mean more threads or workloads that can run
+simultaneously. When multi-threaded applications, VMs and workloads run
+concurrently, they compete for shared resources including the L3
+cache.
+
+The architecture also allows these subsets to be changed dynamically at
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority application.
+Additionally, resources can be rebalanced to benefit system throughput.
+This technique may be useful in managing large computer systems with a
+large L3 cache, for example large servers running instances of
+webservers or database servers. In such complex systems, these subsets
+can be used for more careful placement of the available cache resources.
+
+A specific use case is an application that constantly copies data, such
+as a streaming application, and uses a large amount of cache which could
+otherwise have been used by a high priority computing application. Using
+the cache allocation feature, the streaming application can be confined
+to a smaller part of the cache and the high priority application can be
+given a larger amount of cache space.
+
+1.3 Cache allocation implementation overview
+--------------------------------------------
+
+The kernel implements a cgroup subsystem to support cache allocation. It
+is intended for root users or system administrators who need to control
+the amount of L3 cache that applications can fill.
+
+Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping. A CLOS (class
+of service) is represented by a CLOSid. The CLOSid is internal to the
+kernel and is not exposed to the user. Each cgroup has one CBM and
+represents exactly one cache 'subset'.
+
+When a child cgroup is created, it inherits the CLOSid and the CBM from
+its parent. When a user changes the default CBM of a cgroup, a new
+CLOSid may be allocated if the CBM was not in use before. Changing
+'l3_cbm' may fail with -ENOSPC once the kernel runs out of the CLOSids
+it can support. Users can create as many cgroups as they want, but the
+number of different CBMs in use at the same time is limited by the
+maximum number of CLOSids (multiple cgroups can have the same CBM). The
+kernel maintains a CLOSid <-> CBM mapping which keeps a reference count
+of the cgroups using each CLOSid.
+
+The tasks in the cgroup get to fill the part of the L3 cache represented
+by the cgroup's 'l3_cbm' file.
+
+The root directory has all available bits set in its 'l3_cbm' file by
+default.
+
+Each RDT cgroup directory has the following files. Some of them may be
+part of the common RDT framework, others are specific to RDT sub-features
+like cache allocation.
+
+ - intel_rdt.l3_cbm: The cache bitmask (CBM) is represented by this
+ file. The bitmask must be contiguous and has a minimum length of 1 or
+ 2 bits.
+
+1.4 Assignment of CBM and CLOS
+------------------------------
+
+The 'l3_cbm' needs to be a subset of the parent node's 'l3_cbm'. Any
+contiguous subset of these bits (with a minimum of 2 bits on Haswell
+server SKUs) may be set to indicate the cache capacity desired. The
+'l3_cbm' masks of two directories can overlap. The 'l3_cbm' represents
+the cache 'subset' of the cache allocation cgroup.
+
+For example, on a system with 16 max cbm bits and 4MB of L3 cache, if
+the directory has the least significant 4 bits set in its 'l3_cbm' file
+(meaning the 'l3_cbm' is just 0xf), it is allocated 1MB of the L3
+cache, which means the tasks belonging to this cache allocation cgroup
+can use that 1MB of cache to fill. With l3_cbm=0xff, it is allocated
+2MB, or half of the cache. The administrator who is using the cgroup
+can determine the total size of the cache from /proc/cpuinfo and decide
+the amount or specific percentage of cache to allocate to
+applications. Note that the granularity may differ between SKUs; the
+administrator can obtain it by dividing the total size of the cache by
+the number of max cbm bits.
+
+The cache portion defined in the CBM file is available to all tasks
+within the cgroup to fill.
+
+1.5 Scheduling and Context Switch
+---------------------------------
+
+During a context switch, the kernel writes the CLOSid (internally
+maintained by the kernel) of the cgroup to which the task belongs to
+the CPU's IA32_PQR_ASSOC MSR. The MSR is only written when the CLOSid
+for the CPU changes, in order to minimize the latency incurred during
+a context switch.
+
+The following measures keep the impact of the PQR MSR write on the
+scheduling hot path minimal:
+ - This path doesn't exist on any non-Intel platforms.
+ - On Intel platforms, it does not exist by default unless CGROUP_RDT
+ is enabled.
+ - It remains a no-op when CGROUP_RDT is enabled and the Intel hardware
+ does not support the feature.
+ - When the feature is available, no MSR write is done until the user
+ manually creates a cgroup *and* assigns a new cache mask. Since the
+ child node inherits the parent's cache mask, cgroup creation alone
+ has no impact on the scheduling hot path.
+ - Per-CPU PQR values are cached and the MSR write is only done when
+ a task with a different PQR is scheduled on the CPU. Typically, if
+ the task groups are bound to be scheduled on a set of CPUs, the
+ number of MSR writes is greatly reduced.
+
+2. Usage examples and syntax
+============================
+
+The following is an example of how a system administrator/root user can
+configure L3 cache allocation for threads.
+
+To check if cache allocation is enabled on your system:
+ $ dmesg | grep -i intel_rdt
+ intel_rdt: Intel Cache Allocation enabled
+
+ $ cat /proc/cpuinfo
+The output would have 'rdt' (if rdt is enabled) and 'cat_l3' (if L3
+cache allocation is enabled).
+
+Example 1: The following mounts the cache allocation cgroup subsystem
+and creates 2 directories.
+
+ $ cd /sys/fs/cgroup
+ $ mkdir rdt
+ $ mount -t cgroup -ointel_rdt intel_rdt /sys/fs/cgroup/rdt
+ $ cd rdt
+ $ mkdir group1
+ $ mkdir group2
+
+The following are some of the files in the directory:
+
+ $ ls
+ intel_rdt.l3_cbm
+ tasks
+
+Assume the cache is 4MB (looked up from /proc/cpuinfo) and the max cbm
+is 16 bits (indicated by the root node's cbm). The following assigns 1MB
+of cache each to group1 and group2, with no overlap between them.
+
+ $ cd group1
+ $ /bin/echo 0xf > intel_rdt.l3_cbm
+
+ $ cd ../group2
+ $ /bin/echo 0xf0 > intel_rdt.l3_cbm
+
+Assign tasks to group2:
+
+ $ /bin/echo PID1 > tasks
+ $ /bin/echo PID2 > tasks
+
+Now threads PID1 and PID2 get to fill the 1MB of cache that was
+allocated to group2. Similarly assign tasks to group1.
+
+Example 2: The commands below allocate 1MB of L3 cache on socket1 to
+group1 and 2MB of L3 cache on socket2 to group2.
+This mounts both cpuset and intel_rdt, and hence ls would list the
+files of both subsystems.
+ $ mount -t cgroup -ocpuset,intel_rdt cpuset,intel_rdt rdt/
+
+Assign the cache
+ $ /bin/echo 0xf > /sys/fs/cgroup/rdt/group1/intel_rdt.l3_cbm
+ $ /bin/echo 0xff > /sys/fs/cgroup/rdt/group2/intel_rdt.l3_cbm
+
+Assign tasks for group1 and group2
+ $ /bin/echo PID1 > /sys/fs/cgroup/rdt/group1/tasks
+ $ /bin/echo PID2 > /sys/fs/cgroup/rdt/group1/tasks
+ $ /bin/echo PID3 > /sys/fs/cgroup/rdt/group2/tasks
+ $ /bin/echo PID4 > /sys/fs/cgroup/rdt/group2/tasks
+
+Tie group1 to socket1 and group2 to socket2:
+ $ /bin/echo <cpumask for socket1> > /sys/fs/cgroup/rdt/group1/cpuset.cpus
+ $ /bin/echo <cpumask for socket2> > /sys/fs/cgroup/rdt/group2/cpuset.cpus
--
1.9.1
