[RFC PATCH V2 01/22] x86/intel_rdt: Documentation for Cache Pseudo-Locking

From: Reinette Chatre
Date: Tue Feb 13 2018 - 18:53:31 EST


Add a description of the Cache Pseudo-Locking feature and its interface,
as well as an example of its usage.

Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
---
Documentation/x86/intel_rdt_ui.txt | 229 ++++++++++++++++++++++++++++++++++++-
1 file changed, 228 insertions(+), 1 deletion(-)
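
Note for readers, not part of the patch: the documented example maps a
single page of the region. The sketch below shows one way an application
might instead read the region's "size" file and map that many bytes. It is
a minimal sketch under stated assumptions: the helper names
pseudo_lock_region_size() and pseudo_lock_map() are hypothetical, and
whether the device accepts a mapping of the full region is assumed rather
than stated by the documentation.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the byte count from a pseudo-locked region's
 * "size" file, e.g. /sys/fs/resctrl/pseudo_lock/newlock/size.
 * Returns 0 if the file cannot be opened or does not start with a number.
 */
size_t pseudo_lock_region_size(const char *size_path)
{
	unsigned long long bytes = 0;
	FILE *fp = fopen(size_path, "r");

	if (!fp)
		return 0;
	if (fscanf(fp, "%llu", &bytes) != 1)
		bytes = 0;
	fclose(fp);
	return (size_t)bytes;
}

/*
 * Hypothetical helper: map len bytes of the character device at dev_path,
 * e.g. /dev/pseudo_lock/newlock. Returns MAP_FAILED on error.
 * Assumption: the caller has already restricted its CPU affinity to cores
 * associated with the cache holding the region.
 */
void *pseudo_lock_map(const char *dev_path, size_t len)
{
	void *mapping;
	int saved_errno;
	int fd;

	if (len == 0) {
		errno = EINVAL;
		return MAP_FAILED;
	}

	fd = open(dev_path, O_RDWR);
	if (fd < 0)
		return MAP_FAILED;

	mapping = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* Per POSIX a mapping outlives its descriptor; keep mmap()'s errno. */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;

	return mapping;
}
```

A caller would pass the result of pseudo_lock_region_size() as the len
argument of pseudo_lock_map() and later release the mapping with munmap().
The documented example keeps the descriptor open until after munmap();
closing it early here relies only on general mmap() semantics.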

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index 756fd76b78a6..bb3d6fe0a3e4 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -27,7 +27,10 @@ mount options are:
L2 and L3 CDP are controlled seperately.

RDT features are orthogonal. A particular system may support only
-monitoring, only control, or both monitoring and control.
+monitoring, only control, or both monitoring and control. Cache
+pseudo-locking is a unique way of using cache control to "pin" or
+"lock" data in the cache. Details can be found in
+"Cache Pseudo-Locking".

The mount succeeds if either of allocation or monitoring is present, but
only those files and directories supported by the system will be created.
@@ -329,6 +332,149 @@ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

+Cache Pseudo-Locking
+--------------------
+CAT enables a user to specify the amount of cache space into which an
+application can fill. Cache pseudo-locking builds on the fact that a
+CPU can still read and write data pre-allocated outside its current
+allocated area on a cache hit. With cache pseudo-locking, data can be
+preloaded into a reserved portion of cache that no application can
+fill, and from that point on will only serve cache hits. The cache
+pseudo-locked memory is made accessible to user space where an
+application can map it into its virtual address space and thus have
+a region of memory with reduced average read latency.
+
+Cache pseudo-locking increases the probability that data will remain
+in the cache via carefully configuring the CAT feature and controlling
+application behavior. There is no guarantee that data is placed in
+cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
+"locked" data from cache. Power management C-states may shrink or
+power off cache. It is thus recommended to limit the processor maximum
+C-state, for example, by setting the processor.max_cstate kernel parameter.
+
+It is required that an application using a pseudo-locked region runs
+with affinity to the cores (or a subset of the cores) associated
+with the cache on which the pseudo-locked region resides. This is
+enforced by the implementation.
+
+Pseudo-locking is accomplished in two stages:
+1) During the first stage the system administrator allocates a portion
+ of cache that should be dedicated to pseudo-locking. At this time an
+ equivalent portion of memory is allocated, loaded into the allocated
+ cache portion, and exposed as a character device.
+2) During the second stage a user-space application maps (mmap()) the
+ pseudo-locked memory into its address space.
+
+Cache Pseudo-Locking Interface
+------------------------------
+Platforms supporting cache pseudo-locking will expose a new
+"/sys/fs/resctrl/pseudo_lock" directory after successful mount of the
+resctrl filesystem. Initially this directory will contain a single file,
+"avail" that contains the schemata, one line per resource, of cache region
+available for pseudo-locking.
+
+A pseudo-locked region is created by creating a new directory within
+/sys/fs/resctrl/pseudo_lock. On success two new files will appear in
+the directory:
+
+"schemata":
+ Shows the schemata representing the pseudo-locked cache region.
+ The user writes the schemata of the requested locked area to this
+ file. Only one id of a single resource is accepted, so a region can
+ only be locked from a single cache instance. The write returns
+ success once the pseudo-locked region has been set up.
+"size":
+ After successful pseudo-locked region setup this read-only file
+ will contain the size in bytes of the pseudo-locked region.
+
+Cache Pseudo-Locking Debugging Interface
+----------------------------------------
+The pseudo-locking debugging interface is enabled with
+CONFIG_INTEL_RDT_DEBUGFS and can be found in
+/sys/kernel/debug/resctrl/pseudo_lock.
+
+There is no explicit way for the kernel to test if a provided memory
+location is present in the cache. The pseudo-locking debugging interface uses
+the tracing infrastructure to provide two ways to measure cache residency of
+the pseudo-locked region:
+1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
+ from these measurements are best visualized using a hist trigger (see
+ example below). In this test the pseudo-locked region is traversed at
+ a stride of 32 bytes while hardware prefetchers, preemption, and interrupts
+ are disabled. This also provides a substitute visualization of cache
+ hits and misses.
+2) Cache hit and miss measurements using model specific precision counters if
+ available. Depending on the levels of cache on the system the following
+ tracepoints are available: pseudo_lock_l2_hits, pseudo_lock_l2_miss,
+ pseudo_lock_l3_miss, and pseudo_lock_l3_hits. WARNING: triggering this
+ measurement uses from two (for just L2 measurements) to four (for L2 and L3
+ measurements) precision counters on the system. If any other
+ measurements are in progress, the counters and their corresponding event
+ registers will be clobbered.
+
+When a pseudo-locked region is created, a new directory is created for it
+in debugfs as /sys/kernel/debug/resctrl/pseudo_lock/<newdir>. A single
+write-only file, measure_trigger, is present in this directory. The number
+written to this file selects the measurement: 1 for memory access latency,
+2 for cache hits and misses. Since the measurements are recorded with the
+tracing infrastructure, the relevant tracepoints need to be enabled before
+the measurement is triggered.
+
+Example of latency debugging interface:
+In this example a pseudo-locked region named "newlock" was created. Here is
+how we can measure the latency in cycles of reading from this region:
+# :> /sys/kernel/debug/tracing/trace
+# echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_mem_latency/trigger
+# echo 1 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_mem_latency/enable
+# echo 1 > /sys/kernel/debug/resctrl/pseudo_lock/newlock/measure_trigger
+# echo 0 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_mem_latency/enable
+# cat /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_mem_latency/hist
+
+# event histogram
+#
+# trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
+#
+
+{ latency: 456 } hitcount: 1
+{ latency: 50 } hitcount: 83
+{ latency: 36 } hitcount: 96
+{ latency: 44 } hitcount: 174
+{ latency: 48 } hitcount: 195
+{ latency: 46 } hitcount: 262
+{ latency: 42 } hitcount: 693
+{ latency: 40 } hitcount: 3204
+{ latency: 38 } hitcount: 3484
+
+Totals:
+ Hits: 8192
+ Entries: 9
+ Dropped: 0
+
+Example of cache hits/misses debugging:
+In this example a pseudo-locked region named "newlock" was created on the L2
+cache of a platform. Here is how we can obtain details of the cache hits
+and misses using the platform's precision counters.
+
+# :> /sys/kernel/debug/tracing/trace
+# echo 1 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_l2_hits/enable
+# echo 1 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_l2_miss/enable
+# echo 2 > /sys/kernel/debug/resctrl/pseudo_lock/newlock/measure_trigger
+# echo 0 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_l2_hits/enable
+# echo 0 > /sys/kernel/debug/tracing/events/pseudo_lock/pseudo_lock_l2_miss/enable
+# cat /sys/kernel/debug/tracing/trace
+
+# tracer: nop
+#
+# _-----=> irqs-off
+# / _----=> need-resched
+# | / _---=> hardirq/softirq
+# || / _--=> preempt-depth
+# ||| / delay
+# TASK-PID CPU# |||| TIMESTAMP FUNCTION
+# | | | |||| | |
+ pseudo_lock_mea-1039 [002] .... 1598.825180: pseudo_lock_l2_hits: L2 hits=4097
+ pseudo_lock_mea-1039 [002] .... 1598.825184: pseudo_lock_l2_miss: L2 miss=2
+
Examples for RDT allocation usage:

Example 1
@@ -443,6 +589,87 @@ siblings and only the real time threads are scheduled on the cores 4-7.

# echo F0 > p0/cpus

+Example of Cache Pseudo-Locking
+-------------------------------
+Lock a portion of the L2 cache from cache id 1 using CBM 0x3. The
+pseudo-locked region is exposed at /dev/pseudo_lock/newlock, which an
+application can open and pass to mmap().
+
+# cd /sys/fs/resctrl/pseudo_lock
+# cat avail
+L2:0=ff;1=ff
+# mkdir newlock
+# cd newlock
+# cat schemata
+L2:uninitialized
+# echo "L2:1=3" > schemata
+# ls -l /dev/pseudo_lock/newlock
+crw------- 1 root root 244, 0 Mar 30 03:00 /dev/pseudo_lock/newlock
+
+/*
+ * Example code to access one page of pseudo-locked cache region
+ * from user space.
+ */
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/mman.h>
+
+/*
+ * It is required that the application runs with affinity to only
+ * cores associated with the pseudo-locked region. Here the cpu
+ * is hardcoded for convenience of example.
+ */
+static int cpuid = 2;
+
+int main(int argc, char *argv[])
+{
+	cpu_set_t cpuset;
+	long page_size;
+	void *mapping;
+	int dev_fd;
+	int ret;
+
+	page_size = sysconf(_SC_PAGESIZE);
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpuid, &cpuset);
+	ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
+	if (ret < 0) {
+		perror("sched_setaffinity");
+		exit(EXIT_FAILURE);
+	}
+
+	dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
+	if (dev_fd < 0) {
+		perror("open");
+		exit(EXIT_FAILURE);
+	}
+
+	mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+		       dev_fd, 0);
+	if (mapping == MAP_FAILED) {
+		perror("mmap");
+		close(dev_fd);
+		exit(EXIT_FAILURE);
+	}
+
+	/* Application interacts with pseudo-locked memory @mapping */
+
+	ret = munmap(mapping, page_size);
+	if (ret < 0) {
+		perror("munmap");
+		close(dev_fd);
+		exit(EXIT_FAILURE);
+	}
+
+	close(dev_fd);
+	exit(EXIT_SUCCESS);
+}
+
4) Locking between applications

Certain operations on the resctrl filesystem, composed of read/writes
--
2.13.6