Re: [PATCH] x86/cqm: Cqm3 Documentation

From: Shivappa Vikas
Date: Mon Apr 03 2017 - 19:52:02 EST



On Mon, 3 Apr 2017, Vikas Shivappa wrote:

Explains the design for the interface

Explains the design for the new resctrl based cqm interface. This is a
follow-up design document, written after the requirements for the new
cqm were reviewed:
https://marc.info/?l=linux-kernel&m=148891934720489


Signed-off-by: Vikas Shivappa <vikas.shivappa@xxxxxxxxxxxxxxx>
---
Documentation/x86/intel_rdt_ui.txt | 210 ++++++++++++++++++++++++++++++++++---
1 file changed, 197 insertions(+), 13 deletions(-)

diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index d918d26..46a2efd 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -1,12 +1,13 @@
-User Interface for Resource Allocation in Intel Resource Director Technology
+User Interface for Resource Allocation and Monitoring in Intel Resource
+Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@xxxxxxxxx>
Tony Luck <tony.luck@xxxxxxxxx>

-This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
-X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
+This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
+X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3".

To use the feature mount the file system:

@@ -16,14 +17,20 @@ mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.

+The mount succeeds if either allocation or monitoring is present.
+Monitoring is enabled for each resource which has support in the
+hardware. For more details on the behaviour of the interface during
+monitoring and allocation, see the resource group sections below.

Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
-names reflect the resource names. Each subdirectory contains the
-following files:
+names reflect the resource names.
+
+Each subdirectory contains the following files with respect to
+allocation:

"num_closids": The number of CLOSIDs which are valid for this
resource. The kernel uses the smallest number of
@@ -35,15 +42,36 @@ following files:
"min_cbm_bits": The minimum number of consecutive bits which must be
set when writing a mask.

+Each subdirectory contains the following files with respect to
+monitoring:
+
+"num_rmids": The number of RMIDs which are valid for
+ this resource.
+
+"mon_enabled": Indicates if monitoring is enabled for
+ the resource.

-Resource groups
----------------
+"max_threshold_occupancy": This is specific to LLC_occupancy
+ monitoring. Provides an upper bound on
+ the threshold and is measured in bytes
+ because it's exposed to userland.
+
+Resource alloc and monitor groups (ALLOC_MON group)
+---------------------------------------------------
Resource groups are represented as directories in the resctrl file
system. The default group is the root directory. Other groups may be
created as desired by the system administrator using the "mkdir(1)"
command, and removed using "rmdir(1)".

-There are three files associated with each group:
+Users can allocate and monitor resources via these
+resource groups created in the root directory.
+
+Note that the creation of new ALLOC_MON groups is only allowed when RDT
+allocation is supported. However, the user can still monitor the root
+group when only RDT monitoring is supported.
+
+There are three files associated with each group with respect to
+resource allocation:

"tasks": A list of tasks that belongs to this group. Tasks can be
added to a group by writing the task ID to the "tasks" file
@@ -75,6 +103,56 @@ the CPU's group is used.

3) Otherwise the schemata for the default group is used.

+There are two entries associated with each group with respect to
+resource monitoring:
+
+"data": A list of all the monitored resource data available to this
+ group. This includes the monitored data for all the tasks in the
+ 'tasks' file and the cpus in the 'cpus' file. Each resource has
+ its own line and format - see the 'data' file description below
+ for details. The monitored data for
+ the ALLOC_MON group is the sum of all the data for its sub MON
+ groups.
+
+"mon_tasks": A directory in which the user can create resource monitor
+ groups (MON groups). This lets the user create a group to
+ monitor a subset of tasks in the above 'tasks' file.
+
+Resource monitor groups (MON group)
+-----------------------------------
+
+Resource monitor groups are directories inside the mon_tasks directory.
+There is one mon_tasks directory inside every ALLOC_MON group including
+the root group.
+
+MON groups help the user monitor a subset of tasks and cpus within
+the parent ALLOC_MON group.
+
+Each MON group has 3 files:
+
+"tasks": This behaves exactly as the 'tasks' file above in the ALLOC_MON
+ group with the added restriction that only a task present in the
+ parent ALLOC_MON group can be added and this automatically
+ removes the task from the "tasks" file of any other MON group.
+ When a task gets removed from parent ALLOC_MON group the task is
+ removed from "tasks" file in the child MON group.
+
+"cpus": This behaves exactly as the 'cpus' file above in the ALLOC_MON
+ group with the added restriction that only a cpu present in the
+ parent ALLOC_MON group can be added and this automatically
+ removes the cpu from the "cpus" file of any other MON group.
+ When a cpu gets removed from parent ALLOC_MON group the cpu is
+ removed from "cpus" file in the child MON group.
+
+"data": A list of all the monitored resource data available to
+ this group. Each resource has its own line and format - see
+ the 'data' file description below for details.
+
+data files - general concepts
+-----------------------------
+Each line in the file describes one resource. The line starts with
+the name of the resource, followed by monitoring data collected
+in each of the instances/domains of that resource on the system.

Schemata files - general concepts
---------------------------------
@@ -107,21 +185,26 @@ and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

-
-L3 details (code and data prioritization disabled)
---------------------------------------------------
+L3 'schemata' file format (code and data prioritization disabled)
+----------------------------------------------------------------
With CDP disabled the L3 schemata format is:

L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

-L3 details (CDP enabled via mount option to resctrl)
-----------------------------------------------------
+L3 'schemata' file format (CDP enabled via mount option to resctrl)
+------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

+L3 'data' file format (data)
+---------------------------
+When monitoring is enabled for L3 occupancy the 'data' file format is:
+
+ L3:<cache_id0>=<llc_occupancy>;<cache_id1>=<llc_occupancy>;...
+
L2 details
----------
L2 cache does not support code and data prioritization, so the
@@ -129,6 +212,8 @@ schemata format is always:

L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

+Examples for RDT allocation usage:
+
Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
@@ -212,3 +297,102 @@ Finally we move core 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache.

# echo C0 > p0/cpus
+
+Examples for RDT Monitoring usage:
+
+Example 1 (Monitor ALLOC_MON group and subset of tasks in ALLOC_MON group)
+---------
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p0 p1
+# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+# echo 5678 > p1/tasks
+# echo 5679 > p1/tasks
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Create monitor groups
+
+# cd /sys/fs/resctrl/p1/mon_tasks
+# mkdir m11 m12
+# echo 5678 > m11/tasks
+# echo 5679 > m12/tasks
+
+Fetch data (data shown in bytes)
+
+# cat m11/data
+L3:0=16234000;1=14789000
+# cat m12/data
+L3:0=14234000;1=16789000
+
+The parent group shows the aggregated data.
+
+# cat /sys/fs/resctrl/p1/data
+L3:0=30468000;1=31578000
+
+Example 2 (Monitor a task from its creation)
+---------
+On a two socket machine (one L3 cache per socket)
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p0 p1
+
+An RMID is allocated to the group once it is created and hence the <cmd>
+below is monitored from its creation.
+
+# echo $$ > /sys/fs/resctrl/p1/tasks
+# <cmd>
+
+Fetch the data
+
+# cat /sys/fs/resctrl/p1/data
+L3:0=31234000;1=31789000
+
+Example 3 (Monitor without CAT support or before creating CAT groups)
+---------
+
+Assume a system like HSW has only CQM and no CAT support. In this case
+resctrl will still mount but the user cannot create ALLOC_MON directories.
+The user can still create different MON groups within the root group and
+is thereby able to monitor all tasks including kernel threads.
+
+This can also be used to profile a job's cache footprint before
+assigning it to an allocation group.
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir mon_tasks/m1
+# echo $$ > /sys/fs/resctrl/mon_tasks/m1/tasks
+# <cmd>
+
+# cat /sys/fs/resctrl/mon_tasks/m1/data
+L3:0=31234000;1=31789000
+
+Example 4 (Monitor real time tasks)
+-----------------------------------
+
+A single socket system which has real time tasks running on cores 4-7
+and non real time tasks on other cpus. We want to monitor the cache
+occupancy of the real time threads on these cores.
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p1
+
+Move the cpus 4-7 over to p1
+# echo F0 > p1/cpus
+
+View the llc occupancy snapshot
+
+# cat /sys/fs/resctrl/p1/data
+L3:0=11234000
--
1.9.1
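
A note for anyone scripting against the interface described in the patch
above. This is only a minimal shell sketch, assuming the 'data' line format
"L3:<cache_id0>=<llc_occupancy>;<cache_id1>=<llc_occupancy>;..." and a
hexadecimal cpu bitmask in the 'cpus' file, as used in the examples. The
helper names sum_data_line and cpu_range_mask are hypothetical and are not
part of resctrl.

```shell
# Sum the per-domain values of one 'data' line,
# e.g. "L3:0=16234000;1=14789000".
# sum_data_line is a hypothetical helper name, not a resctrl command.
sum_data_line() {
    line=$1
    domains=${line#*:}            # drop the resource name, keep "0=..;1=.."
    total=0
    old_ifs=$IFS
    IFS=';'
    for entry in $domains; do     # split on ';' into "<cache_id>=<value>"
        value=${entry#*=}         # keep the part after "<cache_id>="
        total=$((total + value))
    done
    IFS=$old_ifs
    echo "$total"
}

# Print the hexadecimal bitmask covering cpus first..last (inclusive).
# cpu_range_mask is likewise a hypothetical helper name.
cpu_range_mask() {
    first=$1
    last=$2
    mask=0
    cpu=$first
    while [ "$cpu" -le "$last" ]; do
        mask=$((mask | (1 << cpu)))
        cpu=$((cpu + 1))
    done
    printf '%x\n' "$mask"
}

sum_data_line "L3:0=16234000;1=14789000"   # prints 31023000
cpu_range_mask 4 7                         # prints f0
```

The first helper gives a whole-system occupancy figure for a group by adding
up its per-cache values. The second shows which mask selects a given range of
cpus before it is written to a group's 'cpus' file.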