Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide

From: Marcelo Tosatti
Date: Mon Mar 30 2015 - 21:18:33 EST


On Thu, Mar 26, 2015 at 10:29:27PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 26, 2015 at 11:38:59AM -0700, Vikas Shivappa wrote:
> >
> > Hello Marcelo,
>
> Hi Vikas,
>
> > On Wed, 25 Mar 2015, Marcelo Tosatti wrote:
> >
> > >On Thu, Mar 12, 2015 at 04:16:07PM -0700, Vikas Shivappa wrote:
> > >>This patch adds a description of Cache allocation technology, overview
> > >>of kernel implementation and usage of CAT cgroup interface.
> > >>
> > >>Signed-off-by: Vikas Shivappa <vikas.shivappa@xxxxxxxxxxxxxxx>
> > >>---
> > >> Documentation/cgroups/rdt.txt | 183 ++++++++++++++++++++++++++++++++++++++++++
> > >> 1 file changed, 183 insertions(+)
> > >> create mode 100644 Documentation/cgroups/rdt.txt
> > >>
> > >>diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> > >>new file mode 100644
> > >>index 0000000..98eb4b8
> > >>--- /dev/null
> > >>+++ b/Documentation/cgroups/rdt.txt
> > >>@@ -0,0 +1,183 @@
> > >>+ RDT
> > >>+ ---
> > >>+
> > >>+Copyright (C) 2014 Intel Corporation
> > >>+Written by vikas.shivappa@xxxxxxxxxxxxxxx
> > >>+(based on contents and format from cpusets.txt)
> > >>+
> > >>+CONTENTS:
> > >>+=========
> > >>+
> > >>+1. Cache Allocation Technology
> > >>+ 1.1 What is RDT and CAT?
> > >>+ 1.2 Why is CAT needed?
> > >>+ 1.3 CAT implementation overview
> > >>+ 1.4 Assignment of CBM and CLOS
> > >>+ 1.5 Scheduling and Context Switch
> > >>+2. Usage Examples and Syntax
> > >>+
> > >>+1. Cache Allocation Technology(CAT)
> > >>+===================================
> > >>+
> > >>+1.1 What is RDT and CAT
> > >>+-----------------------
> > >>+
> > >>+CAT is part of Resource Director Technology (RDT), or Platform
> > >>+Shared Resource Control, which provides support for controlling
> > >>+platform shared resources such as cache. Currently cache is the
> > >>+only resource supported in RDT.
> > >>+More information can be found in the Intel SDM section 17.15.
> > >>+
> > >>+Cache Allocation Technology provides a way for software (OS/VMM)
> > >>+to restrict cache allocation to a defined 'subset' of the cache,
> > >>+which may overlap with other 'subsets'. This feature is used when
> > >>+allocating a line in the cache, i.e. when pulling new data into the
> > >>+cache. The hardware is programmed via MSRs.
> > >>+
> > >>+The different cache subsets are identified by a CLOS (class of
> > >>+service) identifier, and each CLOS has a CBM (cache bit mask). The
> > >>+CBM is a contiguous set of bits which defines the amount of cache
> > >>+resource available to each 'subset'.
> > >>+
> > >>+1.2 Why is CAT needed
> > >>+---------------------
> > >>+
> > >>+CAT enables more cache resources to be made available to
> > >>+higher priority applications, based on guidance from the execution
> > >>+environment.
> > >>+
> > >>+The architecture also allows dynamically changing these subsets during
> > >>+runtime to further optimize the performance of the higher priority
> > >>+application with minimal degradation to the low priority app.
> > >>+Additionally, resources can be rebalanced for system throughput
> > >>+benefit. (Refer to Section 17.15 in the Intel SDM)
> > >>+
> > >>+This technique may be useful in managing large computer systems
> > >>+with a large LLC, for example large servers running instances of
> > >>+webservers or database servers. In such complex systems, these
> > >>+subsets can be used for more careful placement of the available
> > >>+cache resources.
> > >>+
> > >>+The CAT kernel patch would provide a basic kernel framework for users
> > >>+to be able to implement such cache subsets.
> > >>+
> > >>+1.3 CAT implementation Overview
> > >>+-------------------------------
> > >>+
> > >>+The kernel implements a cgroup subsystem to support cache allocation.
> > >>+
> > >>+Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping.
> > >>+A CLOS (class of service) is represented by a CLOSid. The CLOSid is
> > >>+internal to the kernel and not exposed to the user. Each cgroup has
> > >>+one CBM and represents one cache 'subset'.
> > >>+
> > >>+The cgroup follows the cgroup hierarchy; mkdir and adding tasks to
> > >>+a cgroup never fail. When a child cgroup is created it inherits the
> > >>+CLOSid and the CBM from its parent. When a user changes the default
> > >>+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> > >>+used before. Changing the 'cbm' may fail with -ENOSPC once the
> > >>+kernel runs out of the maximum number of CLOSids it can support.
> > >>+Users can create as many cgroups as they want, but the number of
> > >>+different CBMs in use at the same time is restricted by the maximum
> > >>+number of CLOSids (multiple cgroups can have the same CBM).
> > >>+The kernel maintains a CLOSid <-> CBM mapping which keeps a
> > >>+reference count of the cgroups using each CLOSid.
> > >>+
> > >>+The tasks in the cgroup are allowed to fill the portion of the LLC
> > >>+represented by the cgroup's 'cbm' file.
> > >>+
> > >>+The root directory has all available bits set in its 'cbm' file by
> > >>+default.
> > >>+
> > >>+1.4 Assignment of CBM,CLOS
> > >>+--------------------------
> > >>+
> > >>+The 'cbm' needs to be a subset of the parent node's 'cbm'.
> > >>+Any contiguous subset of these bits (with a minimum of 2 bits) may
> > >>+be set to indicate the desired cache mapping. The 'cbm' of two
> > >>+directories can overlap. The 'cbm' represents the cache 'subset'
> > >>+of the CAT cgroup. For example, on a system with 16 max cbm bits,
> > >>+if a directory has the least significant 4 bits set in its 'cbm'
> > >>+file (meaning the 'cbm' is 0xf), it is allocated the right
> > >>+quarter of the last level cache, which means the tasks belonging to
> > >>+this CAT cgroup can fill the right quarter of the cache. If it
> > >>+has the most significant 8 bits set, it is allocated the left
> > >>+half of the cache (8 bits out of 16 represents 50%).
> > >>+
> > >>+The cache portion defined in the 'cbm' file is available to all
> > >>+tasks within the cgroup to fill, and these tasks are not allowed to
> > >>+allocate space in other parts of the cache.
> > >
> > >Is there a reason to expose the hardware interface rather
> > >than ratios to userspace ?
> > >
> > >Say, i'd like to allocate 20% of L3 cache to cgroup A,
> > >80% to cgroup B.
> > >
> > >Well, you'd have to expose the shared percentages between
> > >any two cgroups (that information is there in the
> > >cbm bitmaps, but not in "ratios").
> > >
> > >One problem i see with exposing cbm bitmasks is that on hardware
> > >updates that change cache size or bitmask length, userspace must
> > >recalculate the bitmaps.
> > >
> > >Another is that its vendor dependant, while ratios (plus shared
> > >information for two given cgroups) is not.
> > >
> >
> > Agree that this interface does not give options to directly allocate
> > in terms of percentages. But note that specifying bitmasks allows
> > the user to allocate overlapping cache areas, and since we use
> > cgroups we naturally follow the cgroup hierarchy. The user should be
> > able to convert the bitmasks into intended percentage or size values
> > based on the available cache size info in hooks like cpuinfo.
> >
> > We discussed more on this before in the older patches and here is
> > one thread where we discussed it for your reference -
> > http://marc.info/?l=linux-kernel&m=142482002022543&w=2
> >
> > Thanks,
> > Vikas
>
> I can't find any discussion relating to exposing the CBM interface
> directly to userspace in that thread?
>
> Cpu.shares is written in ratio form, which is much more natural.
> Do you see any advantage in maintaining the
>
> (ratio -> cbm bitmasks)
>
> translation in userspace rather than in the kernel ?
>
> What about something like:
>
>
> root cgroup
> / \
> / \
> / \
> cgroupA-80 cgroupB-30
>
>
> So that whatever exceeds 100% is the ratio of cache
> shared at that level (cgroup A and B share 10% of cache
> at that level).
>
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
>
> cpu - the cpu.shares parameter determines the share of CPU resources
> available to each process in all cgroups. Setting the parameter to 250,
> 250, and 500 in the finance, sales, and engineering cgroups respectively
> means that processes started in these groups will split the resources
> with a 1:1:2 ratio. Note that when a single process is running, it
> consumes as much CPU as necessary no matter which cgroup it is placed
> in. The CPU limitation only comes into effect when two or more processes
> compete for CPU resources.

Vikas,

I see the following resource specifications from the POV of a user/admin:

1) Ratios.

X%/Y%, as discussed above.
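
For illustration only (not part of the patchset): a sketch of how a
userspace tool might turn percentage ratios into contiguous CBMs, and
how the shared percentage between two cgroups falls out of the
bitmasks. The 16-bit maximum CBM width is an assumed example value;
real widths would come from CPUID.

```python
def ratio_to_cbm(percent, max_cbm_bits=16, offset=0):
    """Build a contiguous cache bit mask covering roughly `percent`
    of the cache, starting at bit `offset` (CAT requires contiguous
    masks of at least 2 bits)."""
    nbits = max(2, round(max_cbm_bits * percent / 100))
    nbits = min(nbits, max_cbm_bits - offset)
    return ((1 << nbits) - 1) << offset

def shared_percent(cbm1, cbm2, max_cbm_bits=16):
    """Percentage of the cache on which two cgroups' CBMs overlap."""
    return 100.0 * bin(cbm1 & cbm2).count("1") / max_cbm_bits

# 20%/80% split on an assumed 16-bit CBM, non-overlapping:
cbm_a = ratio_to_cbm(20)            # 3 bits  -> 0x0007 (~19%)
cbm_b = ratio_to_cbm(80, offset=3)  # 13 bits -> 0xfff8 (~81%)
```

Note the rounding: ratios can only be honored to the granularity of
one CBM bit.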

2) Specific kilobyte values.

In accord with the rest of cgroups, allow specific kilobyte
specification. See limit_in_bytes, for example, from

https://www.kernel.org/doc/Documentation/cgroups/memory.txt

Of course you would have to convert to way units, but I see
two use-cases here:

- User wants application to not reclaim more than
given number of kilobytes of LLC cache.
- User wants application to be guaranteed a given
amount of kilobytes of LLC, even across processor changes.

Again, some precision is lost with LLC.
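
As a sketch of that conversion (the cache geometry numbers here are
made up; in practice they would come from CPUID): a kilobyte limit
has to be rounded up to whole way units.

```python
import math

def kb_to_cbm(limit_kb, cache_size_kb, num_ways):
    """Round a kilobyte limit up to whole cache ways and return a
    contiguous CBM covering that many ways (CAT minimum: 2 bits)."""
    way_kb = cache_size_kb / num_ways
    ways = max(2, min(num_ways, math.ceil(limit_kb / way_kb)))
    return (1 << ways) - 1, ways

# assumed: 20 MB L3 with 16 ways -> 1280 KB per way
cbm, ways = kb_to_cbm(4096, 20480, 16)  # 4 MB -> 4 ways, cbm 0xf
```

A 4 MB request is rounded up to 4 ways (5120 KB) here; that rounding
to way granularity is exactly the precision loss mentioned above.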

3) Per-CPU differentiation

The current patchset deals with the following use-case suboptimally:


CPU1-4 CPU5-8

die1 die2



* Task groupA is isolated to CPU-8 (die2).
* Task groupA has 50% cache reserved.
* Task groupB can reclaim into the remaining 50% of the cache
of die2.
* Task groupB can reclaim into 100% of the cache
of die1.

I suppose this is a common scenario which is not handled by
the current patchset (you would have task groupB use only 50%
of cache of die1).
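
To make the gap concrete, here is the difference between what the
scenario wants and what a single system-wide CBM per cgroup can
express (the 16-bit masks and group/die names are illustrative only):

```python
FULL    = 0xffff  # all 16 assumed ways
HALF_HI = 0xff00  # upper half of the cache
HALF_LO = 0x00ff  # lower half of the cache

# desired per-die allocation for the scenario above
desired = {
    "die1": {"groupB": FULL},                        # groupA never runs here
    "die2": {"groupA": HALF_HI, "groupB": HALF_LO},  # 50/50 split
}

# with one CBM per cgroup (current patchset), groupB carries its
# die2 mask onto die1 as well:
current = {"groupA": HALF_HI, "groupB": HALF_LO}
wasted_on_die1 = desired["die1"]["groupB"] & ~current["groupB"] & FULL
# -> the upper 8 ways of die1 go unused by groupB
```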
