Re: [DESIGN] x86/intel_rdt: Intel Cache Allocation interface proposal

From: Vikas Shivappa
Date: Sun Aug 23 2015 - 14:56:12 EST



+Tony and acknowledging him..

On Sun, 23 Aug 2015, Vikas Shivappa wrote:

This document proposes an alternative interface for Intel cache
allocation, compared to the cgroup interface in the current patchset -
http://marc.info/?l=linux-kernel&m=143889814520578

More information about cache allocation can be found in the Intel SDM,
June 2015, volume 3, section 17.16.

Design overview:
---------------

The OS maintains a mapping between each task_struct and the class of
service it belongs to. This is done by adding a new field 'closid' to
task_struct. Each closid is mapped to a unique capacity bit mask (cbm),
which indicates the cache capacity associated with the closid.

During scheduling, the kernel writes this closid into IA32_PQOS_MSR to
tell the hardware which class of service (CLOS) the task belongs to.
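
The mapping described above can be modelled in userspace with a short
sketch. This is not the patchset's code: struct task, clos_cbm[],
sched_in() and the mask values are hypothetical names and numbers, the
MSR write is simulated with a plain variable, and skipping the write
when the closid is unchanged is an assumed optimisation, not something
stated in this proposal.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CLOSID 4	/* hypothetical; the real limit comes from hardware */

/* Userspace stand-in for task_struct with the proposed 'closid' field. */
struct task {
	int pid;
	uint32_t closid;
};

/* closid -> capacity bit mask (cbm); the values are illustrative only. */
static uint32_t clos_cbm[MAX_CLOSID] = { 0xfffff, 0x000ff, 0x0ff00, 0xf0000 };

/* Simulated per-CPU state standing in for the MSR contents. */
static uint32_t cur_closid;
static unsigned int msr_writes;

/* Called on context switch: program the closid only when it changes. */
static void sched_in(const struct task *next)
{
	assert(next && next->closid < MAX_CLOSID);
	if (next->closid == cur_closid)
		return;		/* assumed: skip a redundant MSR write */
	cur_closid = next->closid;
	msr_writes++;		/* a real kernel would write the MSR here */
}

/* Capacity mask the hardware would apply to the running task. */
static uint32_t active_cbm(void)
{
	return clos_cbm[cur_closid];
}
```

Switching between two tasks that share a closid costs no simulated MSR
write in this model; only a change of class does.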

It makes the following changes to the current patch series:
- Adds a kernel mode API to control cache allocations from within the
  OS.
- Does not use cgroup; instead it exposes controls through sysfs in the
  /sys/kernel directory for the administrator to configure the cache
  allocations.
- Optionally adds a control that lets a process change its own cache
  allocation within the allocations defined by the administrator.

The use cases targeted are mainly server clusters, cloud and container
based services, and HPC workloads. Users of cloud or containers get a
VM/container to run their workloads, so it is most appropriate to set
up static cache allocations for these units (VMs/containers).
Many container based products, such as Rancher/stackengine etc, are
docker based and allocate/manage resources through a centralized
orchestration/deployment tool. Container usage is growing quickly given
how easily new containers can be deployed and scaled. These cache
alloc interfaces try to build a framework that such cloud and container
based use cases can easily adopt.

Apps are restricted in how far they can control their own cache
allocations, because cache is orders of magnitude scarcer than other
resources like memory. It would quickly run out if apps, each trying to
increase its own performance, were free to claim more of it.

kernel mode API:
---------------

enum cache_resource {
	l3_shared,
};

struct cache_alloc_config {
	u32 max_cbm;
	u32 max_closid;
	unsigned long cache_size;
	int cdp_mode;
};

struct clos_cbm_table {
	unsigned long l3_cbm;
	unsigned int clos_refcnt;
};

void cache_alloc_get_info(enum cache_resource cr,
			  struct cache_alloc_config *config);

This returns the cache allocation configuration information along with
the cache size. Additional capabilities can be added, for example the
current mode, i.e. whether code data prioritization (supporting both
icache/dcache) or legacy cache alloc is in use.

int cache_alloc_set_cdpmode(bool setcdp);

By default cdp (code data prioritization, which supports allocating
code and data separately instead of a common cache allocation) is not
enabled; it can be set/reset with this API. Enabling cdp resets all
the capacity bit masks and halves the number of CLOSids.
With cdp enabled the cbm can be extended to represent data and code
capacity masks (by having two u32).
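
A minimal userspace sketch of the cdp switch described above. All names
here (struct cache_alloc_state, model_set_cdpmode(), HW_MAX_CLOSID,
HW_CBM_LEN) are hypothetical. It assumes that "reset" means setting
every mask back to the full cbm, and that disabling cdp resets the masks
as well; the proposal does not specify either detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HW_MAX_CLOSID 16	/* hypothetical hardware limit */
#define HW_CBM_LEN    20	/* hypothetical mask width in bits */

/* With cdp on, each closid carries two masks (the "two u32"). */
struct cdp_clos_entry {
	uint32_t data_cbm;	/* used for all accesses when cdp is off */
	uint32_t code_cbm;	/* only meaningful when cdp is on */
};

struct cache_alloc_state {
	bool cdp_on;
	uint32_t max_closid;
	struct cdp_clos_entry table[HW_MAX_CLOSID];
};

/* Model of cache_alloc_set_cdpmode(): returns 0 on success. */
static int model_set_cdpmode(struct cache_alloc_state *st, bool setcdp)
{
	const uint32_t full = (1u << HW_CBM_LEN) - 1;
	uint32_t i;

	assert(st);
	if (st->cdp_on == setcdp)
		return 0;	/* nothing to change */

	st->cdp_on = setcdp;
	/* Enabling cdp halves the number of usable closids. */
	st->max_closid = setcdp ? HW_MAX_CLOSID / 2 : HW_MAX_CLOSID;

	/* Assumed reset policy: every mask goes back to the full cbm. */
	for (i = 0; i < HW_MAX_CLOSID; i++) {
		st->table[i].data_cbm = full;
		st->table[i].code_cbm = full;
	}
	return 0;
}
```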

void cache_alloc_get_cbm_table(struct clos_cbm_table *cctable,
			       int size);

Returns the mapping of the current closids to the capacity bit masks.

u32 cache_alloc_set_cbm(u32 preferred_closid, u32 cbm);

This reconfigures the capacity bitmask (cbm) for a preferred closid. If
the cbm is already present in the table, the closid that already holds
it is returned instead. That way each unique cbm has one closid.
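
The lookup-before-update behaviour can be sketched as follows. This is
a userspace model with a hypothetical name (model_set_cbm()) operating
on a plain array of masks rather than struct clos_cbm_table; it does no
validation of the cbm itself and ignores clos_refcnt.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of cache_alloc_set_cbm(): if 'cbm' is already in the table,
 * return the closid that holds it and leave the table untouched;
 * otherwise store it at 'preferred_closid' and return that closid.
 */
static uint32_t model_set_cbm(uint32_t *table, uint32_t n,
			      uint32_t preferred_closid, uint32_t cbm)
{
	uint32_t i;

	assert(table && preferred_closid < n);

	for (i = 0; i < n; i++)
		if (table[i] == cbm)
			return i;	/* reuse the existing closid */

	table[preferred_closid] = cbm;
	return preferred_closid;
}
```

A caller has to compare the returned closid with the one it asked for
to find out whether its preferred entry was actually reconfigured.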

sysfs interface
---------------

This exposes files changeable by root in the /sys/kernel/cache_alloc
directory.

clos_cbm_table :
Reading - shows the max_cbm and the current snapshot of the
clos_cbm table.
Writing - the user can write 'preferred closid' 'cbm' to change an
existing entry in the set of CLOS configs.
If the user writes a bitmask that already exists, the output indicates
which closid has that cbm.
$ echo <closid> <cbm> > /sys/kernel/cache_alloc/clos_cbm_table
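
One way the written '<closid> <cbm>' pair could be parsed is sketched
below in userspace C. parse_clos_cbm() is a hypothetical helper; it
assumes the closid is decimal and the cbm is hexadecimal, which this
proposal does not specify, and an in-kernel version would use the
kernel's own string helpers rather than sscanf().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse "<closid> <cbm>" as written to clos_cbm_table.
 * Returns 0 on success, -1 if a field is missing or extra
 * non-whitespace text follows the two fields.
 */
static int parse_clos_cbm(const char *buf, uint32_t *closid, uint32_t *cbm)
{
	unsigned int id, mask;
	char trailing;

	assert(buf && closid && cbm);

	/* Exactly two conversions must succeed; a third means junk. */
	if (sscanf(buf, "%u %x %c", &id, &mask, &trailing) != 2)
		return -1;

	*closid = id;
	*cbm = mask;
	return 0;
}
```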

Alternatively, instead of clos_cbm_table, a directory could be created
for each clos, with a file cbm in each directory.

add_task :
Write only - changes the closid of any task by writing 'pid'
'closid', e.g.:
$ echo <pid> <closid> > /sys/kernel/cache_alloc/add_task

threshold_clos : Can have two values, 'lowest' and 'all'. Defaults to
'lowest'. When it is 'lowest', a process can change its own closid to a
different closid, but the new closid has to have the lowest capacity
bitmask among all the bitmasks. When it is 'all', the process can change
to any closid. The interface is indicated below.
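
The threshold_clos check could look like the sketch below. The names
(enum threshold_clos, cbm_weight(), self_change_allowed()) are
hypothetical, and "lowest capacity bitmask" is interpreted here as the
mask with the fewest set bits; that reading is an assumption, and ties
are treated as allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum threshold_clos { THRESHOLD_LOWEST, THRESHOLD_ALL };

/* Number of set bits in a cbm, i.e. how many cache chunks it covers. */
static unsigned int cbm_weight(uint32_t cbm)
{
	unsigned int n = 0;

	for (; cbm; cbm &= cbm - 1)
		n++;
	return n;
}

/* May a process move itself to 'new_closid' under the given policy? */
static bool self_change_allowed(const uint32_t *table, uint32_t n,
				uint32_t new_closid,
				enum threshold_clos mode)
{
	unsigned int w;
	uint32_t i;

	assert(table && n > 0);

	if (new_closid >= n)
		return false;
	if (mode == THRESHOLD_ALL)
		return true;

	/* 'lowest': no other closid may cover less cache than the target. */
	w = cbm_weight(table[new_closid]);
	for (i = 0; i < n; i++)
		if (cbm_weight(table[i]) < w)
			return false;
	return true;
}
```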

cdp_enable : takes 1/0 and by default is 0. Used to set cdp mode.

$ ls /sys/kernel/cache_alloc
add_task
threshold_clos
cdp_enable
clos0/
clos1/
...
closn/


The closid of a task can be viewed in the /proc/<tid>/ stats.
Tasks have closid 0 by default and inherit the parent's closid upon
fork.

prctl/ syscall interface for process to change cache alloc
----------------------------------------------------------

This lets a process change its own cache allocation. However, the amount
of change that can be made is limited. This is because L3 cache is a
very limited/scarce resource and can easily be exhausted by the first
few processes requesting more cache. It also leaves one centralized
entity, or a system-controlled mechanism usable only by the
administrator, with a higher level of control over the cache
allocation, which is more useful in the scenarios described above.

struct cat_config {
	u32 max_cbm;
	u32 max_clos;
	unsigned long chunk_size;
	int any_clos_allowed;
};

void cat_get_current_config(struct cat_config *config,
			    struct clos_cbm_table *cctable);

This returns the max clos and cbm length and the current mappings of the
closids to the capacity masks. It also returns the chunk_size, which
specifies the size of cache capacity that corresponds to one bit of cbm.
any_clos_allowed will be true if threshold_clos is set to 'all'.

prctl(PR_SET_CLOSID, <new_closid>, ... );

Cache can be allocated in terms of bytes or percentages using this
interface. One can calculate the chunk size from the APIs and then
easily convert the required size to a mask using bitmask length =
(size required / chunk size). The bitmask also gives the flexibility to
have exclusive, completely overlapping or partially overlapping cache
areas, which can be adjusted based on the requirements of the workloads.
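
The size-to-mask conversion can be sketched as below. size_to_cbm_bits()
and make_cbm() are hypothetical helpers, not part of the proposed API.
The sketch rounds the bit count up so that the request is fully
covered, which goes slightly beyond the plain division given above, and
it builds a contiguous run of bits starting at a caller-chosen offset.

```c
#include <assert.h>
#include <stdint.h>

/* Number of cbm bits needed to cover 'size' bytes, rounded up. */
static uint32_t size_to_cbm_bits(unsigned long size, unsigned long chunk_size)
{
	assert(chunk_size > 0);
	return (uint32_t)(size / chunk_size + (size % chunk_size != 0));
}

/*
 * Build a contiguous mask of 'bits' ones starting at bit 'shift'.
 * Returns 0 if the run is empty or does not fit in 'cbm_len' bits.
 */
static uint32_t make_cbm(uint32_t bits, uint32_t shift, uint32_t cbm_len)
{
	uint32_t ones;

	if (bits == 0 || cbm_len > 32 || bits > cbm_len ||
	    shift > cbm_len - bits)
		return 0;

	ones = (bits == 32) ? UINT32_MAX : ((1u << bits) - 1);
	return ones << shift;
}
```

With a hypothetical 20-bit cbm and a 1 MiB chunk size, a 3.5 MiB request
needs 4 bits. make_cbm(4, 0, 20) and make_cbm(4, 4, 20) give two
exclusive areas, while make_cbm(4, 0, 20) and make_cbm(4, 2, 20) overlap
partially; two masks overlap exactly when their bitwise AND is non-zero.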
