Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management

From: Marcelo Tosatti
Date: Mon Aug 24 2015 - 09:08:43 EST


On Sun, Aug 23, 2015 at 11:47:49AM -0700, Vikas Shivappa wrote:
>
>
> On Fri, 21 Aug 2015, Marcelo Tosatti wrote:
>
> >On Thu, Aug 20, 2015 at 05:06:51PM -0700, Vikas Shivappa wrote:
> >>
> >>
> >>On Mon, 17 Aug 2015, Marcelo Tosatti wrote:
> >>
> >>>Vikas, Tejun,
> >>>
> >>>This is an updated interface. It addresses all comments made
> >>>so far and also covers all use-cases the cgroup interface
> >>>covers.
> >>>
> >>>Let me know what you think. I'll proceed to writing
> >>>the test applications.
> >>>
> >>>Usage model:
> >>>------------
> >>>
> >>>This document details how CAT technology is
> >>>exposed to userspace.
> >>>
> >>>Each task has a list of task cache reservation entries (TCRE list).
> >>>
> >>>The init process is created with empty TCRE list.
> >>>
> >>>There is a system-wide unique ID space, each TCRE is assigned
> >>>an ID from this space. ID's can be reused (but no two TCREs
> >>>have the same ID at one time).
> >>>
> >>>The interface accommodates transient and independent cache allocation
> >>>adjustments from applications, as well as static cache partitioning
> >>>schemes.
> >>>
> >>>Allocation:
> >>>Usage of the system calls requires CAP_SYS_CACHE_RESERVATION capability.
> >>>
> >>>A configurable percentage is reserved to tasks with empty TCRE list.
> >
> >Hi Vikas,
> >
> >>And how do you think you will do this without a system controlled
> >>mechanism ?
> >>Everytime in your proposal you include these caveats
> >>which actually mean to include a system controlled interface in the
> >>background ,
> >>and your below interfaces make no mention of this really ! Why do we
> >>want to confuse ourselves like this ?
> >>syscall only interface does not seem to work on its own for the
> >>cache allocation scenario. This can only be a nice to have interface
> >>on top of a system controlled mechanism like cgroup interface. Sure
> >>you can do all the things you did with cgroup with the same with
> >>syscall interface but the point is what are the use cases that cant
> >>be done with this syscall only interface. (ex: to deal with cases
> >>you brought up earlier like when an app does cache intensive work
> >>for some time and later changes - it could use the syscall interface
> >>to quickly relinquish the cache lines or change a clos associated
> >>with it)
> >
> >All use cases can be covered with the syscall interface.
> >
> >* How to convert from cgroups interface to syscall interface:
> >Cgroup: Partition cache in cgroups, add tasks to cgroups.
> >Syscall: Partition cache in TCRE, add TCREs to tasks.
> >
> >You build the same structure (task <--> CBM) either via syscall
> >or via cgroups.
> >
> >Please be more specific, can't really see any problem.
>
> Well at first you mentioned that the cgroup does not support
> specifying size in bytes and percentage and then you eventually
> agreed to my explanation that you can easily write a bash script to
> do the same with cgroup bitmasks. (although i had to go through the
> pain of reading all the proposals you sent without giving a chance
> to explain how it can be used or so).

Yes, we could write the (bytes --to--> cache ways) conversion in
userspace. But since we are going for a different interface, we can
fix that problem in the kernel as well.

> Then you had a confusion in
> how I explained the co mounting of the cpuset and intel_rdt and
> instead of asking a question or pointing out issue, you go ahead and
> write a whole proposal and in the end even say will cook a patch
> before I even try to explain you.

The syscall interface is more flexible.

Why not use a more flexible interface if possible?

> And then you send proposals after proposals
> which varied from
> modifying the cgroup interface itself to slightly modifying cgroups

Yes, I am trying to solve the problems our customers will be facing in
the field. So these proposals are not coming out of thin air.

> and adding syscalls and then also automatically controlling the
> cache alloc (with all your extend mask capabilities) without
> understanding what the framework is meant to do or just asking or
> specifically pointing out any issues in the patch.

There is a practical problem the "extension" of mask capabilities is
solving. Check item 6 of the attached text document.

> You had been
> reviewing the cgroup patches for many versions unlike others who
> accepted they need time to think about it or accepted that they
> may not understand the feature yet.
> So what is that changed in the patches that is not acceptable now ?

Tejun proposed a syscall interface. He is right, a syscall interface
is much more flexible. Blame him.

> Many things have been bought up multiple times even after you agreed
> to a solution already proposed. I was only suggesting that this can
> be better and less confusing if you point out the exact issue in the
> patch just like how Thomas or all of the reviewers have been doing.
>
> With the rest of the reviewers I either fix the issue or point out a
> flaw in the review.
> If you dont like cgroup interface now, would be best to indicate or
> discuss the specifics of the shortcomings clearly before sending
> new proposals.
> That way we can come up with an interface which does
> better and works better in linux if we can. Otherwise we may just
> end up adding more code which just does the same thing?
>
> However I have been working on an alternate interface as well and
> have just sent it for your ref.

Problem: locking.

> >>I have repeatedly listed the use cases that can be dealt with , with
> >>this interface. How will you address the cases like 1.1 and 1.2 with
> >>your syscall only interface ?
> >
> >Case 1.1:
> >--------
> >
> > 1.1> Exclusive access: The task cannot give *itself* exclusive
> >access from using the cache. For this it needs to have visibility of
> >the cache allocation of other tasks and may need to reclaim or
> >override others cache allocs which is not feasible (isnt that the
> >ability of a system managing agent?).
> >
> >Answer: if the application has CAP_SYS_CACHE_RESERVATION, it can
> >create cache allocation and remove cache allocation from
> >other applications. So only the administrator could do it.
>
> The 1.1 also includes an other use case(lets call this 1.1.1) which
> indicates that the apps would just allocate a lot of cache and soon
> run out space. Hence the first few apps would get most of the cache
> (would get *most* even if you reserve some % of cache for others -
> and again thats difficult to assign to the others).
>
> Now if you say you want to put a threshold limit for each app to
> self allocate , then that turns out to an interface that can easily
> built on top of the existing cgroup interface. iow its just a
> control you are giving the app on top of an existing admin
> controlled interface (like cgroup).the threshold can just be the cbm
> of the cgroup which the tasks belong to. so now the apps can self
> allocate or reduce the allocation to something which is a subset the
> cgroup has (thats one way..)

Yes.

> Also the issue was to discuss whether self allocation or process
> deciding its own allocation vs. system controlled mechanism. It
> wasnt clear what syscalls among the ones need to have this sys_cap
> and which ones would not.
>
> >
> >Case 1.2 answer below.
> >
> >>So we expect all the millions of apps
> >>like SAP, oracle etc and etc and all the millions of app developers
> >>to magically learn our new syscall interface and also cooperate
> >>between themselves to decide a cache allocation that is agreeable to
> >>all ? (which btw the interface doesnt list below how to do it) and
> >
> >They don't have to: the administrator can use "cacheset" application.
>
> the "cacheset" wasnt mentioned before. Now you are talking about a
> tool which is also doing a centralized or system controlled
> allocation.

Not me. Tejun proposed that.

> This is where I pointed out earlier that its best to
> keep the discussion to the point and not randomly expand the scope
> to a variety of other options. If you want to build a taskset like
> tool thats again just doing a system controlled interface or a
> centralized control mechanism which is what cgroup does. Then it
> just comes down to whether cgroup interface or the cacheset is more
> easy or intuitive. And why would the already widely used interface
> for resource allocation be not intuitive ? - we first need to answer
> that may be ? or any really required features it lacks ?
> Also give that dockers use cgroups for resource allocations , it
> seems most fit and thats the feedback i received repeatedly in
> linuxcon as well.
>
> >
> >If an application wants to control the cache, it can.
> >
> >>then by some godly powers the noisly neighbour will decide himself
> >>to give up the cache ?
> >
> >I suppose you imagine something like this:
> >http://arxiv.org/pdf/1410.6513.pdf
> >
> >No, the syscall interface does not need to care about that because:
> >
> >* If you can set cache (CAP_SYS_CACHE_RESERVATION capability),
> >you can remove cache reservation from your neighbours.
> >
> >So this problem does not exist (it assumes participants are
> >cooperative).
> >
> >There is one confusion in the argument for cases 1.1 and case 1.2:
> >that applications are supposed to include in their decision of cache
> >allocation size the status of the system as a whole. This is a flawed
> >argument. Please point specifically if this is not the case or if there
> >is another case still not covered.
>
> Like i said it wasnt clear what syscalls required this capability.
> also the 1.1.1 still breaks this , or iow the apps needs to have
> lesser control than a system/admin controlled allocation.

We should separate access control from the ability of applications to
change cache allocations.

Problem-1: Separation of percentages of totality of cache to particular users

This assumes each user has credentials to allocate/reserve cache. You
don't want to give user A more than 30% of cache allocation because
user B requires 80% cache to achieve his performance requirements.

Problem-2: The decision to allocate cache is tied to application
initialization / destruction, and application initialization is
essentially random from the POV of the system (the events which trigger
the execution of the application are not visible from the system).

Think of a machine running two different servers: one database
whose requests arrive following a Poisson distribution, averaging 30
requests per hour, with every request taking 1 minute.

One httpd server with nearly constant load.

Without cache reservations, database requests take 2 minutes.
That is not acceptable for the database clients.
But with cache reservation, database requests take 1 minute.

You want to maximize performance of httpd and database requests.
What do you do? You allow the database server to perform cache
reservation once a request comes in, and to undo the reservation
once the request is finished.

It's impossible to perform this with a centralized interface.
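
To put rough numbers on this example (my own arithmetic, not a
measurement): by Little's law the average number of requests in flight
is the arrival rate times the time each request takes. The helper name
below is made up for illustration.

```c
/*
 * Average number of requests in flight (Little's law):
 * arrival rate multiplied by the time each request takes.
 * Hypothetical helper, not part of any proposed interface.
 */
static double offered_load(double requests_per_hour,
			   double minutes_per_request)
{
	return (requests_per_hour * minutes_per_request) / 60.0;
}
```

With the figures above, offered_load(30, 1) is 0.5 and
offered_load(30, 2) is 1.0. With reservations in effect, on average only
half a request is in flight, so a reservation held permanently for the
database would sit unused at least half of the time, and that is cache
httpd could have been using.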

---

The point of the syscall interface is to handle problem-2 by allowing
applications to modify cache allocation themselves.

And it ignores problem-1 (which is similar to case 1.1.1). Yes, if an
application can allocate 80% of cache, hurting performance of
other applications, then it can.

There is nothing we can do to solve it. We can allow it; if the
administrator decides not to, he can remove CAP_SYS_CACHE... from users
to avoid the problem.

So problem 1.1.1 is dealt with.

> >It would be possible to partition the cache into watermarks such
> >as:
> >
> >task group A - can reserve up to 20% of cache.
> >task group B - can reserve up to 25% of cache.
> >task group C - can reserve 50% of cache.
> >
> >But i am not sure... Tejun, do you think that is necessary?
> >(CAP_SYS_CACHE_RESERVATION is good enough for our usecases).
> >
> >> (that should be first ever app to not request
> >>more resource in the world for himself and hurt his own performance
> >>- they surely dont want to do social service !)
> >>
> >>And how do we do the case 1.5 where the administrator want to assign
> >>cache to specific VMs in a cloud etc - with the hypothetical syscall
> >>interface we now should expect all the apps to do the above and now
> >>they also need to know where they run (what VM , what socket etc)
> >>and then decide and cooperate an allocation : compare this to a
> >>container environment like rancher where today the admin can
> >>conveniently use docker underneath to allocate mem/storage/compute to
> >>containers and easily extend this to include shared l3.
> >>
> >>http://marc.info/?l=linux-kernel&m=143889397419199
> >>
> >>without addressing the above the details of the interface below is irrelevant -
> >
> >You are missing the point, there is supposed to be a "cacheset"
> >program which will allow the admin to setup TCRE and assign them to
> >tasks.
> >
> >>Your initial request was to extend the cgroup interface to include
> >>rounding off the size of cache (which can easily be done with a bash
> >>script on top of cgroup interface !) and now you are proposing a
> >>syscall only interface ? this is very confusing and will only
> >>unnecessarily delay the process without adding any value.
> >
> >I suppose you are assuming that its necessary for applications to
> >set their own cache. This assumption is not correct.
> >
> >Take a look at Tuna / sched_getaffinity:
> >
> >https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html
> >
> >
> >>however like i mentioned the syscall interface or user/app being
> >>able to modify the cache alloc could be used to address some very
> >>specific use cases on top an existing system managed interface. This
> >>is not really a common case in cloud or container environment and
> >>neither a feasible deployable solution.
> >>Just consider the millions of apps that have to transition to such
> >>an interface to even use it - if thats the only way to do it, thats
> >>dead on arrival.
> >
> >Applications should not rely on interfaces that are not upstream.
> >
> >Is there an explicit request or comment from users about
> >their difficulty regarding a change in the interface?
>
> HOwever there needs to be a reasoning on why the cgroup interface is
> not good as well?

The main problem of the cgroup interface, to me, is problem-2 above.

> >>Also please donot include kernel automatically adjusting resources
> >>in your reply as thats totally irrelevant and again more confusing
> >>as we have already exchanged some >100 emails on this same patch
> >>version without meaning anything so far.
> >>
> >>The debate is purely between a syscall only interface and a system
> >>manageable interface(like cgroup where admin or a central entity
> >>controls the resources). If not define what is it first before going
> >>into details.
> >
> >See the Tuna / taskset page.
> >The administrator could, for example, use "cacheset" from within
> >the scripts which initialize the applications.
> >Then having control over those scripts, he can view them as a "unified
> >system control interface".
> >
> >Problems with cgroup interface:
> >
> >1) Global IPI on CBM <---> task change does not scale.
>
> DOnt understand this . how is the IPI related to cgroups. A task is
> associated with one closid and it needs to carry that along where
> ever it goes. it supports the use case i explain in (basicaly
> cloud/container and server user cases mainly)

Think of problem-2 above and the following:

/*
 * cbm_update_all() - Update the cache bit mask for all packages.
 */
static inline void cbm_update_all(u32 closid)
{
	on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid, 1);
}

This needs to go.

> http://marc.info/?l=linux-kernel&m=144035279828805
>
> >2) Syscall interface specification is in kbytes, not
> >cache ways (which is what must be recorded by the OS
> >to allow migration of the OS between different
> >hardware systems).
>
> I thought you agreed that a simple bash script can convert the
> bitmask to bytes in chunk size. ALl you need is the cache size from
> /proc/cpuinfo and the max cbm bits in the root intel_rdt cgroup.

Yes, but that requires every user of the interface who considers
the possibility of moving to different platforms to perform
that conversion.

Why force the user (or the programmer) to maintain a quantity
that is not usable in any of those environments?

So the above facts mean it's preferred to expose size in bytes.

Yes, I had agreed to "fix" this issue in userspace, but since there are
discussions to change the interface, why not fix that problem in the
kernel as well, rather than in userspace?

> And
> its incorrect to say you can do it it bytes. Its only chunk size
> really. (chunk size = cache size / max cbm bits).

Yes, you can do it in bytes. It's written in the syscall
proposal how you can do that.

> Apart from that the mask gives you the ability to decide an
> exclusive, overlapping, or partially overlapping and partially
> exclusive masks.
>
> >3) Compilers are able to configure cache optimally for
> >given ranges of code inside applications, easily,
> >if desired.
>
> This is again not possible because of 1.1.1. And can be still done
> in a restricted fashion like i explained above.

1.1.1 is not a blocker. If it were, a similar argument would
be valid for sys_sched_setaffinity:

It is not possible to allow applications to set their own affinity
because two applications might set affinity for the same pCPU which
affects performance of both.

But still, applications with CAP_SYS_NICE are allowed to set their
own affinity.

> >4) Does not allow proper usage of shared caches between
> >applications. Think of the following scenario:
> > * AppA has threads which are created/destroyed,
> > but once initialized, want cache reservation.
> > * How is AppA going to coordinate with cgroups
> > system to initialized/shutdown cgroups?
> >
>
> Yes , the interface does not support apps to self control cache
> alloc. That is accepted. But this is not the main use case we target
> like i explained above and in the link i provided for the new
> proposal and before.. So its not very important as such.
> Also worst case, you can easily design a syscall for apps to self
> control keeping the cgroup alloc for the task as max threshold.
> So lets nail this list(of cgroup flaws you list) down before
> thinking about changes ? - this should have been the first things in
> the email really is what i was mentioning.
>
> >I started writing the syscall interface on top of your latest
> >patchset yesterday (it should be relatively easy, given
> >that most of the low-level code is already there).
> >
> >Any news on the data/code separation ?
>
> Will send them this week , untested partially due to h/w not yet
> being with me. Have been ready , but was waiting to see the
> discussions on this patch as well.
>
> more response below -
>
> >
> >
> >>Thanks,
> >>Vikas
> >>
> >>>
> >>>On fork, the child inherits the TCR from its parent.
> >>>
> >>>Semantics:
> >>>Once a TCRE is created and assigned to a task, that task has
> >>>guaranteed reservation on any CPU where its scheduled in,
> >>>for the lifetime of the TCRE.
> >>>
> >>>A task can have its TCR list modified without notification.
>
> Whey does the task need a list of allocations ? A task is tagged
> with only one closid and it needs to carry that along. Even if the
> list is for each socket, that needs be an array.

See item 5 of the attached text.

> >>>FIXME: Add a per-task flag to not copy the TCR list of a task but delete
> >>>all TCR's on fork.
> >>>
> >>>Interface:
> >>>
> >>>enum cache_rsvt_flags {
> >>> CACHE_RSVT_ROUND_DOWN = (1 << 0), /* round "kbytes" down */
> >>>};
>
> Not really optional is it ? the chunk size is decided by the h/w sku
> and you can only allocate in that chunk size, not any bytes.

Specify cache reservation in bytes.
By default, OS rounds bytes to cache ways.
This flag allows OS to round bytes down to cache ways.

> >>>
> >>>enum cache_rsvt_type {
> >>> CACHE_RSVT_TYPE_CODE = 0, /* cache reservation is for code */
> >>> CACHE_RSVT_TYPE_DATA, /* cache reservation is for data */
> >>> CACHE_RSVT_TYPE_BOTH, /* cache reservation is for code and data */
> >>>};
> >>>
> >>>struct cache_reservation {
> >>> unsigned long kbytes;
>
> should be rounded off to chunk size really. And like i explained
> above the masks let you do the exclusive/partially adjustable
> percentage exclusive easily (say 20% shared and rest exclusive) or a
> tolerated amount of shared...

Please read the sentence above.

> >>> int type;
> >>> int flags;
> >>> int tcrid;
> >>>};
> >>>
> >>>The following syscalls modify the TCR of a task:
> >>>
> >>>* int sys_create_cache_reservation(struct cache_reservation *rsvt);
> >>>DESCRIPTION: Creates a cache reservation entry, and assigns
> >>>it to the current task.
>
> So now i assume this is what the task can do itself and the ones
> below which pid need the capability ? Again this breaks 1.1.1 like i
> said above and any way to restrict to a threshold max alloc can just
> easily be done on top of cgroup alloc keeping the cgroup alloc as
> max threshold.

Not a problem, see the sys_sched_setaffinity argument.

> >>>returns -ENOMEM if not enough space, -EPERM if no permission.
> >>>returns 0 if reservation has been successful, copying actual
> >>>number of kbytes reserved to "kbytes", type to type, and tcrid.
> >>>
> >>>* int sys_delete_cache_reservation(struct cache_reservation *rsvt);
> >>>DESCRIPTION: Deletes a cache reservation entry, deassigning it
> >>>from any task.
> >>>
> >>>Backward compatibility for processors with no support for code/data
> >>>differentiation: by default code and data cache allocation types
> >>>fallback to CACHE_RSVT_TYPE_BOTH on older processors (and return the
> >>>information that they have done so via "flags").
>
> Need to address the change of mode which is dynamic

There is no change of mode in the following case:

I/D capable processor: boots with I/D enabled and remains that way.
not I/D capable processor: boots with I/D disabled and remains that
way.

Do you see any problem with this scheme?

> and it may be
> more intutive to do that in cgroups for the reasons i said above and
> taking allocation back from a process may need a call back, thats
> why it may best be to design an interface where the apps know their
> control is very limited and within the purview of the already set
> allocations by root user.
>
> Please check the new proposal which tries to addresses the comments
> i made mostly -
> http://marc.info/?l=linux-kernel&m=144035279828805
> The framework still lets any kernel mode or high level user mode
> library developer build a cacheset like tool or others on top of it
> if that needs to be more custom and more intutive.
>
> Thanks,
> Vikas

A major problem of any filesystem based interface, pointed out by Tejun,
is that locking must be performed by the user.

With the syscall interface, the kernel can properly handle locking for
the user. We can use RCU to nicely deal with locking in the kernel.

One issue you are trying to deal with, that i ignored, is Problem-1:
division of cache allocability per user.

> >>>* int sys_attach_cache_reservation(pid_t pid, unsigned int tcrid);
> >>>DESCRIPTION: Attaches cache reservation identified by "tcrid" to
> >>>the task identified by pid.
> >>>returns 0 if successful.
> >>>
> >>>* int sys_detach_cache_reservation(pid_t pid, unsigned int tcrid);
> >>>DESCRIPTION: Detaches cache reservation identified by "tcrid" from
> >>>the task identified by pid.
> >>>
> >>>The following syscalls list the TCRs:
> >>>* int sys_get_cache_reservations(size_t size, struct cache_reservation list[]);
> >>>DESCRIPTION: Return all cache reservations in the system.
> >>>Size should be set to the maximum number of items that can be stored
> >>>in the buffer pointed to by list.
> >>>
> >>>* int sys_get_tcrid_tasks(unsigned int tcrid, size_t size, pid_t list[]);
> >>>DESCRIPTION: Return which pids are associated to tcrid.
> >>>
> >>>* sys_get_pid_cache_reservations(pid_t pid, size_t size,
> >>> struct cache_reservation list[]);
> >>>DESCRIPTION: Return all cache reservations associated with "pid".
> >>>Size should be set to the maximum number of items that can be stored
> >>>in the buffer pointed to by list.
> >>>
> >>>* sys_get_cache_reservation_info()
> >>>DESCRIPTION: ioctl to retrieve hardware info: cache round size, whether
> >>>code/data separation is supported.
> >>>
> >>>
> >
1) Global IPI on CBM <---> task change does not scale.

/*
 * cbm_update_all() - Update the cache bit mask for all packages.
 */
static inline void cbm_update_all(u32 closid)
{
	on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid, 1);
}

Consider a machine with 32 sockets.

2) Syscall interface specification is in kbytes, not
cache ways (which is what must be recorded by the OS
to allow migration of the OS between different
hardware systems).

3) Compilers are able to configure cache optimally for
given ranges of code inside applications, easily,
if desired.

4) Problem-2: The decision to allocate cache is tied to application
initialization / destruction, and application initialization is
essentially random from the POV of the system (the events which trigger
the execution of the application are not visible from the system).

Think of a machine running two different servers: one database
whose requests arrive following a Poisson distribution, averaging 30
requests per hour, with every request taking 1 minute.

One httpd server with nearly constant load.

Without cache reservations, database requests take 2 minutes.
That is not acceptable for the database clients.
But with cache reservation, database requests take 1 minute.

You want to maximize performance of httpd and database requests.
What do you do? You allow the database server to perform cache
reservation once a request comes in, and to undo the reservation
once the request is finished.

It's impossible to perform this with a centralized interface.

5) Modify scenario 2 above as follows: each database request
is handled by two newly created threads, and they share a certain percentage
of data cache, and a certain percentage of code cache.

So the dispatcher thread, on arrival of request, has to:

- create data cache reservation = tcrid-A.
- create code cache reservation = tcrid-B.
- create thread-1.
- assign tcrid-A and B to thread-1.
- create thread-2.
- assign tcrid-A and B to thread-2.
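
The steps above can be sketched with a small in-memory model. This is
not the proposed syscall interface itself: model_create() and
model_attach() are made-up stand-ins for sys_create_cache_reservation()
and sys_attach_cache_reservation(), the table sizes are arbitrary, and
for simplicity the model does not attach a new reservation to its
creator.

```c
#include <stddef.h>

enum cache_rsvt_type {
	CACHE_RSVT_TYPE_CODE = 0,
	CACHE_RSVT_TYPE_DATA,
	CACHE_RSVT_TYPE_BOTH,
};

#define MAX_TCRE	8	/* arbitrary limits for the model */
#define MAX_PER_TASK	4

struct model_tcre {
	unsigned long kbytes;
	int type;
	int in_use;
};

/* A task (thread) with its list of attached reservation IDs. */
struct model_task {
	int tcrid[MAX_PER_TASK];
	size_t nr;
};

static struct model_tcre tcre_table[MAX_TCRE];

/* Stand-in for sys_create_cache_reservation(): returns a tcrid or -1. */
static int model_create(unsigned long kbytes, int type)
{
	for (int id = 0; id < MAX_TCRE; id++) {
		if (!tcre_table[id].in_use) {
			tcre_table[id] = (struct model_tcre){ kbytes, type, 1 };
			return id;
		}
	}
	return -1;
}

/* Stand-in for sys_attach_cache_reservation(): returns 0 or -1. */
static int model_attach(struct model_task *t, int tcrid)
{
	if (tcrid < 0 || tcrid >= MAX_TCRE || !tcre_table[tcrid].in_use)
		return -1;
	if (t->nr == MAX_PER_TASK)
		return -1;
	t->tcrid[t->nr++] = tcrid;
	return 0;
}

/* Dispatcher steps for one incoming request, handled by two threads. */
static int model_dispatch(struct model_task *t1, struct model_task *t2,
			  unsigned long data_kb, unsigned long code_kb)
{
	int a = model_create(data_kb, CACHE_RSVT_TYPE_DATA);	/* tcrid-A */
	int b = model_create(code_kb, CACHE_RSVT_TYPE_CODE);	/* tcrid-B */

	if (a < 0 || b < 0)
		return -1;
	if (model_attach(t1, a) || model_attach(t1, b))
		return -1;
	if (model_attach(t2, a) || model_attach(t2, b))
		return -1;
	return 0;
}
```

Both threads end up holding the same two IDs, which is why a task needs
a list of reservations rather than a single one.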

6) Create reservations in such a way that the sum is larger than
total amount of cache, and CPU pinning (example from Karen Noel):

VM-1 on socket-1 with 80% of reservation.
VM-2 on socket-2 with 80% of reservation.
VM-1 pinned to socket-1.
VM-2 pinned to socket-2.

Cgroups interface attempts to set a cache mask globally. This is the
problem the "expand" proposal solves:
https://lkml.org/lkml/2015/7/29/682
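
A minimal sketch of the per-socket accounting that makes this example
work, assuming a reservation only consumes cache on the socket its task
is pinned to. The names are made up for illustration, sockets are
numbered from 0 here, and none of this is taken from the "expand"
proposal itself.

```c
#define NR_SOCKETS 2	/* example value */

/* Percentage of each socket's cache that is already reserved. */
static unsigned int reserved_pct[NR_SOCKETS];

/*
 * Hypothetical per-socket admission check: a reservation is charged
 * only to the socket the task is pinned to. Returns 0 on success,
 * -1 if that socket cannot fit the request.
 */
static int reserve_on_socket(unsigned int socket, unsigned int pct)
{
	if (socket >= NR_SOCKETS || pct > 100 - reserved_pct[socket])
		return -1;
	reserved_pct[socket] += pct;
	return 0;
}
```

Two 80% reservations on different sockets both succeed even though they
sum to 160%, while a second 80% reservation on the same socket is
refused.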