Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support

From: Marcelo Tosatti
Date: Mon Oct 19 2015 - 15:29:10 EST

Next message: j . glisse: "[PATCH] locking/lockdep: Fix expected depth value in __lock_release()"
Previous message: Shuah Khan: "Re: [RFC v4 9/9] kmsg: selftests"
In reply to: Peter Zijlstra: "Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support"
Next in thread: Marcelo Tosatti: "Re: [PATCH V15 00/11] x86: Intel Cache Allocation Technology Support"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Oct 16, 2015 at 11:44:52AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 15, 2015 at 09:17:16PM -0300, Marcelo Tosatti wrote:
> > On Thu, Oct 15, 2015 at 01:37:02PM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 13, 2015 at 07:40:58PM -0300, Marcelo Tosatti wrote:
> > > > How can you fix the issue of sockets with different reserved cache
> > > > regions with hw in the cgroup interface?
> > >
> > > No idea what you're referring to. But IOCTLs blow.
> >
> > Tejun brought up syscalls. Syscalls seem too generic.
> > So ioctls were chosen instead.
> >
> > It is necessary to perform the following operations:
> >
> > 1) create cache reservation (params = size, type).
>
> mkdir
>
> > 2) delete cache reservation.
>
> rmdir
>
> > 3) attach cache reservation (params = cache reservation id, pid).
> > 4) detach cache reservation (params = cache reservation id, pid).
>
> echo $pid > tasks
>
> > Can it be done via cgroups? If so, works for me.
>
> Trivially.

Fine.

Tejun brought up the problem of locking: how do you coordinate locking
between different users? (on the mkdir / rmdir scenario above).
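
To make the mkdir / rmdir / "echo $pid > tasks" mapping concrete, here is a
minimal userspace sketch. It assumes a cgroup-v1-style directory layout; the
root path, the group name and the cr_* helper names are hypothetical and not
part of the patch set. Note that nothing here serializes concurrent users,
which is exactly the locking question above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "<root>/<name>" or "<root>/<name>/<leaf>"; -1 if it does not fit. */
static int cr_path(char *buf, size_t len, const char *root,
		   const char *name, const char *leaf)
{
	int n = leaf ? snprintf(buf, len, "%s/%s/%s", root, name, leaf)
		     : snprintf(buf, len, "%s/%s", root, name);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* 1) create cache reservation == mkdir */
int cr_create(const char *root, const char *name)
{
	char path[512];

	if (cr_path(path, sizeof(path), root, name, NULL))
		return -1;
	return mkdir(path, 0755);
}

/* 2) delete cache reservation == rmdir */
int cr_delete(const char *root, const char *name)
{
	char path[512];

	if (cr_path(path, sizeof(path), root, name, NULL))
		return -1;
	return rmdir(path);
}

/*
 * 3) attach == echo $pid > tasks
 * (4, detach, would be the same write against another group's tasks file.)
 */
int cr_attach(const char *root, const char *name, pid_t pid)
{
	char path[512];
	FILE *f;
	int ok;

	if (cr_path(path, sizeof(path), root, name, "tasks"))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", (long)pid) > 0;
	if (fclose(f) != 0)	/* a rejected write may only show up at flush time */
		ok = 0;
	return ok ? 0 : -1;
}
```

Two users calling cr_create() with the same name race: one mkdir succeeds
and the other fails with EEXIST, but neither learns who else is using the
group.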

>
> > A list of problems with the cgroup interface has been written,
> > in the thread... and we found another problem.
>
> Which was endless and tiresome so I stopped reading.
>
> > List of problems with cgroup interface:
> >
> > 1) Global IPI on CBM <---> task change does not scale.
> >
> > /*
> >  * cbm_update_all() - Update the cache bit mask for all packages.
> >  */
> > static inline void cbm_update_all(u32 closid)
> > {
> > 	on_each_cpu_mask(&rdt_cpumask, cbm_cpu_update, (void *)closid, 1);
> > }
>
> There is no way around that, the moment you view the CBM as a global
> resource; ie. a CBM is configured the same on all sockets; you need to
> do this for a task using that CBM might run on any CPU at any time.
>
> This is not because of the cgroup interface at all. This is because you
> want CBMs to be the same machine wide.

You don't, for two reasons:

1) Item 6 below.
2) Item 7 below.

Please follow on with the discussion (just scroll down and read and
reply inline: item 6 and machine wide CBMs are not incompatible
because...).

> The only way to actually change that is to _be_ a cgroup and co-mount
> with cpusets and be incestuous and look at the cpusets state and
> discover disjoint groups.
>
> > 2) Syscall interface specification is in kbytes, not
> > cache ways (which is what must be recorded by the OS
> > to allow migration of the OS between different
> > hardware systems).
>
> Meh, that again is nothing fundamental. The cgroup interface could do
> bytes just the same.

Yes.

> > 3) Compilers are able to configure cache optimally for
> > given ranges of code inside applications, easily,
> > if desired.
>
> Yeah, so? Every SKU has a different cache size, so once you're down to
> that level you're pretty hard set in your configuration and it really
> doesn't matter if you give bytes or ways, you _KNOW_ what your
> configuration will be.

That item has nothing to do with whether the size is given in bytes or
in ways.

> > 4) Problem-2: The decision to allocate cache is tied to application
> > initialization / destruction, and application initialization is
> > essentially random from the POV of the system (the events which trigger
> > the execution of the application are not visible from the system).
> >
> > Think of a server running two different servers: one database
> > with requests that are received with poisson distribution, average 30
> > requests per hour, and every request takes 1 minute.
> >
> > One httpd server with nearly constant load.
> >
> > Without cache reservations, database requests take 2 minutes.
> > That is not acceptable for the database clients.
> > But with cache reservation, database requests take 1 minute.
> >
> > You want to maximize performance of httpd and database requests.
> > What do you do? You allow the database server to perform cache
> > reservation once a request comes in, and to undo the reservation
> > once the request is finished.
>
> > It's impossible to perform this with a centralized interface.
>
> Not so; just a wee bit more fragile than desired. But, this is a
> pre-existing problem with cgroups and needs to be solved, not using
> cgroups because of this is silly.
>
> Every cgroup that can work on tasks suffers this and arguably a few
> more.
>
> > 5) Modify scenario 2 above as follows: each database request
> > is handled by two newly created threads, and they share a certain
> > percentage of data cache, and a certain percentage of code cache.
> >
> > So the dispatcher thread, on arrival of request, has to:
> >
> > - create data cache reservation = tcrid-A.
> > - create code cache reservation = tcrid-B.
> > - create thread-1.
> > - assign tcrid-A and B to thread-1.
> > - create thread-2.
> > - assign tcrid-A and B to thread-2.
> >
> > 6) Create reservations in such a way that the sum is larger than
> > total amount of cache, and CPU pinning (example from Karen Noel):
> >
> > VM-1 on socket-1 with 80% of reservation.
> > VM-2 on socket-2 with 80% of reservation.
> > VM-1 pinned to socket-1.
> > VM-2 pinned to socket-2.
> >
> > Cgroups interface attempts to set a cache mask globally. This is the
> > problem the "expand" proposal solves:
> > https://lkml.org/lkml/2015/7/29/682
>
> That email is unparsable.

Look at item 6. If you create reservations in such a way that the sum
is larger than the total amount of cache, "cosid0", which is the
"unconstrained set of tasks" (ie: the rest of the system), has 0 bytes
of L3 cache to reclaim from.
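
To put numbers on item 6, here is a small sketch of the arithmetic. It
assumes a 20-way L3 per socket and exclusive (non-overlapping) reservations;
both are assumptions for illustration, and the function names are
hypothetical.

```c
#include <stdint.h>

/* Ways needed to cover `percent` of a cache with `ways` ways, rounded up. */
unsigned int ways_for_percent(unsigned int ways, unsigned int percent)
{
	return (ways * percent + 99) / 100;
}

/*
 * Machine-wide CBMs: every reservation consumes ways on every socket.
 * Returns the ways left for COSid0; a negative value means the
 * reservations do not fit at all.
 */
int free_ways_global(unsigned int ways, const unsigned int *percent,
		     unsigned int n)
{
	int left = (int)ways;
	unsigned int i;

	for (i = 0; i < n; i++)
		left -= (int)ways_for_percent(ways, percent[i]);
	return left;
}

/*
 * Per-socket CBMs: a reservation only consumes ways on the socket its
 * tasks are pinned to (socket_of[i]).
 */
int free_ways_on_socket(unsigned int ways, const unsigned int *percent,
			const unsigned int *socket_of, unsigned int n,
			unsigned int socket)
{
	int left = (int)ways;
	unsigned int i;

	for (i = 0; i < n; i++)
		if (socket_of[i] == socket)
			left -= (int)ways_for_percent(ways, percent[i]);
	return left;
}
```

With VM-1 at 80% on socket-1 and VM-2 at 80% on socket-2, each reservation
is 16 ways. Treated globally that is 32 ways out of 20 (12 short); treated
per socket, each socket keeps 4 ways for the rest of the system.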

> But the only way to sanely do so is to closely
> intertwine oneself with cpusets; doing that with anything other than
> another cgroup controller is absolutely full-on insane.

void __intel_rdt_sched_in(void)
{
	struct task_struct *task = current;
	unsigned int cpu = smp_processor_id();
	unsigned int this_socket = topology_physical_package_id(cpu);
	unsigned int start, end;

	/*
	 * The CBM bitmask for a particular task is enforced
	 * on sched-in to a given processor, and only for the
	 * range (cbm_start_bit,cbm_end_bit) which the
	 * tcr_list (COSid) owns.
	 * This way we allow COSid0 (global task pool) to use
	 * reserved L3 cache on sockets where the tasks that
	 * reserve the cache have not been scheduled.
	 *
	 * Since reading the MSRs is slow, it is necessary to
	 * cache the MSR CBM map on each socket.
	 */

	if (test_bit(this_socket,
		     task->tcrlist->synced_to_socket) == 0) {

Makes sense?
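
As a standalone illustration of the idea (not the kernel code), here is a
minimal userspace model of the lazy per-socket sync. The MSRs are simulated
with an array; the sizes, the struct layout and the tcr_* function names are
assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SOCKETS	8
#define MAX_CLOSID	16

/* Simulated per-socket L3 CBM registers, one slot per COSid. */
static uint32_t fake_msr[MAX_SOCKETS][MAX_CLOSID];

struct tcr_list {
	unsigned int closid;
	uint32_t cbm;
	/* Bit s set: socket s already has this CBM programmed. */
	unsigned long synced_to_socket;
};

/*
 * Called when a task using `t` is scheduled in on `socket`.
 * Programs the CBM only the first time the reservation is seen on that
 * socket; returns true if a (simulated) MSR write was needed.
 */
bool tcr_sched_in(struct tcr_list *t, unsigned int socket)
{
	unsigned long bit;

	if (socket >= MAX_SOCKETS || t->closid >= MAX_CLOSID)
		return false;
	bit = 1UL << socket;
	if (t->synced_to_socket & bit)
		return false;
	fake_msr[socket][t->closid] = t->cbm;
	t->synced_to_socket |= bit;
	return true;
}

/*
 * Changing the reservation only invalidates the cached state; no
 * cross-CPU call is made. Each socket picks up the new CBM at its
 * next sched-in of a task using `t`.
 */
void tcr_set_cbm(struct tcr_list *t, uint32_t cbm)
{
	t->cbm = cbm;
	t->synced_to_socket = 0;
}

/* Read back the simulated register (0 if out of range). */
uint32_t tcr_read_msr(unsigned int socket, unsigned int closid)
{
	if (socket >= MAX_SOCKETS || closid >= MAX_CLOSID)
		return 0;
	return fake_msr[socket][closid];
}
```

The trade-off in this model is that a task already running on another
socket keeps the old CBM until its next sched-in there; sockets where the
reservation never runs are never written, so COSid0 can keep using those
ways.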

>
> > 7) Consider two sockets with different region of L3 cache
> > shared with HW:
> >
> > CPUID.(EAX=10H, ECX=1):EBX[31:0] reports a bit mask. Each set bit
> > within the length of the CBM indicates the corresponding unit of the
> > L3 allocation may be used by other entities in the platform (e.g. an
> > integrated graphics engine or hardware units outside the processor
> > core and have direct access to L3). Each cleared bit within the
> > length of the CBM indicates the corresponding allocation unit can be
> > configured to implement a priority-based allocation scheme chosen by
> > an OS/VMM without interference with other hardware agents in the
> > system. Bits outside the length of the CBM are reserved.
> >
> > You want the kernel to maintain different bitmasks in the CBM:
> >
> > socket1 [range-A]
> > socket2 [range-B]
> >
> > And the kernel will automatically switch from range A to range B
> > when the thread switches sockets.
>
> This is firmly in the insane range of things.. not going to happen full
> stop.

Are you saying that hardware will guarantee reserved region is the same
for all sockets? I asked Vikas and he said this is not the case.
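
To show why a different shared region per socket leads to a different
usable range per socket, here is a small sketch. It takes the EBX bit mask
described in item 7 and the CBM length as plain inputs (no CPUID is executed
here) and finds the longest contiguous run of bits not shared with other
hardware. The struct and function names are hypothetical, and the 20-bit CBM
length in the example is an assumption.

```c
#include <stdint.h>

struct cbm_range {
	unsigned int start;	/* first bit of the run */
	unsigned int len;	/* number of bits; 0 if nothing is usable */
};

/*
 * Longest contiguous run of clear bits in shared_mask, looking only at
 * bits [0, cbm_len). Contiguity matters because a CBM must be a
 * contiguous run of set bits.
 */
struct cbm_range largest_unshared_run(uint32_t shared_mask,
				      unsigned int cbm_len)
{
	struct cbm_range best = { 0, 0 };
	unsigned int i, run_start = 0, run_len = 0;

	for (i = 0; i < cbm_len && i < 32; i++) {
		if (shared_mask & ((uint32_t)1 << i)) {
			run_len = 0;	/* shared bit ends the current run */
			continue;
		}
		if (run_len == 0)
			run_start = i;
		run_len++;
		if (run_len > best.len) {
			best.start = run_start;
			best.len = run_len;
		}
	}
	return best;
}
```

For example, with a 20-bit CBM, a socket reporting 0x00003 as shared can use
bits 2..19, while a socket reporting 0xC0000 can use bits 0..17: the same
18-way reservation needs a different (start,end) on each socket.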

> If a thread can freely schedule between two CPUs its configuration on
> those two CPUs had better bloody be the same.

It's just the (start,end) of the CBM which changes, so on
__intel_rdt_sched_in you do:

struct per_socket_data *psd = get_socket_data(this_socket);
struct cache_layout *layout = psd->layout;

start = task->tcrlist->psd[layout->id].cbm_start;
end = task->tcrlist->psd[layout->id].cbm_end;
sync_to_msr(task->tcrlist, start, end);

Please clarify what you mean.
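
For concreteness, a minimal sketch of how a per-layout (start,end) pair
could be turned into the CBM value that gets written. It assumes `end` is
inclusive and at most 4 distinct cache layouts; the struct layout and
function names are assumptions for illustration only.

```c
#include <stdint.h>

#define MAX_LAYOUTS 4

struct tcr_range {
	unsigned int cbm_start;	/* first bit owned by the reservation */
	unsigned int cbm_end;	/* last bit owned (inclusive, assumed) */
};

struct tcr_list {
	/* One range per distinct cache layout, indexed by layout id. */
	struct tcr_range psd[MAX_LAYOUTS];
};

/* Contiguous mask with bits start..end set; 0 for an invalid range. */
uint32_t cbm_from_range(unsigned int start, unsigned int end)
{
	unsigned int width;
	uint32_t ones;

	if (start > end || end > 31)
		return 0;
	width = end - start + 1;
	/* Avoid the undefined 32-bit shift when the range covers all bits. */
	ones = (width == 32) ? UINT32_MAX : (((uint32_t)1 << width) - 1);
	return ones << start;
}

/* CBM to program for `t` on a socket whose layout has id `layout_id`. */
uint32_t tcr_cbm_for_layout(const struct tcr_list *t, unsigned int layout_id)
{
	if (layout_id >= MAX_LAYOUTS)
		return 0;
	return cbm_from_range(t->psd[layout_id].cbm_start,
			      t->psd[layout_id].cbm_end);
}
```

The reservation size (number of ways) is the same on every socket; only the
position of the run inside the CBM differs between layouts.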
