Re: [RFD] CAT user space interface revisited

From: Marcelo Tosatti
Date: Wed Jan 06 2016 - 07:47:08 EST


On Wed, Jan 06, 2016 at 12:09:50AM +0100, Thomas Gleixner wrote:
> Marcelo,
>
> On Mon, 4 Jan 2016, Marcelo Tosatti wrote:
> > On Thu, Dec 31, 2015 at 11:30:57PM +0100, Thomas Gleixner wrote:
> > > I don't have an idea what that would look like. The current structure is a
> > > cgroups-based, hierarchy-oriented approach, which does not allow simple things
> > > like
> > >
> > > T1 00001111
> > > T2 00111100
> > >
> > > at least not in a way which is natural to the problem at hand.
> >
> >
> >
> > cgroupA/
> >
> > cbm_mask (if set, set for all CPUs)
>
> You mean sockets, right?
>
> >
> > socket1/cbm_mask
> > socket2/cbm_mask
> > ...
> > socketN/cbm_mask (if set, overrides global cbm_mask).
> >
> > Something along those lines.
> >
> > Do you see any problem with it?
>
> So for that case:
>
> task1: cbm_mask 00001111
> task2: cbm_mask 00111100
>
> i.e. task1 and task2 share bits 2 and 3 of the mask.
>
> I need to have two cgroups: cgroup1 and cgroup2, task1 is member of cgroup1
> and task2 is member of cgroup2, right?
>
> So now add some more of this and then figure out which cbm_masks are in use
> on which socket. That means I need to go through all cgroups and find the
> cbm_masks there.

Yes.

> With my proposed directory structure you get a very clear view about the
> in-use closids and the associated cbm_masks. That view represents the hardware
> in the best way. With the cgroups stuff we get an artificial representation
> which does not tell us anything about the in-use closids and the associated
> cbm_masks.

Because you expose the cos-id ---> cbm/cdp mask association directly.

Fine, I agree that's nice.
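
For example, under your layout the in-use cos-ids and their masks could be
read off directly (paths follow your proposal below; the values are made up):

# cat xxxx/cat/socket-0/cos-id-1/in-use
1
# cat xxxx/cat/socket-0/cos-id-1/cat_mask
00ff0000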

> > > I cannot imagine how that modification to the current interface would solve
> > > that. Not to talk about per CPU associations which are not related to tasks at
> > > all.
> >
> > Not sure what you mean by per CPU associations.
>
> As I wrote before:
>
> "It would even be sufficient for particular use cases to just associate
> a piece of cache to a given CPU and do not bother with tasks at all."
>
> > If you fix a cbmmask on a given pCPU, say CPU1, and control which tasks
> > run on that pCPU, then you control the cbmmask for all tasks (say
> > tasklist-1) on that CPU, fine.
> >
> > Can achieve the same by putting all tasks from tasklist-1 into a
> > cgroup.
>
> Which means that I need to go and find everything including kernel threads
> and put them into a particular cgroup. That's really not useful and it simply
> does not work:
>
> To which cgroup belongs a dynamically created per cpu worker thread? To the
> cgroup of the parent. But is the parent necessarily in the proper cgroup? No,
> there is no guarantee. So it ends up in some random cgroup unless I start
> chasing every new thread, instead of letting it use the default cosid of the
> CPU.

Well, I suppose cgroups has facilities to handle this? That is, what is
required is:

On task creation, move the new task to a particular cgroup, based on
some visible characteristic of the task (process name matching OR explicit
kernel thread creator specification OR ...).
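
As a rough sketch of what such a facility could look like from user space
(the intel_rdt mount point and the "rtworker" thread name are assumptions,
and a real solution would hook task creation instead of scanning after the
fact):

# Move every thread whose comm matches "rtworker" into cgroupRSVD.
for T in /proc/[0-9]*/task/[0-9]*; do
        if grep -q '^rtworker' $T/comm 2>/dev/null; then
                echo ${T##*/} > /sys/fs/cgroup/intel_rdt/cgroupRSVD/tasks
        fi
done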

There are two cases. Consider a kernel thread T which contains
timing-sensitive code and therefore requires a COSID (which means
using a reserved portion of the cache).

Case 1) kernel thread T starts kernel thread R, which is also timing
sensitive (and wants to use the same COSID as kernel thread T).
In that case, the cgroup default behaviour (inherit the cgroup from
the parent) is correct.

Case 2) kernel thread T starts kernel thread X, which is not timing
sensitive, therefore kernel thread X should use the default COSID.
With cgroups, in the example used elsewhere in this thread,
kernel thread X should be moved to "cgroupALL".

Strictly speaking there is a third case:

Case 3) kernel thread T starts kernel thread Z, which wants to
be moved to a COSID different from kernel thread T's.

So using the default COSID is not necessarily the correct thing to do;
this should be configurable on a per-case basis. A sketch of cases 2
and 3 in cgroup terms follows below.
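
To make cases 2 and 3 concrete with the cgroup names used elsewhere in this
thread (the mount point, the PIDs and the cgroupOTHER name are hypothetical):

# Case 2: X inherited T's cgroup on creation but is not timing
# sensitive, so move it back to the catch-all group:
echo $X_PID > /sys/fs/cgroup/intel_rdt/cgroupALL/tasks

# Case 3: Z wants a COSID different from T's, so move it to a
# group with a different mask:
echo $Z_PID > /sys/fs/cgroup/intel_rdt/cgroupOTHER/tasks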

> Having a per cpu default cos-id which is used when the task does not have a
> cos-id associated makes a lot of sense and makes it simpler to utilize that
> facility.

You would need a facility to switch to "inherit cgroup from parent"
mode, and also to handle case 3 (which I suppose cgroups does, because
the same problem exists for other cgroup controllers).

> > > >> Per cpu default cos id for the cpus on that socket:
> > > >>
> > > >> xxxxxxx/cat/socket-N/cpu-x/default_cosid
> > > >> ...
> > > >> xxxxxxx/cat/socket-N/cpu-N/default_cosid
> > > >>
> > > >> The above allows a simple cpu based partitioning. All tasks which do
> > > >> not have a cache partition assigned on a particular socket use the
> > > >> default one of the cpu they are running on.
> > >
> > > Where is that information in (*2) and how is that related to (*1)? If you
> > > think it's not required, please explain why.
> >
> > Not required because with the current Intel patchset you'd do:
>
> <SNIP>
> ...
> </SNIP>
>
> > # cat intel_rdt.l3_cbm
> > 000ffff0
> > # cat ../cgroupALL/intel_rdt.l3_cbm
> > 000000ff
> >
> > Bits f0 are shared between cgroupRSVD and cgroupALL. Lets change:
> > # echo 0xf > ../cgroupALL/intel_rdt.l3_cbm
> > # cat ../cgroupALL/intel_rdt.l3_cbm
> > 0000000f
> >
> > Now they share none.
>
> Well, you changed ALL and everything, but you still did not assign a
> particular cos-id to a particular CPU as their default.
>
> > > >> Now for the task(s) partitioning:
> > > >>
> > > >> xxxxxxx/cat/partitions/
> > > >>
> > > >> Under that directory one can create partitions
> > > >>
> > > >> xxxxxxx/cat/partitions/p1/tasks
> > > >> /socket-0/cosid
> > > >> ...
> > > >> /socket-n/cosid
> > > >>
> > > >> The default value for the per socket cosid is COSID_DEFAULT, which
> > > >> causes the task(s) to use the per cpu default id.
> > >
> > > Where is that information in (*2) and how is that related to (*1)? If you
> > > think it's not required, please explain why.
> > >
> > > Yes. I ask the same question several times and I really want to see the
> > > directory/interface structure which solves all of the above before anyone
> > > starts to implement it.
> >
> > I don't see the problem; I have a sequence of commands above which shows
> > how to set up a directory structure which is useful and does what the HW
> > interface is supposed to do.
>
> Well, you have a sequence of commands, which gives you the result which you
> need for your particular problem.
>
> > > We already have a completely useless interface (*1)
> > > and there is no point to implement another one based on it (*2) just because
> > > it solves your particular issue and is the fastest way forward. User space
> > > interfaces are hard and we really do not need some half baken solution which
> > > we have to support forever.
> >
> > Fine. Can you please tell me what I can't do with the current interface?
> > AFAICS everything can be done (except missing support for (*2)).
>
> 1) There is no consistent view of the facility. Is a sysadmin supposed to add
> printks to the kernel to figure that out or should he keep track of that
> information on a piece of paper? Neither option is useful if you have to
> analyze a system which was set up 3 months ago.

Parse the cgroups CAT directory.
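
For example, assuming the intel_rdt hierarchy is mounted at
/sys/fs/cgroup/intel_rdt (an assumption, not the only possible mount point):

# Print each cgroup's l3_cbm to get the full set of in-use masks:
for F in $(find /sys/fs/cgroup/intel_rdt -name intel_rdt.l3_cbm); do
        echo "${F%/intel_rdt.l3_cbm}: $(cat $F)"
done

It works, but I agree it is not as direct as your per-socket cos-id view.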

> 2) Simple and easy default settings for tasks and CPUs
>
> Rather than forcing the admin to find everything which might be related,
> it's way better to have configurable defaults.
>
> The cgroup interface allows you to set a task default, because everything
> which is in the root group is going to use that, but that default is
> useless. See #3
>
> That still does not give me a simple and easy to use way to set a per
> cpu default.

I also dislike the cgroups interface; your proposal is indeed nicer.

>
> 3) Non hierarchical setup
>
> The current interface has a rdt_root_group, which is set up at init. That
> group uses a closid (one of the few we have). And you cannot use that thing
> for anything else than having all bits set in the mask because all groups
> you create underneath must be a subset of the parent group.
>
> That is simply crap.
>
> We force something which is entirely not hierarchical into a structure
> which is designed for hierarchical problems and thereby waste one of the
> scarce and precious resources.

Agree.

> That technology is straight forward partitioning and has nothing
> hierarchical at all.
>
> The decision to use cgroups was wrong in the beginning and it does not become
> better by pretending that it solves some particular use cases and by repeating
> that everything can be solved with it.
>
> If all I have is a hammer I certainly can pretend that everything is a
> nail. We all know how well that works ....
>
> > > 4) Per cpu default cos-id association
> >
> > This already exists, and as noted in the command sequence above,
> > works just fine. Please explain what problem are you seeing.
>
> No it does not exist. You emulate it by forcing stuff into cgroups which is
> not at all required if you have a proper and well thought out interface.
>
> > > 5) Task association to cos-id
> >
> > Not sure what that means. Please explain.
>
> > > >> xxxxxxx/cat/partitions/p1/tasks
> > > >> /socket-0/cosid
> > > >> ...
> > > >> /socket-n/cosid
>
> Yes, I agree that this is very similar to the cgroup mechanism, but it is not
> in a pointless hierarchy. It's just the last step of the mechanism which I
> proposed to represent the hardware in the best way and give the admin the
> required flexibility. Again:
>
> This is general information:
>
> xxxxxxx/cat/max_cosids
> xxxxxxx/cat/max_maskbits
> xxxxxxx/cat/cdp_enable
>
> This is per socket information and per socket cos-id management
>
> xxxxxxx/cat/socket-0/...
> xxxxxxx/cat/socket-N/hwsharedbits
> /cos-id-0/...
> /cos-id-N/in-use
> /cat_mask
> /cdp_mask
>
> This is per cpu default cos-id association
>
> xxxxxxx/cat/socket-0/...
> xxxxxxx/cat/socket-N/cpu-x/default_cosid
> xxxxxxx/cat/socket-N/cpu-N/default_cosid
>
> This is partitioning, where tasks are associated.
>
> xxxxxxx/cat/partitions/
> xxxxxxx/cat/partitions/p1/tasks
> /socket-0/cosid
> /socket-N/cosid
>
> That's the part which can be expressed with cgroups somehow, but for the price
> of losing a cosid and having a pointless hierarchy. Again, there is nothing
> hierarchical in RDT/CAT/CDP. It's all about partitioning and unfortunately the
> number of possible partitions is massively limited.
>
> I asked you last time already, but you just gave me random shell commands to
> show that it can be done. I ask again:
>
> Can you please explain in a simple directory based scheme, like the one I
> gave you above how all of these points are going to be solved with "some
> modifications" to the existing cgroup thingy.
>
> And just for completeness, lets look at a practical real world use case:
>
> 1 Socket
> 18 CPUs
> 4 COSIds (Yes, that's all that hardware gives me)
> 32 mask bits
> 2 hw shared bits at position 30 and 31
>
> Have 4 CPU partitions:
>
> CPU 0 - 5 general tasks
> CPU 6 - 9 RT1
> CPU 10 - 13 RT2
> CPU 14 - 17 RT3
>
> Let each CPU partition have 1/4 of the cache.
>
> Here is my solution:
>
> # echo 0xff000000 > xxxx/cat/socket-0/cosid-0
> # echo 0x00ff0000 > xxxx/cat/socket-0/cosid-1
> # echo 0x0000ff00 > xxxx/cat/socket-0/cosid-2
> # echo 0x000000ff > xxxx/cat/socket-0/cosid-3
>
> # for CPU in 0 1 2 3 4 5; do
> # echo 0 > xxxx/cat/socket-0/cpu-$CPU/default_cosid;
> # done
>
> # CPU=6
> # while [ $CPU -lt 18 ]; do
> # let "ID = 1 + (($CPU - 6) / 4)"
> # echo $ID > xxxx/cat/socket-0/cpu-$CPU/default_cosid;
> # let "CPU += 1"
> # done
>
> That's it. Simple, right?
>
> The really interesting thing here is that you can't do that at all with the
> current cgroup thingy. Simply because you run out of cosids.
>
> Even if you have enough COSids on a newer CPU, your solution will be way
> more complex and you still have not solved the issue of chasing kernel
> threads etc.
>
> Thanks,
>
> tglx

Fine, I agree. We need to solve the problem of COSID assignment on
creation of kernel threads, as discussed above (the three cases).

Fenghua, what are your thoughts?
