Re: [RFD] CAT user space interface revisited
From: Thomas Gleixner
Date: Tue Jan 05 2016 - 18:11:39 EST
Marcelo,
On Mon, 4 Jan 2016, Marcelo Tosatti wrote:
> On Thu, Dec 31, 2015 at 11:30:57PM +0100, Thomas Gleixner wrote:
> > I don't have an idea how that would look like. The current structure is a
> > cgroups based hierarchy oriented approach, which does not allow simple things
> > like
> >
> > T1 00001111
> > T2 00111100
> >
> > at least not in a way which is natural to the problem at hand.
>
>
>
> cgroupA/
>
> cbm_mask (if set, set for all CPUs)
You mean sockets, right?
>
> socket1/cbm_mask
> socket2/cbm_mask
> ...
> socketN/cbm_mask (if set, overrides global cbm_mask).
>
> Something along those lines.
>
> Do you see any problem with it?
So for that case:
task1: cbm_mask 00001111
task2: cbm_mask 00111100
i.e. task1 and task2 share bits 2 and 3 of the mask.
I need two cgroups: cgroup1 and cgroup2; task1 is a member of cgroup1 and
task2 is a member of cgroup2, right?
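With the intel_rdt cgroup controller from the posted patchset that would be
roughly (mount point and PIDs made up for illustration):

 # cd /sys/fs/cgroup/intel_rdt
 # mkdir cgroup1 cgroup2
 # echo 0x0f > cgroup1/intel_rdt.l3_cbm
 # echo 0x3c > cgroup2/intel_rdt.l3_cbm
 # echo $TASK1_PID > cgroup1/tasks
 # echo $TASK2_PID > cgroup2/tasks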
So now add some more of this and then figure out which cbm_masks are in use
on which socket. That means I need to go through all cgroups and find the
cbm_masks there.
With my proposed directory structure you get a very clear view about the
in-use closids and the associated cbm_masks. That view represents the hardware
in the best way. With the cgroups stuff we get an artificial representation
which does not tell us anything about the in-use closids and the associated
cbm_masks.
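To make that concrete: with the layout I propose below, the admin could get
the complete picture with something as simple as (paths as in the proposal,
purely illustrative):

 # grep . xxxx/cat/socket-*/cos-id-*/in-use
 # grep . xxxx/cat/socket-*/cos-id-*/cat_mask

No printks, no paper notes.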
> > I cannot imagine how that modification to the current interface would solve
> > that. Not to talk about per CPU associations which are not related to tasks at
> > all.
>
> Not sure what you mean by per CPU associations.
As I wrote before:
"It would even be sufficient for particular use cases to just associate
a piece of cache to a given CPU and do not bother with tasks at all."
> If you fix a cbmmask on a given pCPU, say CPU1, and control which tasks
> run on that pCPU, then you control the cbmmask for all tasks (say
> tasklist-1) on that CPU, fine.
>
> Can achieve the same by putting all tasks from tasklist-1 into a
> cgroup.
Which means that I need to go and find everything, including kernel threads,
and put it into a particular cgroup. That's really not useful and it simply
does not work:
Which cgroup does a dynamically created per cpu worker thread belong to? The
cgroup of its parent. But is the parent necessarily in the proper cgroup? No,
there is no guarantee. So it ends up in some random cgroup unless I start
chasing every new thread, instead of letting it use the default cosid of the
CPU.
Having a per cpu default cos-id, which is used whenever a task has no cos-id
associated, makes a lot of sense and makes the facility much simpler to use.
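A minimal sketch of that, using the layout proposed below (CPU number and
cos-id value made up):

 # echo 2 > xxxx/cat/socket-0/cpu-7/default_cosid

From then on any task without an explicit cos-id association, freshly
spawned kernel threads included, uses cos-id 2 while it runs on CPU 7. No
chasing required.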
> > >> Per cpu default cos id for the cpus on that socket:
> > >>
> > >> xxxxxxx/cat/socket-N/cpu-x/default_cosid
> > >> ...
> > >> xxxxxxx/cat/socket-N/cpu-N/default_cosid
> > >>
> > >> The above allows a simple cpu based partitioning. All tasks which do
> > >> not have a cache partition assigned on a particular socket use the
> > >> default one of the cpu they are running on.
> >
> > Where is that information in (*2) and how is that related to (*1)? If you
> > think it's not required, please explain why.
>
> Not required because with the current Intel patchset you'd do:
<SNIP>
...
</SNIP>
> # cat intel_rdt.l3_cbm
> 000ffff0
> # cat ../cgroupALL/intel_rdt.l3_cbm
> 000000ff
>
> Bits f0 are shared between cgroupRSVD and cgroupALL. Let's change:
> # echo 0xf > ../cgroupALL/intel_rdt.l3_cbm
> # cat ../cgroupALL/intel_rdt.l3_cbm
> 0000000f
>
> Now they share none.
Well, you changed cgroupALL's mask and whatnot, but you still did not assign
a particular cos-id to a particular CPU as its default.
> > >> Now for the task(s) partitioning:
> > >>
> > >> xxxxxxx/cat/partitions/
> > >>
> > >> Under that directory one can create partitions
> > >>
> > >> xxxxxxx/cat/partitions/p1/tasks
> > >>                          /socket-0/cosid
> > >>                          ...
> > >>                          /socket-n/cosid
> > >>
> > >> The default value for the per socket cosid is COSID_DEFAULT, which
> > >> causes the task(s) to use the per cpu default id.
> >
> > Where is that information in (*2) and how is that related to (*1)? If you
> > think it's not required, please explain why.
> >
> > Yes. I ask the same question several times and I really want to see the
> > directory/interface structure which solves all of the above before anyone
> > starts to implement it.
>
> I don't see the problem; the sequence of commands above shows how to set
> up a directory structure which is useful and does what the HW interface is
> supposed to do.
Well, you have a sequence of commands, which gives you the result which you
need for your particular problem.
> > We already have a completely useless interface (*1)
> > and there is no point to implement another one based on it (*2) just because
> > it solves your particular issue and is the fastest way forward. User space
> > interfaces are hard and we really do not need some half baked solution which
> > we have to support forever.
>
> Fine. Can you please tell me what I can't do with the current interface?
> AFAICS everything can be done (except missing support for (*2)).
1) There is no consistent view of the facility. Is a sysadmin supposed to add
printks to the kernel to figure that out or should he keep track of that
information on a piece of paper? Neither option is useful if you have to
analyze a system which was set up 3 months ago.
2) Simple and easy default settings for tasks and CPUs
Rather than forcing the admin to find everything which might be related,
it's way better to have configurable defaults.
The cgroup interface allows you to set a task default, because everything
which is in the root group is going to use that, but that default is
useless. See #3
That still does not give me a simple and easy to use way to set a per cpu
default.
3) Non hierarchical setup
The current interface has a rdt_root_group, which is set up at init. That
group uses a closid (one of the few we have), and you cannot use that group
for anything other than having all bits set in the mask, because all groups
you create underneath must be a subset of the parent group.
That is simply crap.
We force something which is not hierarchical at all into a structure
designed for hierarchical problems, and thereby waste one of the scarce and
precious resources.
The technology is straightforward partitioning and has nothing hierarchical
about it.
The decision to use cgroups was wrong in the beginning and it does not become
better by pretending that it solves some particular use cases and by repeating
that everything can be solved with it.
If all I have is a hammer I certainly can pretend that everything is a
nail. We all know how well that works ....
> > 4) Per cpu default cos-id association
>
> This already exists, and as noted in the command sequence above,
> works just fine. Please explain what problem are you seeing.
No, it does not exist. You emulate it by forcing stuff into cgroups, which is
not at all required if you have a proper and well thought out interface.
> > 5) Task association to cos-id
>
> Not sure what that means. Please explain.
> > >> xxxxxxx/cat/partitions/p1/tasks
> > >>                          /socket-0/cosid
> > >>                          ...
> > >>                          /socket-n/cosid
Yes, I agree that this is very similar to the cgroup mechanism, but it is not
in a pointless hierarchy. It's just the last step of the mechanism which I
proposed to represent the hardware in the best way and give the admin the
required flexibility. Again:
This is general information:
xxxxxxx/cat/max_cosids
xxxxxxx/cat/max_maskbits
xxxxxxx/cat/cdp_enable
This is per socket information and per socket cos-id management:
xxxxxxx/cat/socket-0/...
xxxxxxx/cat/socket-N/hwsharedbits
                    /cos-id-0/...
                    /cos-id-N/in-use
                             /cat_mask
                             /cdp_mask
This is per cpu default cos-id association:
xxxxxxx/cat/socket-0/...
xxxxxxx/cat/socket-N/cpu-x/default_cosid
...
xxxxxxx/cat/socket-N/cpu-N/default_cosid
This is partitioning, where tasks are associated:
xxxxxxx/cat/partitions/
xxxxxxx/cat/partitions/p1/tasks
                         /socket-0/cosid
                         ...
                         /socket-N/cosid
That's the part which can be expressed with cgroups somehow, but for the price
of losing a cosid and having a pointless hierarchy. Again, there is nothing
hierarchical in RDT/CAT/CDP. It's all about partitioning and unfortunately the
number of possible partitions is massively limited.
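For completeness, associating a task to a partition under the proposed
scheme would be nothing more than (partition name, cos-id and PID are
illustrative):

 # mkdir xxxx/cat/partitions/p1
 # echo 1 > xxxx/cat/partitions/p1/socket-0/cosid
 # echo $PID > xxxx/cat/partitions/p1/tasks

So the task related part stays as simple as the cgroup variant, just without
burning a closid on a root group.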
I asked you last time already, but you just gave me random shell commands to
show that it can be done. I ask again:
Can you please explain, in a simple directory based scheme like the one I
gave you above, how all of these points are going to be solved with "some
modifications" to the existing cgroup thingy?
And just for completeness, let's look at a practical real world use case:
1 Socket
18 CPUs
4 COSIds (Yes, that's all that hardware gives me)
32 mask bits
2 hw shared bits at positions 30 and 31
Have 4 CPU partitions:
CPU 0 - 5 general tasks
CPU 6 - 9 RT1
CPU 10 - 13 RT2
CPU 14 - 17 RT3
Let each CPU partition have 1/4 of the cache.
Here is my solution:
# echo 0xff000000 > xxxx/cat/socket-0/cos-id-0/cat_mask
# echo 0x00ff0000 > xxxx/cat/socket-0/cos-id-1/cat_mask
# echo 0x0000ff00 > xxxx/cat/socket-0/cos-id-2/cat_mask
# echo 0x000000ff > xxxx/cat/socket-0/cos-id-3/cat_mask
# for CPU in 0 1 2 3 4 5; do
#     echo 0 > xxxx/cat/socket-0/cpu-$CPU/default_cosid
# done
# CPU=6
# while [ $CPU -lt 18 ]; do
#     let "ID = 1 + (($CPU - 6) / 4)"
#     echo $ID > xxxx/cat/socket-0/cpu-$CPU/default_cosid
#     let "CPU += 1"
# done
That's it. Simple, right?
The really interesting thing here is that you can't do that at all with the
current cgroup thingy, simply because you run out of cosids: the
rdt_root_group occupies one of the 4 COSids, which leaves only 3 for the 4
distinct CPU partitions.
Even if you have enough COSids on a newer CPU, your solution will be way
more complex and you still have not solved the issue of chasing kernel
threads etc.
Thanks,
tglx