Quoting Glauber Costa (glommer@xxxxxxxxxxxxx):
Can't we just introduce the
/sys/fs/cgroup/memory/memory.proc etc files, and have the procfs code,
if cgroups are enabled and the task's memory cgroup != '/', return
the data from that file?
First: If we're doing that, why do we need that file in the first place?
We might not :) But we might, if we want to offer containers a choice of
whether /proc/meminfo is the host's or the container's.
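To make the check under discussion concrete, here is a minimal userspace sketch of the decision, not kernel code. It assumes the cgroup v1 `/proc/<pid>/cgroup` line format (`id:controllers:path`) and the default mount point `/sys/fs/cgroup/memory`. `memory.proc` is the file proposed in this thread, not an existing interface, and the function names are made up for illustration.

```python
def memory_cgroup_path(proc_cgroup_text):
    """Return the memory-controller path found in /proc/<pid>/cgroup content."""
    for line in proc_cgroup_text.splitlines():
        # cgroup v1 lines look like "4:memory:/lxc/c1"
        parts = line.split(":", 2)
        if len(parts) == 3 and "memory" in parts[1].split(","):
            return parts[2]
    return "/"  # no memory line found: treat the task as being in the root cgroup


def meminfo_source(proc_cgroup_text, cgroup_mount="/sys/fs/cgroup/memory"):
    """Pick which file a read of /proc/meminfo would be served from."""
    path = memory_cgroup_path(proc_cgroup_text)
    if path == "/":
        return "/proc/meminfo"  # host view
    # memory.proc is the proposed (hypothetical) per-cgroup file
    return cgroup_mount + path + "/memory.proc"
```

A task in the root memory cgroup keeps the host view, and anything else gets redirected to its own cgroup's file.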
I am sorry, I think I missed you mentioning this file. The file is useful if we're bind mounting, but if we're
automatically displaying it according to any criteria, it is not that
interesting. Well, it would allow the root container to view it, so
maybe it is in fact interesting...
As for cgroup != '/', I am not sure if it works. Well, for
containers, it works beautifully. But what we have in the kernel now
is a mechanism for resource control (cgroups) and a mechanism for
isolation (namespaces). Displaying data falls in the isolation
realm. There are users using just the resource control part (think
of systemd). I doubt they'd like to suddenly, after years expecting
system-wide info, read per-cgroup data when querying a /proc file.
That's where the /sys/fs/cgroup/memory/memory.use_cgroup_as_proc file
I mentioned below would come in. The host could choose to give
that application the host /proc/meminfo view.
Think /proc/stat, the file I am working on now, as an example.
Still, if the applications you are thinking of are having their
resources restricted, what harm would come of reporting their actual
allotted resources in place of an artificially inflated number?
So it is because I'm all for automatic that I am proposing this. I
think we need a mechanism to tie a cgroup to a namespace (or many,
one of each kind).
I myself can settle for:
* If namespace != '/' => show cgroup information instead of
system-wide. (What do you think?)
I don't like it :)
The namespaces are about name->object relations, not just about
isolation. In contrast, the cgroups are precisely about resource
The only reason I proposed anything more complicated than that, is
that I was fearing there were weirdos out there for whom "every
process in a cgroup is in the same namespace" wouldn't hold, and
they'd want to opt this out. But I honestly think this is a very
Don't get me wrong, I don't think it would hurt to always give them
the cgroup data. I just think the check is not 'correct'.
We might also want to have a /sys/fs/cgroup/memory/memory.show_proc_data
(etc) file which defaults to 1 (show the cgroup's file data in place of
/proc/meminfo), which can be set to 0 on the host so that the container,
if it wants, can see the host's data.
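A small sketch of how such a toggle could be read, again from userspace and purely illustrative. `memory.show_proc_data` is the file proposed above, not an existing interface. The helper name and the choice to treat a missing file as the default value 1 are assumptions.

```python
import os


def wants_cgroup_view(cgroup_dir):
    """Return True if the cgroup's data should replace /proc/meminfo."""
    # memory.show_proc_data is the proposed (hypothetical) control file
    flag = os.path.join(cgroup_dir, "memory.show_proc_data")
    try:
        with open(flag) as f:
            value = f.read().strip()
    except FileNotFoundError:
        return True  # assumed default of 1: show the cgroup's data
    # Only an explicit 0 written by the host switches back to the host view
    return value != "0"
```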
This idea is almost setup-free (with the exception of dumping pids
into the cgroup files, but if the files are default for all cgroups,
a 3-line loop can do it in a very future-proof way). But in reality,
what appeals to me about it, is that it is a mechanism for coupling
entities that in our case, should be the same. It provides stronger
guarantees that we will never be able to see any data outside the
ones we are entitled to, even if we get the bind mounts set up wrongly.
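The "3-line loop" could look roughly like the sketch below. It assumes a cgroup v1 layout with one directory per controller under a common root, each holding a `tasks` file for the group. The function name and arguments are illustrative only.

```python
import os


def attach_pid(cgroup_root, group, pid):
    """Write pid into <root>/<controller>/<group>/tasks for every controller."""
    attached = []
    for controller in sorted(os.listdir(cgroup_root)):
        tasks = os.path.join(cgroup_root, controller, group, "tasks")
        if os.path.isfile(tasks):  # skip controllers where the group does not exist
            with open(tasks, "w") as f:
                f.write(f"{pid}\n")
            attached.append(controller)
    return attached
```

Because the loop walks whatever controllers are present instead of naming them, it keeps working when new controllers are added, which is the future-proofing mentioned above.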
(disclaimer: wild idea ahead)
We could, for instance, code it in such a way that if a certain proc-file
is per-namespace, the task gets no data at all unless a
cgroup-binding is set, providing stronger isolation guarantees.
Are there good reasons to worry about guaranteeing this particular
isolation? My impression was that this stuff is useful for the
application - the better it can calculate the resources available
to it, the better it can get along with others and avoid getting killed
later. But I didn't think our goal was to try and hide the host
info from the container - we just want to give it most meaningful
First of all, note that I am not overly concerned about that.
But it may prove useful.
If I am in a container side by side with yours, I'd prefer you wouldn't
be able to guess anything about me, including my workload type,
memory usage, etc, and this could be used by clever exploiters.
Besides, /proc holds all sorts of stuff. Networking routing tables
and connection status, for example. Those are not just statistics,
and should maybe be totally hidden.
I think that should be done separately from this whole discussion - using
user namespaces. Any task in a non-initial user namespace will only
get the world access rights to a procfile. So if the file isn't world
readable, then a container won't be able to read it.
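As a rough model of that rule (an assumption for illustration; the kernel's real permission checks are more involved), a task limited to world access rights can read a file only if the "other" read bit is set in its mode:

```python
import stat


def readable_with_world_rights(mode):
    """True if a file with this mode is readable using only the 'other' bits."""
    return bool(mode & stat.S_IROTH)
```

So a 0444 procfile stays visible to the container under this model, while a 0400 one does not.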
(That's probably also why this stuff has been languishing - it's
rather low in priority because unlike other things it won't harm
Agreed about that. But hey, at some point it has to be done...