Re: [PATCH v2] xattr: Enable security.capability in user namespaces

From: Stefan Berger
Date: Thu Jul 13 2017 - 13:06:01 EST

On 07/13/2017 12:40 PM, Theodore Ts'o wrote:
On Thu, Jul 13, 2017 at 07:11:36AM -0500, Eric W. Biederman wrote:
The concise summary:

Today we have the xattr security.capability that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.

User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.
So correct me if I am wrong; in general, there will only be one
variant of the form:

It's not like there will be:

A file shared by 2 containers, one mapping root to uid=1000, the other mapping root to uid=2000, will show these two xattrs on the host (init_user_ns) once these containers set xattrs on that file.
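
To make the naming concrete, here is a minimal sketch, not the patch's code. It assumes the host-side name is the base xattr name plus an "@uid=<host uid of the container's root>" suffix, and that a root uid of 0 keeps the plain name; the helper host_xattr_name is hypothetical.

```python
BASE = "security.capability"


def host_xattr_name(root_host_uid: int, base: str = BASE) -> str:
    """Return the name the xattr is assumed to carry on the host.

    root_host_uid is the host (init_user_ns) uid that the container's
    root user maps to. Assumption: uid 0 keeps the plain name.
    """
    if root_host_uid == 0:
        return base
    return f"{base}@uid={root_host_uid}"


# Two containers share one file; their root users map to host uids
# 1000 and 2000. Each container setting the xattr yields its own entry.
names = [host_xattr_name(uid) for uid in (1000, 2000)]
print(names)
# ['security.capability@uid=1000', 'security.capability@uid=2000']
```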

Except.... if you have a distribution root directory which is shared
by many containers, you would need to put the xattrs in the overlay
inodes. Worse, each time you launch a new container, with a new
subuid allocation, you will have to iterate over all files with
capabilities and do a copy-up operation on the xattrs in overlayfs.
So that's actually a bit of a disaster.

Note that we do keep compatibility with existing behavior. The host's xattr is visible inside any container for as long as the container root user doesn't set its own on that file, which then hides it. Does that address this concern?
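
The fallback rule can be sketched roughly as follows. This is an illustration under assumptions, not the kernel implementation: the file's host-side xattrs are modelled as a dict, the "@uid=<host uid>" naming is assumed, and capability_seen_by is a hypothetical helper.

```python
from typing import Dict, Optional

BASE = "security.capability"


def capability_seen_by(xattrs: Dict[str, bytes],
                       root_host_uid: int) -> Optional[bytes]:
    """Value a container sees when it reads the plain xattr name.

    xattrs models the file's xattrs as stored on the host.
    root_host_uid is the host uid the container's root maps to.
    """
    own = xattrs.get(f"{BASE}@uid={root_host_uid}")
    if own is not None:
        # The container root set its own value, which hides the host's.
        return own
    # Otherwise the host's value stays visible (existing behavior).
    return xattrs.get(BASE)


# Only the host has set a value: every container sees it.
file_xattrs = {BASE: b"host-caps"}
print(capability_seen_by(file_xattrs, 1000))

# The container with root at host uid 1000 sets its own value.
file_xattrs[f"{BASE}@uid=1000"] = b"container-caps"
print(capability_seen_by(file_xattrs, 1000))  # own value
print(capability_seen_by(file_xattrs, 2000))  # still the host's value
```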

So for distribution overlays, you will need to do things a different
way, which is to map the distro subdirectory so you know that the
capability with the global uid 0 should be used for the container
"root" uid, right?

So this hack of using is *only* useful when the
subcontainer root wants to create the privileged executable. You
still have to do things the other way.

So can we make perhaps the assertion that *either*:

exists, *or*

exists, but never both? And there BAR is exclusive to only one

In the current implementation BAR is visible inside of any instance that 'covers' this uid with the mapping range. Above example of appears as inside the container with root mapping to uid 1000 (@uid=0 is suppressed) but also appears as with root uid mapping to 900 (and range of at least 101).
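
The arithmetic behind that example, as a minimal sketch. It assumes a single contiguous uid mapping per container and the "@uid=<n>" naming; name_inside is a hypothetical helper, not a function from the patch.

```python
from typing import Optional

BASE = "security.capability"


def name_inside(xattr_host_uid: int, root_host_uid: int,
                range_len: int) -> Optional[str]:
    """Name under which a host-side xattr shows up inside a container.

    xattr_host_uid: host uid recorded in the xattr name on the host.
    root_host_uid:  host uid that the container's root maps to.
    range_len:      number of uids in the container's mapping.
    Returns None when the mapping does not cover xattr_host_uid.
    """
    inner_uid = xattr_host_uid - root_host_uid
    if inner_uid < 0 or inner_uid >= range_len:
        return None  # uid not covered by this container's mapping
    if inner_uid == 0:
        return BASE  # "@uid=0" is suppressed
    return f"{BASE}@uid={inner_uid}"


# xattr owned by host uid 1000, seen from three different containers.
print(name_inside(1000, 1000, 1))    # root maps to 1000: plain name
print(name_inside(1000, 900, 101))   # root maps to 900: appears as @uid=100
print(name_inside(1000, 900, 100))   # range too short: not visible
```

With root at 900, host uid 1000 is container uid 100, so the range must hold at least 101 uids (0 through 100) for the xattr to be covered.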

Otherwise, I suspect that the architecture is going to turn around and
bite us in the *ss eventually, because someone will want to do
something crazy and the solution will not be scalable.

Can you define what 'scalable' means for you in this context?
From what I can see, sharing a filesystem between multiple containers doesn't 'scale well' for virtualizing the xattrs, primarily because of the size limits on xattrs per file.