Allow preserving capabilities when changing user namespace of a process

From: Idan Yadgar
Date: Fri Sep 18 2020 - 08:30:59 EST


A process that changes its user namespace (via unshare or setns), or
that is created by clone with the CLONE_NEWUSER flag, has all
capabilities inside the new namespace and loses all of its capabilities
in the parent/previous user namespace.

This poses a problem, because some operations require a capability in a
user namespace other than the process's current one. The manual states
in several places that certain system calls require a capability in the
initial user namespace (for example, open_by_handle_at requires
CAP_DAC_READ_SEARCH in the initial user namespace). A process in any
user namespace other than the initial one cannot hold such a capability
unless it is owned by root.

So if a process with uid != 0 holds CAP_DAC_READ_SEARCH in the initial
user namespace, creates a new user namespace (as part of a container,
for example), and then tries to use open_by_handle_at inside it, the
call cannot succeed.
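
To make the behaviour concrete, here is a minimal sketch in C, assuming
a Linux system with /proc mounted. The helper names read_cap_eff and
cap_eff_in_new_userns are made up for this illustration. A forked child
unshares its user namespace and reports the CapEff mask it sees in
/proc/self/status. That mask is relative to the child's new namespace
only, so even a full mask does not satisfy a check made against the
initial user namespace.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parse the CapEff line of /proc/self/status. Returns 0 on success. */
static int read_cap_eff(unsigned long long *caps)
{
    char line[256];
    int found = -1;
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "CapEff: %llx", caps) == 1) {
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}

/*
 * Fork a child that unshares its user namespace and sends back the
 * CapEff mask it sees afterwards. Returns 0 on success, -1 if the
 * fork, the unshare (for example when unprivileged user namespaces
 * are disabled) or the read fails.
 */
static int cap_eff_in_new_userns(unsigned long long *caps)
{
    int fds[2];
    int status;
    pid_t pid;
    ssize_t n;

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: enter a new user namespace, then report CapEff. */
        unsigned long long c;

        close(fds[0]);
        if (unshare(CLONE_NEWUSER) < 0 || read_cap_eff(&c) < 0)
            _exit(1);
        if (write(fds[1], &c, sizeof(c)) != (ssize_t)sizeof(c))
            _exit(1);
        _exit(0);
    }

    /* Parent: collect the mask and reap the child. */
    close(fds[1]);
    n = read(fds[0], caps, sizeof(*caps));
    close(fds[0]);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (n != (ssize_t)sizeof(*caps) ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return 0;
}
```

Calling read_cap_eff in the parent and cap_eff_in_new_userns for the
child, then printing both masks, shows the change across the unshare.
Whatever the child reports, those bits are only honoured for resources
governed by its new namespace.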

One way to solve this would be to allow a task (via prctl or some other
mechanism) to preserve its capabilities in a given user namespace, even
when it is not a member of that namespace.

I would like to hear some thoughts about this issue and the proposed solution.