Re: [PATCH v2 0/4] Introduce user namespace capabilities

From: Josef Bacik
Date: Mon Jun 10 2024 - 16:12:53 EST


On Sun, Jun 09, 2024 at 03:43:33AM -0700, Jonathan Calmels wrote:
> This patch series introduces a new user namespace capability set, as
> well as some plumbing around it (i.e. sysctl, secbit, lsm support).
>
> The first patch goes over the motivations for this as well as prior art.
>
> In summary, while user namespaces are a great success today in that they
> avoid running a lot of code as root, they also substantially expand the
> kernel's attack surface, which attackers often abuse.
> Methods exist to limit the creation of such namespaces [1]; however,
> application developers often need to assume that user namespaces are
> available for various tasks such as sandboxing. Thus, instead of
> restricting the creation of user namespaces, we offer ways for userspace
> to limit the capabilities granted to them.
>
> Why a new capability set and not something specific to the userns (e.g.
> ioctl_ns)?
>
> 1. We can't really expect userspace to patch every single callsite
> and opt in to this new security mechanism.
>
> 2. We don't necessarily want policies enforced at said callsites.
> For example, a service like systemd-machined or a PAM session needs to
> be able to place restrictions on any namespace spawned under it.
>
> 3. We would need to come up with inheritance rules, querying
> capabilities, etc. At this point we're just reinventing capability
> sets.
>
> 4. We can easily define interactions between capability sets, thus
> helping with adoption (patch 2 is an example of this).
>
> Some examples of how this could be leveraged in userspace:
>
> - Prevent a user from getting CAP_NET_ADMIN in user namespaces under SSH:
> echo "auth optional pam_cap.so" >> /etc/pam.d/sshd
> echo "!cap_net_admin $USER" >> /etc/security/capability.conf
> capsh --secbits=$((1 << 8)) -- -c /usr/sbin/sshd
>
> - Prevent containers from ever getting CAP_DAC_OVERRIDE:
> systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \
> -p SecureBits=userns-strict-caps \
> /usr/bin/dockerd
> systemd-run -p UserNSCapabilities=~CAP_DAC_OVERRIDE \
> /usr/bin/incusd
>
> - If the kernel could be vulnerable to CAP_SYS_RAWIO exploits, mask it out:
> sysctl -w cap_bound_userns_mask=0x1fffffdffff
>
> - Drop CAP_SYS_ADMIN for this shell and all the user namespaces below it:
> bwrap --unshare-user --cap-drop CAP_SYS_ADMIN /bin/sh
>

Where are the tests for this patchset? I see you updated the bpf tests for the
bpf lsm bits, but there's nothing to validate this new behavior or exercise the
new ioctl you've added. Thanks,

Josef