Re: [PATCH v3 1/2] exec: add PR_HIDE_SELF_EXE prctl

From: Giuseppe Scrivano
Date: Mon Jan 23 2023 - 17:55:55 EST


"Colin Walters" <walters@xxxxxxxxxx> writes:

> On Mon, Jan 23, 2023, at 2:21 PM, Giuseppe Scrivano wrote:
>> "Colin Walters" <walters@xxxxxxxxxx> writes:
>>
>>> On Fri, Jan 20, 2023, at 5:25 AM, Giuseppe Scrivano wrote:
>>>> This patch adds a new prctl called PR_HIDE_SELF_EXE which allows
>>>> processes to hide their own /proc/*/exe file. When this prctl is
>>>> used, every access to /proc/*/exe for the calling process will
>>>> fail with ENOENT.
>>>
>>> How about a mount option for procfs like `mount -t procfs procfs /proc -o rw,nosuid,nodev,magiclink-no-xdev`
>>>
>>> Where `magiclink-no-xdev` would cause all magic links to fail to cross a pid namespace or so?
>>
>> wouldn't that break also stuff like "/proc/self/fd/$FD" after you join a
>> different PID namespace?
>
> Ah, right. Hmm. But building on the reply to
> https://lwn.net/Articles/920876/
> maybe an opt-in flag like `-o magiclink-restricted=/proc/<pid>/ns/pid`
> that required callers to have CAP_SYS_ADMIN in the referenced
> namespace? Then things like crun/runc would open a fd for
> `/proc/self/ns/pid` when they start (usually, this is the init ns),
> then pass the reference to that fd into magiclink-restricted.
>
> There may be a more elegant userspace API here; I think my overall point is reiterating what I mentioned in
> https://github.com/containers/crun/pull/1105#issuecomment-1360085059
>
> "A minor worry I have with both is that any namespacing-based approach
> that controls visibility in /proc runs the risk of someone adding a
> new way to get to the binary; maybe it's something like us having a fd
> for ourselves open under /proc/pid/fd/ or so."
>
> So instead of hiding just /proc/self/exe, we add some sort of API that aims to restrict all other magic links that may be added in the future.
>
> The history of map_files is interesting here; it looks like https://github.com/torvalds/linux/commit/bdb4d100afe9818aebd1d98ced575c5ef143456c introduces a comment that says
>
> "* Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
> * symlinks may be used to bypass permissions on ancestor directories in the
> * path to the file in question."
>
> yet isn't there an inconsistency here in not applying the same restrictions on /proc/self/exe?
>
> Or another way to look at this is that if we were to add some sort of
> API like this on /proc, we'd also change the proc_maps code to also
> honor it *instead* if present instead of limiting to the init ns?

I realize it seems like a one-off fix, but it is done only for backward
compatibility.

Other paths under /proc/self/map_files require CAP_SYS_ADMIN in the
initial user namespace, or have CAP_CHECKPOINT_RESTORE in the user
namespace. Sure, it is not future-proof, but it would look weird if
after CVE-2019-19814 there will be more ways to access files from the
host without requiring some capabilities.

One problem with having it as a mount option for procfs (so for the
/proc in the container) is that then the container runtime has the
the burden to verify that there are no other procfs mounts accessible
from the mount namespace as well as to check that the mount option is
available before attempting to use it from the container. Potentially
it could also be a TOCTOU race since a proc mount could appear later,
e.g. parent mount with shared propagation. If the runtime finds out
that another procfs is reachable, then it would have to fallback to
re-exec itself much later than what we do today.

With the prctl a runtime would just need to do the following and live
happily ever after:

__attribute__ ((constructor)) static void hide_self_exe (void)
{
if (prctl(PR_SET_HIDE_SELF_EXE, 1, 0, 0, 0) == 0)
return;

/* ...reexec as we do today... */
}

and we won't have to worry about what mount options are supported or
used by proc.