[BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn
From: Ellie Revves
Date: Sat Dec 22 2018 - 12:20:53 EST
Hi,
First off, allow me to say that this is my first time ever writing
to such a mailing list. If something is unclear or you need more
information, just let me know.
I am writing to this list in the hope of seeing this change reverted.
The Linux kernel has always said it would avoid breaking userspace as
much as possible, and yet that is what happened here. I was therefore
very surprised when my perfectly working containers on systemd-nspawn,
which makes use of userns by default, stopped working from one day to
the next, until I identified the cause as kernel >= 4.18. This container
is in production, hence the annoyance. From one day to the next, the
container started failing with strange problems:
* nginx, dovecot, postgresql, and postfix complained about getting
permission denied on /dev/null, even though the node looked perfectly
normal to me, with the correct permissions
* /var was also acting very strangely, producing a lot of permission
denied or operation not supported messages
* I could not delete a file in /var that my user had the right to
create, write to and read; I needed root to do it
Here is the pull request that was made to systemd, along with a small
amount of talk around the issue:
https://github.com/systemd/systemd/pull/9483
It was ultimately decided among the systemd folks to bail out of the
issue, as shown in the news entry for systemd 240:
        * KERNEL API BREAKAGE: Linux kernel 4.18 changed behaviour
          regarding mknod() handling in user namespaces. Previously
          mknod() would always fail with EPERM in user namespaces. Since
          4.18 mknod() will succeed but device nodes generated that way
          cannot be opened, and attempts to open them result in EPERM.
          This breaks the "graceful fallback" logic in systemd's
          PrivateDevices= sand-boxing option. This option is implemented
          defensively, so that when systemd detects it runs in a
          restricted environment (such as a user namespace, or an
          environment where mknod() is blocked through seccomp or absence
          of CAP_SYS_MKNOD) where device nodes cannot be created the
          effect of PrivateDevices= is bypassed (following the logic that
          2nd-level sand-boxing is not essential if the system systemd
          runs in is itself already sand-boxed as a whole). This logic
          breaks with 4.18 in container managers where user namespacing
          is used: suddenly PrivateDevices= succeeds setting up a private
          /dev/ file system containing devices nodes - but when these are
          opened they don't work.

          At this point it is recommended that container managers
          utilizing user namespaces that intend to run systemd in the
          payload explicitly block mknod() with seccomp or similar, so
          that the graceful fallback logic works again.

          We are very sorry for the breakage and the requirement to
          change container configurations for newer kernels. It's purely
          caused by an incompatible kernel change. The relevant kernel
          developers have been notified about this userspace breakage
          quickly, but they chose to ignore it.
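To make the behaviour described in that news entry easier to reproduce,
here is a minimal sketch of a probe that separates the three cases:
mknod() is refused, mknod() succeeds but the node cannot be opened, or
the node is fully usable. The function name probe_device_nodes and the
file name probe-null are made up for illustration; this is not systemd's
actual fallback code. The news entry says the open fails with EPERM,
while the services above reported "permission denied", so the sketch
simply reports whichever errno it gets instead of assuming one.

```python
import errno
import os
import stat
import tempfile


def probe_device_nodes(directory):
    """Classify how device nodes behave in `directory`.

    Returns a (stage, errno_name) tuple:
      ("mknod-failed", name)  mknod() itself was refused
      ("open-failed", name)   the node was created but cannot be opened
      ("usable", None)        the node was created and opened
    """
    # Hypothetical file name, used only for this probe.
    path = os.path.join(directory, "probe-null")
    try:
        # Character device 1:3 is /dev/null on Linux.
        os.mknod(path, stat.S_IFCHR | 0o600, os.makedev(1, 3))
    except OSError as exc:
        return ("mknod-failed", errno.errorcode.get(exc.errno, str(exc.errno)))
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as exc:
        return ("open-failed", errno.errorcode.get(exc.errno, str(exc.errno)))
    else:
        os.close(fd)
        return ("usable", None)
    finally:
        # The node exists whenever we get here, so always clean it up.
        os.unlink(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(probe_device_nodes(tmp))
```

Going by the news entry, a result of "open-failed" inside a userns
container would correspond to the new 4.18 behaviour, and "mknod-failed"
to the older behaviour that the graceful fallback relies on. Note that a
temporary directory on a filesystem mounted with nodev will also report
"open-failed", so the directory to probe should be chosen with care.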
Here's an email that was sent to lkml about the subject:
https://lkml.org/lkml/2018/7/5/742
I also link this message, quoting the end of it:
https://lkml.org/lkml/2018/7/5/701
It has never been the case that mknod on a device node will guarantee
that you even can open the device node. The applications that regress
are broken. It doesn't mean we shouldn't be bug compatible, but we darn
well should document very clearly the bugs we are being bug compatible with.
I am of the opinion that it is a kernel bug, and I quote someone from
the systemd IRC channel:
ewb said applications were broken. But the rule is, if userspace breaks,
its a bug. The kernel *has* to revert it. And honestly, this change
doesn't make much sense. You can set nodev yourself but then you know
mknod will not allow you to open the object. Here, the kernel does it
without your knowledge
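The distinction drawn in that quote, between a nodev you set yourself
and one applied without your knowledge, can be partly checked from
userspace. The sketch below reads the mount flags of the filesystem
holding a path and reports whether nodev is set. The helper name
mounted_nodev is made up for illustration. As far as I understand,
SB_I_NODEV is an internal superblock flag and is not necessarily
reported this way, so a False result here does not prove that device
nodes on that filesystem can be opened.

```python
import os


def mounted_nodev(path):
    """Return True if the filesystem holding `path` reports the nodev flag."""
    return bool(os.statvfs(path).f_flag & os.ST_NODEV)


if __name__ == "__main__":
    # Example paths taken from the symptoms described earlier in this mail.
    for candidate in ("/dev", "/var", "/tmp"):
        if os.path.exists(candidate):
            print(candidate, "nodev" if mounted_nodev(candidate) else "dev allowed")
```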
Also, it seems that if this change is reverted, things that were fixed
to work around the breakage will not break again; they should simply go
back to their previous way of working. I understand there may be
security reasons why this change was made in the first place, but is it
really so big a problem? I can mknod arbitrary devices in userns and
open them as userns root. But my point is, several things broke. My
*working* stuff was broken from one day to the next.
I am not trying to pick a fight. I want to understand the reasoning
behind this change in the first place, and I'm simply making an attempt
at getting it reverted, because it is true that I don't much fancy
blocking the mknod() syscall in every template unit on every machine we
administer here, and staying on kernel < 4.18 is not a good solution
either.
I would also like to be personally CC'ed on any comments or answers
posted to this mailing list in response to this message.
Thanks