Re: [BREAKAGE] Since 4.18, kernel sets SB_I_NODEV implicitly on userns mounts, breaking systemd-nspawn
From: Gabriel C
Date: Sat Dec 22 2018 - 16:00:12 EST
Added some people to CC who might want to see this.
On Sat, 22 Dec 2018 at 19:14, Ellie Reeves <ellierevves@xxxxxxxxx> wrote:
>
> Hi,
> first off, allow me to express that this is my first time ever writing
> on such a mailing list, and that if something is unclear or you would
> need more information, just let me know.
> I am writing to this list in the hope of seeing this change reverted. The
> Linux kernel has always said it would avoid breaking userspace as much as
> possible, and yet that is what happened here. I was therefore very
> surprised when my perfectly working containers on systemd-nspawn, which
> makes use of userns by default, stopped working from one day to the next,
> until I identified the problem as being kernel >= 4.18. This container is
> in production, hence the annoyance. From one day to the next the
> container started failing with strange problems:
>
> * nginx, dovecot, postgresql, and postfix complained about getting
> permission denied on /dev/null, even though it looked perfectly normal
> to me, with the correct permissions
> * /var was also acting very strangely, producing a lot of permission
> denied or operation not supported messages
> * I could not delete a file in /var that my user had the right to create,
> write to and read; I needed root
>
> Here is the pull request that was made to systemd, along with a small
> amount of talk around the issue:
>
> https://github.com/systemd/systemd/pull/9483
>
> It was ultimately decided among the systemd folks to bail out of the
> issue, as shown in the news entry for systemd 240:
>
> * KERNEL API BREAKAGE: Linux kernel 4.18 changed behaviour regarding
> mknod() handling in user namespaces. Previously mknod() would always
> fail with EPERM in user namespaces. Since 4.18 mknod() will succeed
> but device nodes generated that way cannot be opened, and attempts to
> open them result in EPERM. This breaks the "graceful fallback" logic
> in systemd's PrivateDevices= sand-boxing option. This option is
> implemented defensively, so that when systemd detects it runs in a
> restricted environment (such as a user namespace, or an environment
> where mknod() is blocked through seccomp or absence of CAP_SYS_MKNOD)
> where device nodes cannot be created the effect of PrivateDevices= is
> bypassed (following the logic that 2nd-level sand-boxing is not
> essential if the system systemd runs in is itself already sand-boxed
> as a whole). This logic breaks with 4.18 in container managers where
> user namespacing is used: suddenly PrivateDevices= succeeds setting
> up a private /dev/ file system containing device nodes, but when
> these are opened they don't work.
>
> At this point it is recommended that container managers utilizing
> user namespaces that intend to run systemd in the payload explicitly
> block mknod() with seccomp or similar, so that the graceful fallback
> logic works again.
>
> We are very sorry for the breakage and the requirement to change
> container configurations for newer kernels. It's purely caused by an
> incompatible kernel change. The relevant kernel developers have been
> notified about this userspace breakage quickly, but they chose to
> ignore it.
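
For anyone on CC who wants to check which of the behaviours described in
that news entry a given container shows, here is a minimal sketch. It is
not systemd's actual detection logic; the function name probe_mknod and
the file name probe-null are made up for illustration. It only assumes
that 1:3 is the /dev/null device number and that a failing mknod() or
open() raises OSError in Python.

```python
import os
import stat
import tempfile


def probe_mknod(directory):
    """Classify how character-device creation behaves in `directory`.

    Returns one of:
      "mknod-blocked"  mknod() itself failed (e.g. no capability, seccomp,
                       or the pre-4.18 behaviour in a user namespace)
      "open-blocked"   mknod() succeeded but open() failed (what the news
                       entry describes for 4.18+; a nodev mount looks the same)
      "usable"         the node was created and could be opened
    """
    # Hypothetical file name, used only for this probe.
    path = os.path.join(directory, "probe-null")
    # 1:3 is the major:minor number of /dev/null on Linux.
    dev = os.makedev(1, 3)
    try:
        os.mknod(path, stat.S_IFCHR | 0o600, dev)
    except OSError:
        # Nothing was created, so there is nothing to clean up.
        return "mknod-blocked"
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:
        return "open-blocked"
    else:
        os.close(fd)
        return "usable"
    finally:
        # Remove the probe node whether or not it could be opened.
        os.unlink(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(probe_mknod(tmp))
```

A fallback that only looks at whether mknod() fails treats "open-blocked"
the same as "usable", which is the case the news entry is about. Note
that the result depends on the file system the directory lives on, so
run it against the mount you actually care about.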
>
> Here's an email that was sent to lkml about the subject:
>
> https://lkml.org/lkml/2018/7/5/742
>
> I also link this, quoting the end of it:
>
> https://lkml.org/lkml/2018/7/5/701
>
> It has never been the case that mknod on a device node will guarantee
> that you even can open the device node. The applications that regress
> are broken. It doesn't mean we shouldn't be bug compatible, but we darn
> well should document very clearly the bugs we are being bug compatible with.
>
> I'm of the opinion that it is a kernel bug, and I quote someone from the
> systemd IRC channel:
>
> ewb said applications were broken. But the rule is, if userspace breaks,
> it's a bug. The kernel *has* to revert it. And honestly, this change
> doesn't make much sense. You can set nodev yourself, but then you know
> mknod will not allow you to open the object. Here, the kernel does it
> without your knowledge.
>
> Also, it seems that if this change is reverted, things that were fixed
> to work around the breakage will not be broken again;
> they should simply go back to their previous way of working. I
> understand there may be a security reason why this change was made in the
> first place, but is it really so big a problem? I can mknod
> arbitrary devices in userns and open them as userns root. But my point
> is, several things broke. My *working* stuff was broken from one day to
> the next.
>
> I am not trying to pick a fight. I want to understand the reasoning
> behind this change in the first place, and I'm simply making an attempt
> at getting it reverted, because it is true that I don't much fancy
> blocking the mknod() syscall in every template unit on every machine we
> administer here, and staying on kernel < 4.18 is not a good
> solution either.
>
> I would also like to be personally CC'ed on any comments or answers posted
> to this mailing list in response to this message.
>
> Thanks