Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

From: Topi Miettinen
Date: Thu Oct 22 2020 - 18:24:41 EST


On 22.10.2020 23.02, Kees Cook wrote:
> On Thu, Oct 22, 2020 at 01:39:07PM +0300, Topi Miettinen wrote:
>> But I think SELinux has a more complete solution (execmem) which can track
>> the pages better than is possible with seccomp solution which has a very
>> narrow field of view. Maybe this facility could be made available to
>> non-SELinux systems, for example with prctl()? Then the in-kernel MDWX could
>> allow mprotect(PROT_EXEC | PROT_BTI) in case the backing file hasn't been
>> modified, the source filesystem isn't writable for the calling process and
>> the file descriptor isn't created with memfd_create().
>
> Right. The problem here is that systemd is attempting to mediate a
> state change using only syscall details (i.e. with seccomp) instead of
> a stateful analysis. Using a MAC is likely the only sane way to do that.
> SELinux is a bit difficult to adjust "on the fly" the way systemd would
> like to do things, and the more dynamic approach seen with SARA[1] isn't
> yet in the kernel.

SARA looks interesting. What is missing is a prctl() to enable all W^X protections irrevocably for the current process; systemd could then enable it for services with MemoryDenyWriteExecute=yes.

I also didn't see specific measures against memfd_create() or file system W&X, but perhaps those can be added later. pkey_mprotect() may not be handled either, unless it uses the same LSM hook as mprotect().
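
As a rough illustration of the stateful checks discussed in this thread (backing file unmodified, source file system not writable by the caller, descriptor not from memfd_create()), a decision function might look like the sketch below. The struct and function names are hypothetical and do not correspond to any existing kernel, SELinux or SARA interface; the sketch only shows which pieces of state such a check would need.

```c
#include <stdbool.h>

/* Hypothetical per-mapping state; none of these names exist in the kernel. */
struct exec_request {
    bool backing_file_modified;  /* file was changed since it was mapped */
    bool fs_writable_by_caller;  /* caller could write to the source file system */
    bool fd_from_memfd_create;   /* mapping comes from a memfd_create() descriptor */
};

/*
 * Returns true if adding PROT_EXEC (for example PROT_EXEC | PROT_BTI)
 * to the mapping would be allowed under the three conditions above.
 */
static bool may_add_exec(const struct exec_request *req)
{
    if (req->backing_file_modified)
        return false;
    if (req->fs_writable_by_caller)
        return false;
    if (req->fd_from_memfd_create)
        return false;
    return true;
}
```

None of these three inputs is visible to a seccomp filter, which only sees the raw syscall number and arguments.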

> Trying to enforce memory W^X protection correctly
> via seccomp isn't really going to work well, as far as I can see.

Not in general, but I think it can work well in the context of system services. There you can ensure that, for a specific service, memfd_create() is blocked by seccomp and the file systems are W^X because of mount namespaces etc., so there should not be any means to construct arbitrary executable pages.
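
For reference, a minimal sketch of what the seccomp side of such a setup could look like, assuming a little-endian 64-bit Linux target. This is not systemd's actual filter: mdwe_filter and install_mdwe_filter are made-up names, and a real filter would also need to validate seccomp_data.arch and handle further calls such as mmap(), pkey_mprotect() and memfd_create().

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Deny mprotect() whenever PROT_EXEC is requested; allow everything else. */
static struct sock_filter mdwe_filter[] = {
    /* [0] load the syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] not mprotect(): skip ahead to the ALLOW at [5] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mprotect, 0, 3),
    /* [2] load the low 32 bits of the prot argument (little-endian assumed) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[2])),
    /* [3] PROT_EXEC set: fall through to [4], otherwise jump to [5] */
    BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, PROT_EXEC, 0, 1),
    /* [4] refuse with EPERM */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
    /* [5] allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Installs the filter for the calling thread; returns 0 on success, -1 on error. */
static int install_mdwe_filter(void)
{
    struct sock_fprog prog = {
        .len = sizeof(mdwe_filter) / sizeof(mdwe_filter[0]),
        .filter = mdwe_filter,
    };

    /* Required so that an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

The filter only sees the prot bits, so it cannot tell whether the target pages were already executable. That is why an mprotect(PROT_EXEC | PROT_BTI) call on an existing executable mapping is refused just like an attempt to make writable memory executable.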

-Topi