Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

From: Topi Miettinen
Date: Mon Oct 26 2020 - 12:40:23 EST


On 26.10.2020 18.24, Dave Martin wrote:
On Wed, Oct 21, 2020 at 10:44:46PM -0500, Jeremy Linton via Libc-alpha wrote:
Hi,

There is a problem with glibc+systemd on BTI enabled systems. Systemd
has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
caught by the seccomp filter, resulting in service failures.

So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
This is obviously not desirable.

Various changes have been suggested, replacing the mprotect with mmap calls
having PROT_BTI set on the original mapping, re-mmapping the segments,
implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
and various modification to seccomp to allow particular mprotect cases to
bypass the filters. In each case there seems to be an undesirable attribute
to the solution.

So, whats the best solution?

Unrolling this discussion a bit, this problem comes from a few sources:

1) systemd is trying to implement a policy that doesn't fit SECCOMP
syscall filtering very well.

2) The program is trying to do something not expressible through the
syscall interface: really the intent is to set PROT_BTI on the page,
with no intent to set PROT_EXEC on any page that didn't already have it
set.


This limitation of mprotect() was known when I originally added PROT_BTI,
but at that time we weren't aware of a clear use case that would fail.


Would it now help to add something like:

int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
{
int ret = -EINVAL;
mmap_write_lock(current->mm);
if (all vmas in [addr .. addr + len) have
their mprotect flags set to old_flags) {

ret = mprotect(addr, len, new_flags);
}

mmap_write_unlock(current->mm);
return ret;
}


libc would now be able to do

mchangeprot(addr, len, PROT_EXEC | PROT_READ,
PROT_EXEC | PROT_READ | PROT_BTI);

while systemd's MDWX filter would reject the call if

(new_flags & PROT_EXEC) &&
(!(old_flags & PROT_EXEC) || (new_flags & PROT_WRITE)



This won't magically fix current code, but something along these lines
might be better going forward.


Thoughts?

Looks good to me.

-Topi