The 10/21/2020 22:44, Jeremy Linton wrote:
There is a problem with glibc+systemd on BTI enabled systems. Systemd
has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
caught by the seccomp filter, resulting in service failures.
So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
This is obviously not desirable.
Various changes have been suggested, replacing the mprotect with mmap calls
having PROT_BTI set on the original mapping, re-mmapping the segments,
implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
and various modification to seccomp to allow particular mprotect cases to
bypass the filters. In each case there seems to be an undesirable attribute
to the solution.
So, whats the best solution?
the easy fix in glibc is to ignore mprotect(PROT_BTI|PROT_EXEC)
failures, so programs work with seccomp filters, but bti gets
disabled (it's unreasonable to expect bti protection if mprotect
is filtered). it will be a nasty silent failure though.
and i'm also considering a fix that re-mmaps the executable
segment with PROT_BTI instead of mprotect since that is not
filtered. unfortunately the main exe is mmaped by the kernel
without PROT_BTI and the libc does not have the fd to re-mmap.
(bti can be left off for the main exe if mprotect fails and
later we can teach the kernel to add bti there.) currently
this is not a complete fix so i'm a bit hesitant about it.
as for a kernel side fix: if there is a way to only filter
PROT_EXEC mprotect on mappings that are not yet PROT_EXEC
that would solve this problem (but likely needs new syscall
or seccomp capability).