On Mon, Oct 26, 2020 at 04:57:55PM +0000, Szabolcs Nagy via Libc-alpha wrote:
The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
Unrolling this discussion a bit, this problem comes from a few sources:
1) systemd is trying to implement a policy that doesn't fit SECCOMP
syscall filtering very well.
2) The program is trying to do something not expressible through the
syscall interface: really the intent is to set PROT_BTI on the page,
with no intent to set PROT_EXEC on any page that didn't already have it
set.
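To make point 2 concrete, here is a minimal C sketch of what a filter can see. It is a model only, not systemd's actual BPF program: mdwx_filter_denies() and bti_enable_prot() are hypothetical names, and the rule "deny any mprotect() whose prot argument includes PROT_EXEC" is an assumed simplification of the policy. The PROT_BTI fallback uses the arm64 value so the sketch builds on other targets.

```c
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10	/* arm64 value; fallback so the sketch builds elsewhere */
#endif

/*
 * Hypothetical model of the filter's decision. A seccomp filter sees only
 * the syscall arguments, so it can test the requested prot bits but not
 * the protections the pages already have.
 */
static bool mdwx_filter_denies(int prot)
{
	return (prot & PROT_EXEC) != 0;
}

/*
 * The prot value that has to be passed to add BTI to a mapping that is
 * already readable and executable: mprotect() replaces the whole set, so
 * PROT_EXEC must be repeated.
 */
static int bti_enable_prot(void)
{
	return PROT_READ | PROT_EXEC | PROT_BTI;
}
```

Under these assumptions the BTI request is rejected even though it adds no execute permission that was not already there.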
This limitation of mprotect() was known when I originally added PROT_BTI,
but at that time we weren't aware of a clear use case that would fail.
Would it now help to add something like:
int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
{
	int ret = -EINVAL;

	mmap_write_lock(current->mm);
	if (all vmas in [addr .. addr + len) have
	    their mprotect flags set to old_flags) {
		ret = mprotect(addr, len, new_flags);
	}
	mmap_write_unlock(current->mm);

	return ret;
}
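The check-then-change semantics of the pseudocode above can be modelled in plain userspace C. This is only a sketch under stated assumptions: struct vma_model and model_mchangeprot() are hypothetical, the "VMAs" are just array entries, and the range is assumed to cover whole entries, so the splitting that a real mprotect() performs is ignored.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical userspace stand-in for a VMA: [start, end) with one prot value. */
struct vma_model {
	uintptr_t start;
	uintptr_t end;
	int prot;
};

/*
 * Model of the proposed call: change protections only if every entry
 * overlapping [addr, addr + len) currently has exactly old_flags.
 * Nothing is modified when the check fails.
 */
static int model_mchangeprot(struct vma_model *vmas, size_t n,
			     uintptr_t addr, size_t len,
			     int old_flags, int new_flags)
{
	uintptr_t end = addr + len;
	size_t i;

	/* First pass: verify the precondition on all overlapping entries. */
	for (i = 0; i < n; i++) {
		if (vmas[i].start < end && vmas[i].end > addr &&
		    vmas[i].prot != old_flags)
			return -EINVAL;
	}

	/* Second pass: apply new_flags, as mprotect() would. */
	for (i = 0; i < n; i++) {
		if (vmas[i].start < end && vmas[i].end > addr)
			vmas[i].prot = new_flags;
	}

	return 0;
}
```

The two passes stand in for the lock in the pseudocode: in the model, either the whole range changes or none of it does.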
if more prot flags are introduced then the exact
match for old_flags may be restrictive and currently
there is no way to query these flags to figure out
how to toggle one prot flag in a future proof way,
so i don't think this solves the issue completely.
Ack -- I illustrated this model because it makes the seccomp filter's
job easy, but it does have limitations.
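As a rough illustration of why the filter's job becomes easy: with both old_flags and new_flags visible as syscall arguments, the decision reduces to a test on two integers. The predicate below is a hypothetical sketch in plain C, not BPF and not any existing filter; filter_allows_mchangeprot() is an invented name, and "never add PROT_EXEC to memory that did not have it" is an assumed reading of the policy.

```c
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10	/* arm64 value; fallback so the sketch builds elsewhere */
#endif

/*
 * Hypothetical filter rule for the proposed call: allow the change unless
 * PROT_EXEC would be newly added. The kernel would be responsible for
 * checking that old_flags matches the real state of the pages.
 */
static bool filter_allows_mchangeprot(int old_flags, int new_flags)
{
	return !(new_flags & PROT_EXEC) || (old_flags & PROT_EXEC);
}
```

With this rule, adding PROT_BTI to an already executable mapping passes, while turning a read/write mapping executable does not.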
i think we might need a new api, given that aarch64
now has PROT_BTI and PROT_MTE while existing code
expects RWX only, but i don't know what api is best.
An alternative option would be a call that sets / clears chosen
flags and leaves others unchanged.
The trouble with that is that the MDWX policy (systemd's
MemoryDenyWriteExecute=) then becomes hard to implement again.
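For reference, the effect of the set / clear alternative on a protection value can be written as one expression. This is a sketch only: apply_prot_delta() is a hypothetical helper, not an existing or proposed interface, and it models just the arithmetic on the flag bits.

```c
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10	/* arm64 value; fallback so the sketch builds elsewhere */
#endif

/*
 * Hypothetical helper: compute the protections that result from clearing
 * the bits in 'clear', setting the bits in 'set', and leaving every other
 * bit of 'cur' unchanged.
 */
static int apply_prot_delta(int cur, int set, int clear)
{
	return (cur & ~clear) | set;
}
```

For example, apply_prot_delta(PROT_READ | PROT_EXEC, PROT_BTI, 0) yields PROT_READ | PROT_EXEC | PROT_BTI without the caller having to know or restate the existing bits.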
But policies might be best set via another route, such as a prctl,
rather than being implemented completely in a seccomp filter.
Cheers
---Dave