Re: BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

From: Dave Martin
Date: Wed Nov 04 2020 - 07:19:04 EST


On Thu, Oct 29, 2020 at 11:02:22AM +0000, Catalin Marinas via Libc-alpha wrote:
> On Tue, Oct 27, 2020 at 02:15:22PM +0000, Dave P Martin wrote:
> > I also wonder whether we actually care whether the pages are marked
> > executable or not here; probably the flags can just be independent. This
> > rather depends on whether the how the architecture treats the BTI (a.k.a
> > GP) pagetable bit for non-executable pages. I have a feeling we already
> > allow PROT_BTI && !PROT_EXEC through anyway.
> >
> >
> > What about a generic-ish set/clear interface that still works by just
> > adding a couple of PROT_ flags:
> >
> > switch (flags & (PROT_SET | PROT_CLEAR)) {
> > case PROT_SET: prot |= flags; break;
> > case PROT_CLEAR: prot &= ~flags; break;
> > case 0: prot = flags; break;
> >
> > default:
> > return -EINVAL;
> > }
> >
> > This can't atomically set some flags while clearing some others, but for
> > simple stuff it seems sufficient and shouldn't be too invasive on the
> > kernel side.
> >
> > We will still have to take the mm lock when doing a SET or CLEAR, but
> > not for the non-set/clear case.
> >
> >
> > Anyway, libc could now do:
> >
> > mprotect(addr, len, PROT_SET | PROT_BTI);
> >
> > with much the same effect as your PROT_BTI_IF_X.
> >
> >
> > JITting or breakpoint setting code that wants to change the permissions
> > temporarily, without needing to know whether PROT_BTI is set, say:
> >
> > mprotect(addr, len, PROT_SET | PROT_WRITE);
> > *addr = BKPT_INSN;
> > mprotect(addr, len, PROT_CLEAR | PROT_WRITE);
>
> The problem with this approach is that you can't catch
> PROT_EXEC|PROT_WRITE mappings via seccomp. So you'd have to limit it to
> some harmless PROT_ flags only. I don't like this limitation, nor the
> PROT_BTI_IF_X approach.

Ack; this is just one flavour of interface, and every approach seems to
have some shortcomings.

> The only generic solutions I see are to either use a stateful filter in
> systemd or pass the old state to the kernel in a cmpxchg style so that
> seccomp can check it (I think you suggest this at some point).

The "cmpxchg" option has the disadvantage that the caller needs to know
the original permissions. It seems that glibc is prepared to work
around this, but it won't always be feasible in ancillary /
instrumentation code or libraries.

IMHO it would be preferable to apply a policy to mmap/mprotect in the
kernel proper rather then BPF being the only way to do it -- in any
case, the required checks seem to be out of the scope of what can be
done efficiently (or perhaps at all) in a syscall filter.

> The latter requires a new syscall which is not something we can address
> as a quick, back-portable fix here. If systemd cannot be changed to use
> a stateful filter for w^x detection, my suggestion is to go for the
> kernel setting PROT_BTI on the main executable with glibc changed to
> tolerate EPERM on mprotect(). I don't mind adding an AT_FLAGS bit if
> needed but I don't think it buys us much.

I agree, this seems the best short-term approach.

> Once the current problem is fixed, we can look at a better solution
> longer term as a new syscall.

Agreed, I think if we try to rush the addition of new syscalls, the
chance of coming up with a bad design is high...

Cheers
---Dave