Re: [RFC PATCH] fs/compat_binfmt_elf: Introduce sysctl to disable compat ELF loader

From: Kees Cook
Date: Thu Sep 16 2021 - 11:56:41 EST


On Thu, Sep 16, 2021 at 04:13:37PM +0100, Will Deacon wrote:
> Hi Arnd,
>
> On Thu, Sep 16, 2021 at 04:46:15PM +0200, Arnd Bergmann wrote:
> > On Thu, Sep 16, 2021 at 3:18 PM Will Deacon <will@xxxxxxxxxx> wrote:
> > >
> > > Distributions such as Android which support a mixture of 32-bit (compat)
> > > and 64-bit (native) tasks necessarily ship with the compat ELF loader
> > > enabled in their kernels. However, as time goes by, an ever-increasing
> > > proportion of userspace consists of native applications and in some cases
> > > 32-bit capabilities are starting to be removed from the CPUs altogether.
> > >
> > > Inevitably, this means that the compat code becomes somewhat of a
> > > maintenance burden, receiving less testing coverage and exposing an
> > > additional kernel attack surface to userspace during the lengthy
> > > transitional period where some shipping devices require support for
> > > 32-bit binaries.
> > >
> > > Introduce a new sysctl 'fs.compat-binfmt-elf-enable' to allow the compat
> > > ELF loader to be disabled dynamically on devices where it is not required.
> > > On arm64, this is sufficient to prevent userspace from executing 32-bit
> > > code at all.
> > >
> > > Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> > > Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> > > Cc: Arnd Bergmann <arnd@xxxxxxxx>
> > > Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
> > > Cc: Kees Cook <keescook@xxxxxxxxxxxx>
> > > Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> > > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > Signed-off-by: Will Deacon <will@xxxxxxxxxx>
> > > ---
> > > fs/compat_binfmt_elf.c | 24 +++++++++++++++++++++++-
> > > 1 file changed, 23 insertions(+), 1 deletion(-)
> > >
> > > I started off hacking this into the arch code, but then I realised it was
> > > just as easy doing it in the core for everybody to enjoy. Unfortunately,
> > > after talking to Peter, it sounds like it doesn't really help on x86
> > > where userspace can switch to 32-bit without involving the kernel at all.
> > >
> > > Thoughts?
> >
> > I'm not sure I understand the logic behind the sysctl. Are you worried
> > about exposing attack surface on devices that don't support 32-bit
> > instructions at all but might be tricked into loading a 32-bit binary that
> > exploits a bug in the elf loader, or do you want to remove compat support
> > on some but not all devices running the same kernel?
>
> It's the latter case. With the GKI effort in Android, we want to run the
> same kernel binary across multiple devices. However, for some devices
> we may be able to determine that there is no need to support 32-bit
> applications even though the hardware may support them, and we would
> like to ensure that things like the compat syscall wrappers, compat vDSO,
> signal handling etc are not accessible to applications.

I like the idea! I wonder if the binfmts should have an "enabled" flag
instead? This would make it not compat_binfmt_elf-specific, and would
avoid a new "special" sysctl:

static bool enabled = true;
module_param(enabled, bool, 0600);
MODULE_PARM_DESC(enabled, "Whether this binfmt is available for loading");

Then:
echo 0 > /sys/module/compat_binfmt_elf/parameters/enabled

>
> > In the first case, having the kernel make the decision based on CPU
> > feature flags would be easier. In the second case, I would expect this
> > to be a per-process setting similar to prctl, capability or seccomp.
> > This would make it possible to do it separately per container
> > and avoid ambiguity about what happens to already-running 32-bit
> > tasks.
>
> I'm not sure I follow the per-process aspect of your suggestion -- we want
> to prevent 32-bit tasks from existing at all. If it wasn't for GKI, we'd
> just disable CONFIG_COMPAT altogether, but while there is a need for 32-bit
> support on some devices then we're not able to do that.

It's possible to do process-hierarchy-controlled compat-restriction on
all architectures with a seccomp ARCH test. For example:

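/* Load seccomp_data.arch; allow the native arch, kill everything else. */
BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, arch)),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 1, 0),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL_PROCESS),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),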

This filter will have a tiny, fixed overhead because of the automatic
seccomp bitmaps.
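
For anyone who wants to try the arch test from userspace, here is a
rough sketch of how those instructions might be wrapped up and
installed. The function names (build_native_only_filter,
install_native_only_filter) and the NATIVE_ONLY_FILTER_LEN macro are
made up for illustration, and the caller is assumed to pass the
AUDIT_ARCH_* value for its native ABI (AUDIT_ARCH_X86_64,
AUDIT_ARCH_AARCH64, ...). Treat it as a starting point, not a vetted
implementation:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical name: number of instructions in the filter below. */
#define NATIVE_ONLY_FILTER_LEN 4

/*
 * Fill 'insns' (room for NATIVE_ONLY_FILTER_LEN entries) with a filter
 * that allows syscalls made through the 'native_arch' ABI and kills the
 * process for any other ABI.
 */
void build_native_only_filter(struct sock_filter *insns, uint32_t native_arch)
{
	const struct sock_filter prog[NATIVE_ONLY_FILTER_LEN] = {
		/* A = seccomp_data.arch */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, arch)),
		/* If A == native_arch, skip the kill and land on the allow. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, native_arch, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};

	for (int i = 0; i < NATIVE_ONLY_FILTER_LEN; i++)
		insns[i] = prog[i];
}

/*
 * Install the filter on the calling thread. It is inherited across
 * fork() and execve(), so it covers the whole process hierarchy that
 * is started afterwards. Returns 0 on success, -1 with errno set on
 * failure.
 */
int install_native_only_filter(uint32_t native_arch)
{
	struct sock_filter insns[NATIVE_ONLY_FILTER_LEN];
	struct sock_fprog prog = {
		.len = NATIVE_ONLY_FILTER_LEN,
		.filter = insns,
	};

	build_native_only_filter(insns, native_arch);

	/* Needed unless the caller has CAP_SYS_ADMIN. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;

	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A launcher would call install_native_only_filter() once, before it
fork()s or execve()s the workload it wants to confine.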

FWIW, systemd exposes this feature via "SystemCallArchitectures=native".

This doesn't stop the loader attack surface, though, so I think
something to control that makes sense.

--
Kees Cook