Re: [RFC] x86: restrict pid namespaces to 32 or 64 bit syscalls

From: Vasiliy Kulikov
Date: Sat Aug 13 2011 - 02:23:01 EST

On Fri, Aug 12, 2011 at 15:08 -0500, H. Peter Anvin wrote:
> On 08/12/2011 10:03 AM, Vasiliy Kulikov wrote:
> > This patch allows x86-64 systems with 32 bit syscalls support to lock a
> > pid namespace to 32 or 64 bitness syscalls/tasks. By denying rarely
> > used compatibility syscalls it reduces an attack surface for 32 bit
> > containers.
> >
> > The new sysctl is introduced, abi.bitness_locked. If set to 1, it locks
> > all tasks inside of current pid namespace to the bitness of init task
> > (pid_ns->child_reaper). After that:
> >
> > 1) a task trying to do a syscall of other bitness would get a signal as
> > if the corresponding syscall is not enabled (IDT entry/MSR is not
> > initialized).
> >
> > 2) loading ELF binaries of another bitness is prohibited (as if the
> > corresponding CONFIG_BINFMT_*=N).
>
> However, I have to question the value of this... if this is enabled in
> the system as a whole (as opposed to compiled out) it seems kind of
> pointless...

No, it is not for the system as a whole, but for containers (although it
is also possible to lock the whole system). We use OpenVZ kernels with
multiple containers; some of them are 32 bit, some are 64 bit. 64 bit
syscalls are not needed for 32 bit containers, and 32 bit syscalls are
not needed for 64 bit containers. As needless interfaces, they
unnecessarily increase the kernel attack surface. Some of the 32 bit
compatibility syscalls are rarely used and sometimes not well tested.

In IA-64 the IA-32 compatibility support was broken for 2 years:

On amd64, some specific rarely used syscalls might misbehave in a similar
way. Removing this attack vector is the goal of the patch.
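
To make the intended semantics concrete, here is a minimal userspace
model of the rule. It is only a sketch and not the patch code: the names
pid_ns_model, abi_bitness, syscall_allowed() and elf_load_allowed() are
made up for illustration, and the real check would live in the syscall
entry and binfmt paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names come from the patch. */
enum abi_bitness { ABI_32 = 32, ABI_64 = 64 };

struct pid_ns_model {
    bool bitness_locked;            /* models the abi.bitness_locked sysctl */
    enum abi_bitness init_bitness;  /* bitness of the namespace's init task */
};

/* Returns true if a syscall of bitness `sc` may proceed in namespace `ns`. */
static bool syscall_allowed(const struct pid_ns_model *ns, enum abi_bitness sc)
{
    assert(ns != NULL);
    if (!ns->bitness_locked)
        return true;                /* unlocked: both ABIs stay available */
    return sc == ns->init_bitness;  /* locked: only the init task's ABI */
}

/* The same rule applied to loading an ELF binary of bitness `elf`. */
static bool elf_load_allowed(const struct pid_ns_model *ns, enum abi_bitness elf)
{
    return syscall_allowed(ns, elf);
}
```

With a 32 bit init task and the lock set, syscall_allowed(&ns, ABI_64)
is false, so a 64 bit syscall from inside that container would be
rejected, while an unlocked namespace keeps both ABIs.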

> if there are bugs we need to deal with them anyway.


> > Questions/thoughts:
> >
> > The patch adds a check to the syscall code. Is it a significant
> > slowdown for fast syscalls? If so, would it be worth moving the check
> > into the scheduler code and enabling/disabling the corresponding
> > interrupt/MSRs on each task switch?
> >
> *YOU* are the person who needs to answer that question by providing
> measurements. Quite frankly I suspect checks in the syscall code *or*
> task switching MSRs are going to be unacceptable from a performance
> point of view.

OK, I'll do it.
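
As a starting point, a rough microbenchmark along these lines could give
a baseline for the cheapest syscall path. This is a sketch under my own
assumptions, not an agreed methodology: ns_per_getpid() is a made-up
helper, getpid is chosen only because it does almost no work in the
kernel, and a serious measurement would also need CPU pinning and many
repeated runs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Average cost in nanoseconds of one raw getpid syscall over `iters` calls. */
static double ns_per_getpid(long iters)
{
    struct timespec t0, t1;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);        /* raw syscall, bypasses any libc caching */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Elapsed wall time in nanoseconds, then averaged per call. */
    double elapsed = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                   + (double)(t1.tv_nsec - t0.tv_nsec);
    return elapsed / (double)iters;
}
```

Calling ns_per_getpid() with a few million iterations from both a 64 bit
build and a 32 bit build (gcc -m32), on kernels with and without the
patch, should show whether the added check is visible at all.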

Thank you,

Vasiliy Kulikov - bringing security into open computing environments