Re: [RFC PATCH 0/3] memfd: cleanups for vm.memfd_noexec

From: Aleksa Sarai
Date: Wed Aug 02 2023 - 17:49:14 EST


On 2023-08-02, Jeff Xu <jeffxu@xxxxxxxxxxxx> wrote:
> > > > > > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
> > > > > > because it will make it far too difficult to ever migrate. Instead it
> > > > > > should imply MFD_EXEC.
> > > > > >
> > > > > Though the purpose of memfd_noexec=2 is not to help with migration -
> > > > > but to disable creation of executable memfd for the current system/pid
> > > > > namespace.
> > > > > > During the migration, vm.memfd_noexec=1 helps by overwriting the
> > > > > > flags for unmigrated user code as a temporary measure.
> > > >
> > > > My point is that the current behaviour for =2 means that nobody other
> > > > than *maybe* ChromeOS will ever be able to use it because it requires
> > > > auditing every program on the system. In fact, it's possible even
> > > > ChromeOS will run into issues given that one of the arguments made for
> > > > the nosymfollow mount option was that auditing all of ChromeOS to
> > > > replace every open with RESOLVE_NO_SYMLINKS would be too much effort[1]
> > > > (which I agreed with). Maybe this is less of an issue with
> > > > memfd_create(2) (which is much newer than open(2)) but it still seems
> > > > like a lot of busy work when the =1 behaviour is entirely sane even in
> > > > the strict threat model that =2 is trying to protect against.
> > > >
> > > > It can also be a container (one that has all memfd_create calls migrated
> > > > to the new API)
> >
> > If ChromeOS would struggle to rewrite all of the libraries they use,
> > containers are in even worse shape -- most container users don't have a
> > complete list of every package installed in a container, let alone the
> > ability to audit whether they pass a (no-op) flag to memfd_create(2) in
> > every codepath.
> >
> > > One option I considered previously was that "=2" would do overwrite+block,
> > > and "=3" would just block. But then I worry that applications won't have
> > > any motivation to ever change their existing code, and the setting will
> > > forever stay at "=2", making "=3" even less likely to ever be used
> > > system-wide.
> >
> > What is the downside of overwriting? Backwards-compatibility is a very
> > important part of Linux -- being able to use old programs without having
> > to modify them is incredibly important. Yes, this behaviour is opt-in --
> > but I don't see the point of making opting in more difficult than
> > necessary. Surely overwrite+block provides the security guarantee you
> > need from the threat model -- otherwise nobody will be able to use block
> > because you never know if one library will call memfd_create()
> > "incorrectly" without the new flags.
> >
> >
> > > > If you want to block syscalls that don't explicitly pass NOEXEC_SEAL,
> > > > there are several tools for doing this (both seccomp and LSM hooks).
> > > >
> > > > [1]: https://lore.kernel.org/linux-fsdevel/20200131212021.GA108613@xxxxxxxxxx/
> > > >
> > > > > Additional functionality/features should be implemented through
> > > > > security hook and LSM, not sysctl, I think.
> > > >
> > > > This issue with =2 cannot be fixed in an LSM. (On the other hand, you
> > > > could implement either =2 behaviour with an LSM using =1, and the
> > > > current strict =2 behaviour could be implemented purely with seccomp.)
> > > >
> > > By migration, I mean a system that is not fully migrated, such a
> > > system should just use "=0" or "=1". Additional features can be
> > > implemented in SELinux/Landlock/other LSM by a motivated dev. e.g. if
> > > a system wants to limit executable memfd to specific programs or fully
> > > disable it.
> > > "=2" is for a system/container that is fully migrated, in that case,
> > > SELinux/Landlock/LSM can do the same, but sysctl provides a convenient
> > > alternative.
> > > Yes, seccomp provides a similar mechanism. Indeed, combining "=1" with
> > > seccomp (blocking MFD_EXEC) will overwrite + block executable memfds,
> > > which is essentially what you want, iiuc. However, I do not wish to have
> > > this implemented in the kernel, because I want the kernel to get out of
> > > the business of "overwriting" eventually.
> >
> > See my above comments -- "overwriting" is perfectly acceptable to me.
> > There's also no way to "get out of the business of overwriting" -- Linux
> > has strict backwards compatibility requirements.
> >
>
> I agree: if we prioritize the short-term goal of requiring minimal
> changes from userspace applications, then a 4-state sysctl (or 2 sysctls,
> one controlling overwrite, one disabling/enabling executable memfd) will do.
> But with that approach, I'm afraid that in the future (say in 20
> years), most applications will still call memfd_create in the old API
> style, not setting the NX bit. The current approach might seem
> less convenient, but I hope it offers a bit of incentive for
> applications to migrate their code to the new API, explicitly
> setting the NX bit. I understand this hope is questionable, and we might
> still end up in the same place in 20 years, but at least I tried :-). I will
> leave this decision to the maintainers when you supply patches for that,
> and I wouldn't feel bad either way; there are valid reasons on both sides.

People will not switch =2 on if it has the possibility of breaking
existing programs that are doing nothing wrong by not passing a no-op
flag.

In 20 years at best you would have =1 in widespread use because the
rewriting behaviour is what users expect of kernel uAPIs. They expect
old programs to work without modifying them if they aren't doing
anything wrong. A uAPI knob that requires every userspace program to
change before you can safely enable it (especially because it ratchets
in a way that makes it dangerous to enable on production machines) will
simply not be used.
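
To make the difference concrete, here is a rough userspace model of the
two =2 behaviours being debated. This is only a sketch, not the kernel
code: the sketch_* names are made up for illustration, "overwrite+block"
is modelled as old-style calls getting MFD_NOEXEC_SEAL (as with =1) while
MFD_EXEC is rejected, and the errno values are placeholders.

```c
#include <errno.h>

/* Flag values as defined in <linux/memfd.h>, repeated here so the sketch
 * also builds against older headers. */
#define SKETCH_MFD_NOEXEC_SEAL 0x0008U
#define SKETCH_MFD_EXEC        0x0010U

/* Hypothetical names for the two =2 behaviours discussed in this thread. */
enum sketch_mode2 {
	SKETCH_STRICT,          /* reject anything without MFD_NOEXEC_SEAL */
	SKETCH_OVERWRITE_BLOCK, /* imply MFD_NOEXEC_SEAL, reject MFD_EXEC */
};

/*
 * Model of how a memfd_create() flags argument is treated for a given
 * vm.memfd_noexec value. Returns the effective flags, or a negative errno.
 * The errno values are placeholders, not a claim about the kernel.
 */
long sketch_memfd_flags(int sysctl, unsigned int flags, enum sketch_mode2 mode)
{
	int has_exec = !!(flags & SKETCH_MFD_EXEC);
	int has_seal = !!(flags & SKETCH_MFD_NOEXEC_SEAL);

	if (has_exec && has_seal)
		return -EINVAL; /* contradictory request */

	if (sysctl == 2) {
		if (has_exec)
			return -EACCES; /* both variants block executable memfds */
		if (!has_seal && mode == SKETCH_STRICT)
			return -EINVAL; /* old-style caller breaks here */
	}

	/* Old-style call: the sysctl picks the default ("overwriting"). */
	if (!has_exec && !has_seal)
		flags |= (sysctl == 0) ? SKETCH_MFD_EXEC : SKETCH_MFD_NOEXEC_SEAL;

	return (long)flags;
}
```

In this model an old-style call, sketch_memfd_flags(2, 0, SKETCH_STRICT),
fails, while the same call with SKETCH_OVERWRITE_BLOCK succeeds with a
non-executable memfd. In both variants an explicit MFD_EXEC is refused.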

If the goal is to get programs to update (which it seems it is), having
a knob that nobody will turn on doesn't help. Doing proper warning
logging is the way to get userspace to switch -- userspace usually
notices when their programs trigger warnings in dmesg.

> To supplement, there are two other ways for what you want:
> 1> seccomp to block MFD_EXEC, and leaving the setting to 1.

I made this point in an earlier mail.

However, my point is that =2 is not an acceptable uAPI, and if you want
something that looks like =2 you can implement that with seccomp
too!
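
As a rough illustration, a filter along these lines would give you the
strict behaviour without any sysctl. It is a minimal sketch only: it
assumes x86_64 (little-endian, so the low 32 bits of the flags argument
sit at the start of args[1]), the sketch_* names and the EACCES return
value are my own choices, and a real filter would also need to deal with
x32 syscall numbers.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Value from <linux/memfd.h>, repeated so older headers still work. */
#define SKETCH_MFD_NOEXEC_SEAL 0x0008U

/* Fail memfd_create() with EACCES unless MFD_NOEXEC_SEAL is passed. */
static struct sock_filter sketch_filter[] = {
	/* [0] Refuse to run on an architecture the filter was not written for. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* [3] Anything other than memfd_create() is allowed (jump to [8]). */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_memfd_create, 0, 3),
	/* [5] Load the flags argument and test for MFD_NOEXEC_SEAL. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
	BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, SKETCH_MFD_NOEXEC_SEAL, 1, 0),
	/* [7] Flag missing: fail the syscall. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
	/* [8] Everything else goes through. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Install the filter on the calling thread; returns 0 or a negative errno. */
int sketch_install_filter(void)
{
	struct sock_fprog prog = {
		.len = sizeof(sketch_filter) / sizeof(sketch_filter[0]),
		.filter = sketch_filter,
	};

	/* Needed so an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -errno;
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		return -errno;
	return 0;
}
```

Testing for MFD_EXEC instead (and swapping the two jump offsets in the
BPF_JSET instruction) would give the "block MFD_EXEC, leave the sysctl at
1" variant.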

In fact, the key difference is that you cannot implement the
rewriting easily in seccomp -- you would need to install a
seccomp_notify monitor that does nothing but rewrite syscall arguments.
This would be equivalent to running the entire system under GDB to work
around a uAPI flaw.

> 2> implement the blocking using a security hook and LSM, imo, which is
> probably the most common way to deal with this type of request (block
> something).

The issue is not the blocking, it's the rewriting.

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
