Re: [RFC PATCH 0/5] madvise MADV_DOEXEC

From: Steven Sistare
Date: Fri Jul 31 2020 - 12:14:51 EST


On 7/31/2020 11:27 AM, Matthew Wilcox wrote:
> On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
>> Matthew's sileby/mshare proposal has the same issue. If a process opts in
>> and mmap's an address in the shared region, then content becomes mapped at
>> a VA that was known to the pre-fork or pre-exec process. Trust must still
>> be established.
>
> It's up to the recipient whether they try to map it at the same address
> or at a fresh address. The intended use case is a "semi-shared" address
> space between two processes (ie partway between a threaded, fully-shared
> address space and a forked un-shared address space), in which case
> there's a certain amount of trust and cooperation between the processes.

Understood, but if the recipient does map at any of the same addresses, which is
the whole point because you want to share the page table, then the trust
relationship is no different than for the live update case.

> Your preservation-across-exec use-case might or might not need the
> VMA to be mapped at the same address.

It does. qemu registers memory with vfio, which remembers the VAs in kernel
metadata for the device.

> I don't know whether qemu stores
> pointers in this VMA which are absolute within the qemu address space.
> If it's just the emulated process's address space, then everything will
> be absolute within its own address space and everything will be opaque
> to qemu. If qemu is storing its own pointers in it, then it has to be
> mapped at the same address.

qemu does not do the latter but that could be a nice way for apps to use
preserved memory.

>>> Here is another suggestion.
>>>
>>> Have a very simple program that does:
>>>
>>> for (;;) {
>>> 	handle = dlopen("/my/real/program", RTLD_NOW);
>>> 	real_main = dlsym(handle, "main");
>>> 	real_main(argc, argv, envp);
>>> 	dlclose(handle);
>>> }
>>>
>>> With whatever obvious adjustments are needed to fit your usecase.
>>>
>>> That should give the same level of functionality, be portable to all
>>> unices, and not require you to duplicate code. I believe it limits you
>>> to not upgrading libc, or librt but that is a comparatively small
>>> limitation.
>>>
>>>
>>> Given that in general the interesting work is done in userspace and that
>>> userspace has provided an interface for reusing that work already,
>>> I don't see the justification for adding anything to exec at this point.
>>
>> Thanks for the suggestion. That is clever, and would make a fun project,
>> but I would not trust it for production. These few lines are just
>> the first of many that it would take to reset the environment to the
>> well-defined post-exec initial conditions that all executables expect,
>> and incrementally tearing down state will be prone to bugs. Getting a
>> clean slate from a kernel exec is a much more reliable design. The use
>> case is creating long-lived apps that never go down, and the simplest
>> implementation will have the fewest bugs and is the best. MADV_DOEXEC is
>> simple, and does not even require a new system call, and the kernel already
>> knows how to exec without bugs.
>
> It's a net increase of 200 lines of kernel code. If 4 lines of userspace
> code removes 200 lines of kernel code, I think I know which I prefer ...

It will be *far* more than 4 lines.
Much of the 200 lines is for the ELF opt-in, and much of the ELF code is from
Anthony reviving an earlier patch that uses MAP_FIXED_NOREPLACE during segment setup.
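
As a rough illustration of what that flag buys during segment setup, here is a
minimal userspace sketch, not the patch code. The helper name
map_segment_noreplace is made up for this example, and the fallback #define
assumes the asm-generic flag value for older headers. With MAP_FIXED_NOREPLACE
the kernel fails with EEXIST instead of clobbering whatever already lives in
the requested range, such as a preserved mapping:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
/* Assumption: asm-generic value, for headers that predate the flag. */
#define MAP_FIXED_NOREPLACE 0x100000
#endif

/*
 * Hypothetical helper: place an anonymous mapping at exactly 'addr'.
 * Returns 0 on success. Returns -1 with errno set to EEXIST if any part
 * of [addr, addr + len) is already mapped.
 */
static int map_segment_noreplace(void *addr, size_t len)
{
	void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
		       -1, 0);

	if (p == MAP_FAILED)
		return -1;

	if (p != addr) {
		/*
		 * Kernels that do not know the flag treat addr as a hint
		 * and may place the mapping elsewhere. Undo and report
		 * the collision the same way a newer kernel would.
		 */
		munmap(p, len);
		errno = EEXIST;
		return -1;
	}
	return 0;
}
```

Plain MAP_FIXED would silently replace the existing mapping in the same
situation, which is exactly what you do not want when the range holds
memory that was kept across exec.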

- Steve