Re: Official Linux system wrapper library?

From: Florian Weimer
Date: Sun Nov 11 2018 - 06:09:43 EST


* Michael Kerrisk:

> [adding in glibc folk for comment]
>
> On 11/10/18 7:52 PM, Daniel Colascione wrote:
>> Now that glibc is basically not adding any new system call wrappers,
>> how about publishing an "official" system call glue library as part of
>> the kernel distribution, along with the uapi headers? I don't think
>> it's reasonable to expect people to keep using syscall(__NR_XXX) for
>> all new functionality, especially as the system grows increasingly
>> sophisticated capabilities (like the new mount API, and hopefully the
>> new process API) outside the strictures of the POSIX process.
>
> As a quick glance at the glibc NEWS file shows, the above is not
> quite true:
>
> [[
> Version 2.28
> * The renameat2 function has been added...
> * The statx function has been added...
>
> Version 2.27
> * Support for memory protection keys was added. The <sys/mman.h> header now
> declares the functions pkey_alloc, pkey_free, pkey_mprotect...
> * The copy_file_range function was added.
>
> Version 2.26
> * New wrappers for the Linux-specific system calls preadv2 and pwritev2.
>
> Version 2.25
> * The getrandom [function] have been added.
> ]]
>
> I make that 11 system call wrappers added in the last 2 years.

And you missed mlock2 and memfd_create.

In some cases, we used system calls before the kernel had them (because
the kernel does not add system calls consistently across architectures).

On the other hand, this is only half of the story because distributions
do not backport system call wrappers, even those that backport kernel
implementations (or just rebase the kernel). This is something that
could be fixed eventually, but it is related to another problem:

We had a patch for the membarrier system call, but the kernel developers
could not tell us what the system call does in terms of the C/C++
memory model, and the kernel developers and our concurrency expert could
not agree on documentation.

A lot of the new system calls lack clear specifications or are just
somewhat misdesigned. For example, pkey_alloc uses PKEY_DISABLE_WRITE
and PKEY_DISABLE_ACCESS flags (where the latter implies disabling both
read and write access), not something that matches the PROT_READ and
PROT_WRITE flags used by mmap/mprotect. This caused problems when POWER
support for pkey_alloc was added, and we are still working on resolving
that.
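
To make the mismatch concrete, here is a minimal sketch (not glibc
code) of how the two pkey flags would translate into the PROT_* terms
that mmap/mprotect use. The helper name pkey_rights_to_prot is made up
for this example, the fallback flag values are an assumption taken from
the generic kernel uapi headers, and PROT_EXEC is ignored entirely:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>

/* Fallbacks for older headers; values assumed from the generic uapi. */
#ifndef PKEY_DISABLE_ACCESS
# define PKEY_DISABLE_ACCESS 0x1
#endif
#ifndef PKEY_DISABLE_WRITE
# define PKEY_DISABLE_WRITE 0x2
#endif

/* Hypothetical helper: express pkey access rights as PROT_* bits.
   The pkey flags say what is *disabled*, the PROT_* flags say what is
   *enabled*, and PKEY_DISABLE_ACCESS already covers writes.  */
static int
pkey_rights_to_prot (unsigned int rights)
{
  /* Only the two known flags are expected here. */
  assert ((rights & ~(PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)) == 0);

  if (rights & PKEY_DISABLE_ACCESS)
    return PROT_NONE;                 /* neither read nor write */
  if (rights & PKEY_DISABLE_WRITE)
    return PROT_READ;                 /* read-only */
  return PROT_READ | PROT_WRITE;      /* nothing disabled */
}
```

Note that the mapping is not reversible: there is no pkey flag
combination that corresponds to PROT_WRITE without PROT_READ.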

getrandom still causes boot delays because the kernel somehow fails to
seed its internal pool before starting PID 1 even on mainstream hardware
which has plenty of (true) randomness sources available, leading to
indefinite blocking of getrandom. It seems to me that people have
largely given up on fixing this in the upstream kernel.
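
For illustration only, this is roughly what an application has to do
today if it cannot afford to block at boot: ask with GRND_NONBLOCK and
treat EAGAIN as "pool not yet initialized". The helper name
try_getrandom and its return convention are assumptions of this sketch,
and it needs the <sys/random.h> header from glibc 2.25 or later:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/* Hypothetical helper.  Returns 1 if BUF was filled with LEN random
   bytes, 0 if the kernel pool is not initialized yet, and -1 on any
   other error (errno is left set by getrandom).  */
static int
try_getrandom (void *buf, size_t len)
{
  size_t done = 0;

  while (done < len)
    {
      ssize_t n = getrandom ((char *) buf + done, len - done,
                             GRND_NONBLOCK);
      if (n < 0)
        {
          if (errno == EINTR)
            continue;           /* interrupted by a signal, retry */
          if (errno == EAGAIN)
            return 0;           /* would block: pool not seeded */
          return -1;
        }
      done += (size_t) n;
    }
  return 1;
}
```

What the caller should do when this returns 0 is exactly the open
question; the sketch only detects the condition.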

For copy_file_range, we still have debates about whether the system
call (and the glibc emulation) should preserve holes or not, and there
are plans to lift the cross-device restriction.

For renameat2, we already had a function in gnulib with the same name,
but which did not provide the atomic RENAME_NOREPLACE behavior for which
renameat2 was introduced.
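
As a sketch of what callers without the glibc 2.28 wrapper end up
writing, this goes through syscall() directly. The helper name
rename_noreplace is made up for this example, and the fallback
definition of RENAME_NOREPLACE is an assumption based on the kernel
uapi headers:

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* AT_FDCWD */
#include <sys/syscall.h>  /* SYS_renameat2 */
#include <unistd.h>       /* syscall */

/* Fallback for older headers; value assumed from <linux/fs.h>. */
#ifndef RENAME_NOREPLACE
# define RENAME_NOREPLACE (1 << 0)
#endif

/* Hypothetical helper: rename OLDPATH to NEWPATH, failing with EEXIST
   instead of replacing NEWPATH if it already exists.  Returns 0 on
   success and -1 with errno set on failure.  The kernel performs the
   existence check and the rename as one atomic step.  */
static int
rename_noreplace (const char *oldpath, const char *newpath)
{
  return (int) syscall (SYS_renameat2, AT_FDCWD, oldpath,
                        AT_FDCWD, newpath, RENAME_NOREPLACE);
}
```

An emulation built from a separate existence check followed by
rename() cannot give the same guarantee, because another process can
create the target between the two steps.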

These problems are relevant to the backporting question. One relatively
low-cost way to backport straight wrappers would be to put them as
hidden functions into libc_nonshared.a. But with these uncertainties,
this would be rather risky because fixing bugs of the wrappers would
then require relinking.

Thanks,
Florian