Re: Review request: draft ioctl_userfaultfd(2) manual page

From: Andrea Arcangeli
Date: Wed May 03 2017 - 17:46:27 EST


On Fri, Apr 21, 2017 at 11:11:18AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Mike,
> Hello Andrea (we need your help!),

Sorry for not answering sooner! (I had a vacation last week)

>
> On 03/22/2017 02:54 PM, Mike Rapoport wrote:
> >> The various ioctl(2) operations are described below. The UFFDIO_API,
> >> UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
> >> userfaultfd behavior. These operations allow the caller to choose what
> >> features will be enabled and what kinds of events will be delivered to
> >> the application. The remaining operations are range operations. These
> >> operations enable the calling application to resolve page-fault events
> >> in a consistent way.
> >>
> >>
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âFIXME â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âAbove: What does "consistent" mean? â
> >> â â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >
> > Andrea, can you please help with this one?
>
> Let's see what Andrea has to say.

I think it doesn't mean anything and I see you already removed it, fine!

> So, the thing that was not clear, but now I think I understand:
> 'features' is an input field where one can ask about supported features
> (but none are supported, before Linux 4.11). Is that correct? I've changed
> the text here to read:
>
> Before the call, the features field must be initialized
> to zero. In the future, it is intended that this field can be
> used to ask whether particular features are supported.
>
> Seem okay?

Yes, but in reality nothing has changed. Simply the kernels before
4.11 had no feature support at all.

===
Starting from Linux 4.11, the features field can be used to ask
whether particular features are supported and explicitly enable
userfaultfd features that are disabled by default. The kernel
always reports all the available features in the features
field.
=====

I would prefer if we removed this 4.11 difference here.

We should be able to describe it simply as:

"The features field can be used to ask whether particular features are
supported and explicitly enable userfaultfd features that are disabled
by default. The kernel always reports all the available features in
the features field."

The whole point of this feature flag thing, is so the app at runtime
can check if the feature is available and ask for it. The fact kernels
before 4.11 don't support any feature is a detail.

> > There's a check in uffdio_api call that the user is not trying to enable
> > any other functionality and it asserts that uffdio_api.featurs is zero [1].
> > Starting from 4.11 the features negotiation is different. Now uffdio_call
> > verifies that it can support features the application requested [2].
>
> Okay.

I don't like the differentiation here between 4.11 and before, because
from user point of view nothing has changed.

I think this description is enough " Since Linux 4.11, the
following feature bits may be set: " and no other mention of 4.11 is
needed in the manpage. It looks an unnecessary complication to the reader.

>
> >> The kernel verifies that it can support the requested API version, and
> >> sets the features and ioctls fields to bit masks representing all the
> >> available features and the generic ioctl(2) operations available. Curâ
> >> rently, zero (i.e., no feature bits) is placed in the features field.
> >> The returned ioctls field can contain the following bits:
> >>
> >>
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âFIXME â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âThis user-space API seems not fully polished. Why â
> >> âare there not constants defined for each of the bit- â
> >> âmask values listed below? â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >>
> >> 1 << _UFFDIO_API
> >> The UFFDIO_API operation is supported.
> >>
> >> 1 << _UFFDIO_REGISTER
> >> The UFFDIO_REGISTER operation is supported.
> >>
> >> 1 << _UFFDIO_UNREGISTER
> >> The UFFDIO_UNREGISTER operation is supported.
> >
> > Well, I tend to agree. I believe the original intention was to use the
> > OR'ed mask, like UFFD_API_IOCTLS.
> > Andrea, can you add somthing?
>
> Yes, Andrea, please!

I agree it can be polished, but that's not something the manpage can
fix... For now the above is correct.

So about the error retvals I reviewed the final manpage from git which
is the latest version.


> >>
> >> EINVAL The userfaultfd has already been enabled by a previous UFFâ
> >> DIO_API operation.
> >>
> >> EINVAL The API version requested in the api field is not supported by
> >> this kernel, or the features field was not zero.
> >>
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âFIXME â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âIn the above error case, the returned 'uffdio_api' â
> >> âstructure zeroed out. Why is this done? This should â
> >> âbe explained in the manual page. â
> >> â â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >
> > In my understanding the uffdio_api structure is zeroed to allow the caller
> > to distinguish the reasons for -EINVAL.
>
> Andrea, can you please help here?

It is zeroed out just for robustness, it's a slow path. If userland by
mistake won't check -EINVAL but it checks uffdio_api.features to be
set or uffdio_api.ioctls or uffdio_api.api after the UFFDIO_API ioctl
returns, it will have a chance to catch the failure (it won't risk to
parse random uninitialized values at least).

I don't think it should be documented the uffdio_api is zeroed out or
if it is documented, we should say userland shouldn't depend on it and
it's done just for robustness.

The normal correct way to catch an error is to check -EINVAL, after
getting -EINVAL the contents of uffdio_api should be ignored by
userland.

> >> UFFDIO_REGISTER
> >> (Since Linux 4.3.) Register a memory address range with the userâ
> >> faultfd object. The argp argument is a pointer to a uffdio_register
> >> structure, defined as:
> >>
> >> struct uffdio_range {
> >> __u64 start; /* Start of range */
> >> __u64 len; /* Length of rnage (bytes) */
> >> };
> >>
> >> struct uffdio_register {
> >> struct uffdio_range range;
> >> __u64 mode; /* Desired mode of operation (input) */
> >> __u64 ioctls; /* Available ioctl() operations (output) */
> >> };
> >>
> >>
> >> The range field defines a memory range starting at start and continuing
> >> for len bytes that should be handled by the userfaultfd.
> >>
> >> The mode field defines the mode of operation desired for this memory
> >> region. The following values may be bitwise ORed to set the userâ
> >> faultfd mode for the specified range:
> >>
> >> UFFDIO_REGISTER_MODE_MISSING
> >> Track page faults on missing pages.
> >>
> >> UFFDIO_REGISTER_MODE_WP
> >> Track page faults on write-protected pages.
> >>
> >> Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
> >>
> >> If the operation is successful, the kernel modifies the ioctls bit-mask
> >> field to indicate which ioctl(2) operations are available for the specâ
> >> ified range. This returned bit mask is as for UFFDIO_API.
> >>
> >> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> >> and errno is set to indicate the cause of the error. Possible errors
> >> include:
> >>
> >>
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âFIXME â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âIs the following error list correct? â
> >> â â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >
> > Here again it maybe -EFAULT to indicate copy_{from,to}_user failure.
> > And, UFFDIO_REGISTER may return -ENOMEM if the process is exiting and the
> > mm_struct has gone by the time userfault grabs it.
>
> Okay -- added EFAULT. I think I'll skip ENOMEM for the moment, but
> will note the possibility in the page source.

There is also ENOMEM as result of split_vma failing, and it isn't the
cleanest thing that it means both real OOM or out of vmas
(mm->map_count >= sysctl_max_map_count, not real OOM) and that the
process is exiting or there isn't a single vma in the mm.

If there's one vma but it isn't in the range we return -EINVAL so we
could return probably -EINVAL if it's exiting or if there isn't a
single vma in the mm. And leave -ENOMEM for split_vma only.

In general -EINVAL is programmer error of some kind, -ENOMEM is
returned in memory related cases that trigger at runtime, however if
the process is exiting it's debatable if it's programmer error too
which is why I think we could return -EINVAL there.

I'd expect userland to threat -ENOMEM and -EINVAL about the same way.

> >> EINVAL There was no mapping in the specified address range.
> >>
> >> UFFDIO_COPY
> >> (Since Linux 4.3.) Atomically copy a continuous memory chunk into the
> >> userfault registered range and optionally wake up the blocked thread.
> >> The source and destination addresses and the number of bytes to copy
> >> are specified by the src, dst, and len fields of the uffdio_copy strucâ
> >> ture pointed to by argp:
> >>
> >> struct uffdio_copy {
> >> __u64 dst; /* Source of copy */
> >> __u64 src; /* Destinate of copy */
> >> __u64 len; /* Number of bytes to copy */
> >> __u64 mode; /* Flags controlling behavior of copy */
> >> __s64 copy; /* Number of bytes copied, or negated error */
> >> };
> >>
> >> The following value may be bitwise ORed in mode to change the behavior
> >> of the UFFDIO_COPY operation:
> >>
> >> UFFDIO_COPY_MODE_DONTWAKE
> >> Do not wake up the thread that waits for page-fault resolution
> >>
> >> The copy field is used by the kernel to return the number of bytes that
> >> was actually copied, or an error (a negated errno-style value).
> >>
> >>
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âFIXME â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âAbove: Why is the 'copy' field used to return error â
> >> âvalues? This should be explained in the manual â
> >> âpage. â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >
> > Andrea, can you help with this one, please?
>
> Yes, Andrea, please.

Well not just for error values. copy also tells how much did it copy
if a signal made it return short with -EINTR or some other error
happened in the middle of the copy after we already did some
copying-progress.

Writing any other error into .copy (and not only writing positive
values there) is for robustness in case userland won't check the ioctl
retval but nobody should depend on it. After the ioctl returns an
error userland should not check the uffdio_copy structure at all.

The thing to document is the amount of bytes successfully copied in
uffdio_copy.copy (which may be a short copy and in turn must be
checked... unless one knows the len matches the arch PAGE_SIZE but
it's still safer to check the uffdio_copy.copy field and be
consistent).

One more relevant retvals for UFFDIO_COPY and UFFDIO_ZEROPAGE that I
noticed is missing in the current manpage, is
-EEXIST. UFFDIO_COPY/ZEROPAGE don't behave like mmap/mremap,
UFFDIO_COPY/ZEROPAGE will never teardown any existing established
mapping to guarantee even if the user has race condition in its code,
there's no risk of silent memory corruption, instead a meaningful
error is returned by UFFDIO_COPY/ZEROPAGE. For example if two
UFFDIO_COPY run concurrently on the same page, only the first one will
succeed, the second will return -EEXIST and only the first page will
be copied and no memory corruption can happen this way (furthermore
the programmer can be notified of the race condition in the code which
might even be intentional, but more likely is not).

> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âFIXME â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
> >> âWhy is the 'zeropage' field used to return error â
> >> âvalues? This should be explained in the manual â
> >> âpage. â
> >> âââââââââââââââââââââââââââââââââââââââââââââââââââââââ
>
> Help is still needed for this FIXME!

Same as uffdio_copy.copy: because we've to return the number of pages
that have been zeroed out in case we run into an error (signal -EINTR
or -EEXIST etc..) after we already succeeded partially on a couple of
pages. So we may as well write the error in the same field if no pages
were zeroed out. This way the program will misbehave more than if that
field is left untouched (and it could even be random in such case as
it can be left uninitialized by userland). Clearly it only makes a
difference if the programmer forgets to check the UFFDIO_ZEROPAGE
ioctl retval and the retval must always be checked, so again, this is
only for robustness.

Awesome manpage! Super useful, it's fundamental to have a manpage
especially when the syscall is not simple and strightforward in
functionality.

Thanks!
Andrea