Re: Review request: draft ioctl_userfaultfd(2) manual page

From: Mike Rapoport
Date: Wed Mar 22 2017 - 09:55:20 EST


Hello Michael,

On Mon, Mar 20, 2017 at 09:11:07PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Andrea, Mike, and all,
>
> Mike: here's the split out page that describes the
> userfaultfd ioctl() operations.
>
> I'd like to get review input, especially from you and
> Andrea, but also anyone else, for the current version
> of this page, which includes quite a few FIXMEs to be
> sorted.
>
> I've shown the rendered version of the page below.
> The groff source is attached, and can also be found
> at the branch here:
>
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/log/?h=draft_userfaultfd
>
> The new ioctl_userfaultfd(2) page follows this mail.
>
> Cheers,
>
> Michael
>
> NAME
> userfaultfd - create a file descriptor for handling page faults in user
> space
>
> SYNOPSIS
> #include <sys/ioctl.h>
>
> int ioctl(int fd, int cmd, ...);
>
> DESCRIPTION
> Various ioctl(2) operations can be performed on a userfaultfd object
> (created by a call to userfaultfd(2)) using calls of the form:
>
> ioctl(fd, cmd, argp);
>
> In the above, fd is a file descriptor referring to a userfaultfd
> object, cmd is one of the commands listed below, and argp is a pointer
> to a data structure that is specific to cmd.
>
> The various ioctl(2) operations are described below. The UFFDIO_API,
> UFFDIO_REGISTER, and UFFDIO_UNREGISTER operations are used to configure
> userfaultfd behavior. These operations allow the caller to choose what
> features will be enabled and what kinds of events will be delivered to
> the application. The remaining operations are range operations. These
> operations enable the calling application to resolve page-fault events
> in a consistent way.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Above: What does "consistent" mean?                  │
> │                                                     │
> └─────────────────────────────────────────────────────┘

Andrea, can you please help with this one?

> UFFDIO_API
> (Since Linux 4.3.) Enable operation of the userfaultfd and perform API
> handshake. The argp argument is a pointer to a uffdio_api structure,
> defined as:
>
> struct uffdio_api {
> __u64 api; /* Requested API version (input) */
> __u64 features; /* Must be zero */
> __u64 ioctls; /* Available ioctl() operations (output) */
> };
>
> The api field denotes the API version requested by the application.
> Before the call, the features field must be initialized to zero.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Above: Why must the 'features' field be initialized  │
> │to zero?                                             │
> └─────────────────────────────────────────────────────┘

Until 4.11 the only supported feature is delegation of missing page faults,
and the UFFD_API_FEATURES bitmask is 0.
There's a check in the uffdio_api call that the user is not trying to enable
any other functionality: it checks that uffdio_api.features is zero [1].
Starting from 4.11 the features negotiation is different. Now the uffdio_api
call verifies that it can support the features the application requested [2].
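
To make the handshake concrete, here is a minimal sketch in C. It assumes a
kernel built with userfaultfd support and <linux/userfaultfd.h> installed;
the helper name uffd_open() is made up for illustration. On kernels before
4.11 the caller would pass 0 for features, as any other value gets EINVAL.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a userfaultfd and perform the UFFDIO_API
 * handshake. Returns the descriptor, or -1 with errno set.
 */
int uffd_open(uint64_t features, struct uffdio_api *api_out)
{
    struct uffdio_api api;
    int fd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    if (fd < 0)
        return -1;              /* e.g. ENOSYS or EPERM */

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;         /* requested API version */
    api.features = features;    /* must be 0 before Linux 4.11 */

    if (ioctl(fd, UFFDIO_API, &api) < 0) {
        int saved = errno;      /* keep the ioctl error across close() */

        close(fd);
        errno = saved;
        return -1;
    }

    if (api_out)
        *api_out = api;         /* features and ioctls filled in by kernel */
    return fd;
}
```

A caller would typically inspect api_out->ioctls before issuing any further
operations on the descriptor.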


> The kernel verifies that it can support the requested API version, and
> sets the features and ioctls fields to bit masks representing all the
> available features and the generic ioctl(2) operations available.
> Currently, zero (i.e., no feature bits) is placed in the features field.
> The returned ioctls field can contain the following bits:
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │This user-space API seems not fully polished. Why    │
> │are there not constants defined for each of the      │
> │bit-mask values listed below?                        │
> └─────────────────────────────────────────────────────┘
>
> 1 << _UFFDIO_API
> The UFFDIO_API operation is supported.
>
> 1 << _UFFDIO_REGISTER
> The UFFDIO_REGISTER operation is supported.
>
> 1 << _UFFDIO_UNREGISTER
> The UFFDIO_UNREGISTER operation is supported.

Well, I tend to agree. I believe the original intention was to use the
OR'ed mask, like UFFD_API_IOCTLS.
Andrea, can you add something?
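
For what it's worth, this is roughly how I would expect an application to
test the returned mask today. It is only a sketch: the two helper names are
invented here, while UFFD_API_IOCTLS and the _UFFDIO_* numbers come from
<linux/userfaultfd.h>.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>

/*
 * Hypothetical helper: is the operation with ioctl number 'nr'
 * (for example _UFFDIO_COPY) advertised in the 'ioctls' mask?
 */
int uffd_ioctl_supported(uint64_t ioctls, unsigned int nr)
{
    return (ioctls & ((uint64_t)1 << nr)) != 0;
}

/*
 * Hypothetical helper: are all the generic operations grouped in
 * UFFD_API_IOCTLS (API, REGISTER, UNREGISTER) advertised?
 */
int uffd_has_api_ioctls(uint64_t ioctls)
{
    return (ioctls & UFFD_API_IOCTLS) == UFFD_API_IOCTLS;
}
```

The first helper mirrors the "1 << _UFFDIO_xxx" notation used in the page;
the second uses the OR'ed mask directly.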

>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Is the above description of the 'ioctls' field       │
> │correct? Does more need to be said?                  │
> │                                                     │
> └─────────────────────────────────────────────────────┘

This is correct. I wouldn't add anything else.

> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> and errno is set to indicate the cause of the error. Possible errors
> include:
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Is the following error list correct?                 │
> │                                                     │
> └─────────────────────────────────────────────────────┘

There's also -EFAULT in case copy_{from,to}_user fails.

>
> EINVAL The userfaultfd has already been enabled by a previous
> UFFDIO_API operation.
>
> EINVAL The API version requested in the api field is not supported by
> this kernel, or the features field was not zero.
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │In the above error case, the returned 'uffdio_api'   │
> │structure is zeroed out. Why is this done? This      │
> │should be explained in the manual page.              │
> │                                                     │
> └─────────────────────────────────────────────────────┘

In my understanding the uffdio_api structure is zeroed to allow the caller
to distinguish the reasons for -EINVAL.
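
If that understanding is right, a caller could tell the two EINVAL cases
apart along these lines. This is a sketch under that assumption only: the
enum and the function name are hypothetical, and it relies on the caller
having put a nonzero value (UFFD_API) in the api field before the call.

```c
#include <linux/userfaultfd.h>
#include <string.h>

enum api_einval_reason {
    API_EINVAL_UNSUPPORTED,     /* kernel zeroed the structure */
    API_EINVAL_OTHER            /* structure untouched, e.g. already enabled */
};

/*
 * Hypothetical helper, to be called only after ioctl(fd, UFFDIO_API, api)
 * failed with EINVAL. Assumes the request had api->api set to UFFD_API,
 * so an all-zero structure can only come from the kernel.
 */
enum api_einval_reason uffd_classify_einval(const struct uffdio_api *api)
{
    struct uffdio_api zero;

    memset(&zero, 0, sizeof(zero));
    if (memcmp(api, &zero, sizeof(zero)) == 0)
        return API_EINVAL_UNSUPPORTED;
    return API_EINVAL_OTHER;
}
```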

> UFFDIO_REGISTER
> (Since Linux 4.3.) Register a memory address range with the
> userfaultfd object. The argp argument is a pointer to a uffdio_register
> structure, defined as:
>
> struct uffdio_range {
> __u64 start; /* Start of range */
> __u64 len; /* Length of range (bytes) */
> };
>
> struct uffdio_register {
> struct uffdio_range range;
> __u64 mode; /* Desired mode of operation (input) */
> __u64 ioctls; /* Available ioctl() operations (output) */
> };
>
>
> The range field defines a memory range starting at start and continuing
> for len bytes that should be handled by the userfaultfd.
>
> The mode field defines the mode of operation desired for this memory
> region. The following values may be bitwise ORed to set the
> userfaultfd mode for the specified range:
>
> UFFDIO_REGISTER_MODE_MISSING
> Track page faults on missing pages.
>
> UFFDIO_REGISTER_MODE_WP
> Track page faults on write-protected pages.
>
> Currently, the only supported mode is UFFDIO_REGISTER_MODE_MISSING.
>
> If the operation is successful, the kernel modifies the ioctls bit-mask
> field to indicate which ioctl(2) operations are available for the
> specified range. This returned bit mask is as for UFFDIO_API.
>
> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> and errno is set to indicate the cause of the error. Possible errors
> include:
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Is the following error list correct?                 │
> │                                                     │
> └─────────────────────────────────────────────────────┘

Here again it may be -EFAULT to indicate copy_{from,to}_user failure.
Also, UFFDIO_REGISTER may return -ENOMEM if the process is exiting and the
mm_struct has gone by the time userfault grabs it.

> EBUSY A mapping in the specified range is registered with another
> userfaultfd object.
>
> EINVAL An invalid or unsupported bit was specified in the mode field;
> or the mode field was zero.
>
> EINVAL There is no mapping in the specified address range.
>
> EINVAL range.start or range.len is not a multiple of the system page
> size; or, range.len is zero; or these fields are otherwise
> invalid.
>
> EINVAL There was an incompatible mapping in the specified address range.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Above: What does "incompatible" mean?                │
> │                                                     │
> └─────────────────────────────────────────────────────┘

Up to 4.10 userfault context may be registered only for MAP_ANONYMOUS |
MAP_PRIVATE mappings.

> UFFDIO_UNREGISTER
> (Since Linux 4.3.) Unregister a memory address range from userfaultfd.
> The address range to unregister is specified in the uffdio_range
> structure pointed to by argp.
>
> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> and errno is set to indicate the cause of the error. Possible errors
> include:
>
> EINVAL Either the start or the len field of the uffdio_range structure
> was not a multiple of the system page size; or the len field was
> zero; or these fields were otherwise invalid.
>
> EINVAL There was an incompatible mapping in the specified address range.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Above: What does "incompatible" mean?                │
> └─────────────────────────────────────────────────────┘

The same comments as for UFFDIO_REGISTER apply here as well.

> EINVAL There was no mapping in the specified address range.
>
> UFFDIO_COPY
> (Since Linux 4.3.) Atomically copy a continuous memory chunk into the
> userfault registered range and optionally wake up the blocked thread.
> The source and destination addresses and the number of bytes to copy
> are specified by the src, dst, and len fields of the uffdio_copy
> structure pointed to by argp:
>
> struct uffdio_copy {
> __u64 dst; /* Destination of copy */
> __u64 src; /* Source of copy */
> __u64 len; /* Number of bytes to copy */
> __u64 mode; /* Flags controlling behavior of copy */
> __s64 copy; /* Number of bytes copied, or negated error */
> };
>
> The following value may be bitwise ORed in mode to change the behavior
> of the UFFDIO_COPY operation:
>
> UFFDIO_COPY_MODE_DONTWAKE
> Do not wake up the thread that waits for page-fault resolution
>
> The copy field is used by the kernel to return the number of bytes that
> was actually copied, or an error (a negated errno-style value).
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Above: Why is the 'copy' field used to return error  │
> │values? This should be explained in the manual       │
> │page.                                                │
> └─────────────────────────────────────────────────────┘

Andrea, can you help with this one, please?

> If the value returned in copy doesn't match the value that was
> specified in len, the operation fails with the error EAGAIN. The copy field
> is output-only; it is not read by the UFFDIO_COPY operation.
>
> This ioctl(2) operation returns 0 on success. In this case, the entire
> area was copied. On error, -1 is returned and errno is set to indicate
> the cause of the error. Possible errors include:
>
> EAGAIN The number of bytes copied (i.e., the value returned in the copy
> field) does not equal the value that was specified in the len
> field.
>
> EINVAL Either dst or len was not a multiple of the system page size, or
> the range specified by src and len or dst and len was invalid.
>
> EINVAL An invalid bit was specified in the mode field.
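
One way an application might consume the copy field and the EAGAIN error
described above is sketched here. The function name and the retry policy
are my own assumptions, not something the page prescribes: on EAGAIN with
a positive copy value, the call is reissued for the remaining bytes.

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: copy 'len' bytes (a nonzero multiple of the page
 * size) into a registered range, resuming after partial copies.
 * Returns 0 on success, or -1 with errno set.
 */
int uffd_copy_all(int uffd, void *dst, const void *src, size_t len,
                  uint64_t mode)
{
    uint64_t d = (uintptr_t)dst;
    uint64_t s = (uintptr_t)src;
    uint64_t left = len;

    if (len == 0) {
        errno = EINVAL;         /* the kernel would reject this as well */
        return -1;
    }

    while (left > 0) {
        struct uffdio_copy cp;

        memset(&cp, 0, sizeof(cp));
        cp.dst = d;
        cp.src = s;
        cp.len = left;
        cp.mode = mode;         /* 0 or UFFDIO_COPY_MODE_DONTWAKE */

        if (ioctl(uffd, UFFDIO_COPY, &cp) == 0)
            return 0;           /* the whole remaining area was copied */

        if (errno != EAGAIN || cp.copy <= 0)
            return -1;          /* hard error, or nothing was copied */

        /* Partial copy: skip the cp.copy bytes already in place. */
        d += (uint64_t)cp.copy;
        s += (uint64_t)cp.copy;
        left -= (uint64_t)cp.copy;
    }
    return 0;
}
```
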
>
> UFFDIO_ZEROPAGE
> (Since Linux 4.3.) Zero out a memory range registered with
> userfaultfd. The requested range is specified by the range field of the
> uffdio_zeropage structure pointed to by argp:
>
> struct uffdio_zeropage {
> struct uffdio_range range;
> __u64 mode; /* Flags controlling behavior of copy */
> __s64 zeropage; /* Number of bytes zeroed, or negated error */
> };
>
> The following value may be bitwise ORed in mode to change the behavior
> of the UFFDIO_ZEROPAGE operation:
>
> UFFDIO_ZEROPAGE_MODE_DONTWAKE
> Do not wake up the thread that waits for page-fault resolution.
>
> The zeropage field is used by the kernel to return the number of bytes
> that was actually zeroed, or an error in the same manner as
> UFFDIO_COPY.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Why is the 'zeropage' field used to return error     │
> │values? This should be explained in the manual       │
> │page.                                                │
> └─────────────────────────────────────────────────────┘
> If the value returned in the zeropage field doesn't match the value
> that was specified in range.len, the operation fails with the error
> EAGAIN. The zeropage field is output-only; it is not read by the
> UFFDIO_ZEROPAGE operation.
>
> This ioctl(2) operation returns 0 on success. In this case, the entire
> area was zeroed. On error, -1 is returned and errno is set to indicate
> the cause of the error. Possible errors include:
>
> EAGAIN The number of bytes zeroed (i.e., the value returned in the
> zeropage field) does not equal the value that was specified in
> the range.len field.
>
> EINVAL Either range.start or range.len was not a multiple of the system
> page size; or range.len was zero; or the range specified was
> invalid.
>
> EINVAL An invalid bit was specified in the mode field.
>
> UFFDIO_WAKE
> (Since Linux 4.3.) Wake up the thread waiting for page-fault
> resolution on a specified memory address range. The argp argument is a
> pointer to a uffdio_range structure (shown above) that specifies the
> address range.
>
>
> ┌─────────────────────────────────────────────────────┐
> │FIXME                                                │
> ├─────────────────────────────────────────────────────┤
> │Need more detail here. How is the UFFDIO_WAKE        │
> │operation used?                                      │
> └─────────────────────────────────────────────────────┘

The UFFDIO_WAKE operation is used in conjunction with
UFFDIO_{COPY,ZEROPAGE} operations that have
UFFDIO_{COPY,ZEROPAGE}_MODE_DONTWAKE bit set in the mode field.
The userfault monitor can perform several UFFDIO_{COPY,ZEROPAGE} calls in a
batch and then explicitly wake up the faulting thread using UFFDIO_WAKE.

> This ioctl(2) operation returns 0 on success. On error, -1 is returned
> and errno is set to indicate the cause of the error. Possible errors
> include:
>
> EINVAL The start or the len field of the uffdio_range structure was not
> a multiple of the system page size; or len was zero; or the
> specified range was otherwise invalid.
>
> RETURN VALUE
> See descriptions of the individual operations, above.
>
> ERRORS
> See descriptions of the individual operations, above. In addition, the
> following general errors can occur for all of the operations described
> above:
>
> EFAULT argp does not point to a valid memory address.
>
> EINVAL (For all operations except UFFDIO_API.) The userfaultfd object
> has not yet been enabled (via the UFFDIO_API operation).
>
> CONFORMING TO
> These ioctl(2) operations are Linux-specific.
>
> EXAMPLE
> See userfaultfd(2).
>
> SEE ALSO
> ioctl(2), mmap(2), userfaultfd(2)
>
> Documentation/vm/userfaultfd.txt in the Linux kernel source tree
>

[1] http://lxr.free-electrons.com/source/fs/userfaultfd.c#L1199
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/userfaultfd.c#n1680