Re: [PATCH RFC 3/3] openat2.2: document new openat2(2) syscall
From: Aleksa Sarai
Date: Wed Oct 09 2019 - 06:17:56 EST
On 2019-10-09, Michael Kerrisk (man-pages) <mtk.manpages@xxxxxxxxx> wrote:
> Hello Aleksa,
>
> Thanks for this. It's a great piece of documentation work!
>
> I would prefer the path_resolution(7) piece as a separate patch.
Thanks, and will do.
> On 10/3/19 4:55 PM, Aleksa Sarai wrote:
> > Rather than trying to merge the new syscall documentation into open.2
> > (which would probably result in the man-page being incomprehensible),
> > instead the new syscall gets its own dedicated page with links between
> > open(2) and openat2(2) to avoid duplicating information such as the list
> > of O_* flags or common errors.
>
> Yes, looking at the size of the proposed openat2(2) page,
> this seems best.
> >
> > Signed-off-by: Aleksa Sarai <cyphar@xxxxxxxxxx>
> > ---
> > man2/open.2 | 5 +
> > man2/openat2.2 | 381 +++++++++++++++++++++++++++++++++++++++++
> > man7/path_resolution.7 | 57 ++++--
> > 3 files changed, 426 insertions(+), 17 deletions(-)
> > create mode 100644 man2/openat2.2
> >
> > diff --git a/man2/open.2 b/man2/open.2
> > index 7217fe056e5e..a0b43394bbee 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -65,6 +65,10 @@ open, openat, creat \- open and possibly create a file
> > .BI "int openat(int " dirfd ", const char *" pathname ", int " flags );
> > .BI "int openat(int " dirfd ", const char *" pathname ", int " flags \
> > ", mode_t " mode );
> > +.PP
> > +/* Docuented separately, in \fBopenat2\fP(2). */
>
> Documented
>
> > +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> > +const struct open_how *" how ", size_t " size ");
> > .fi
> > .PP
> > .in -4n
> > @@ -1808,6 +1812,7 @@ will create a regular file (i.e.,
> > .B O_DIRECTORY
> > is ignored).
> > .SH SEE ALSO
> > +.BR openat2 (2),
>
> Entries here should go in alphabetical order (within
> sections).
>
> > .BR chmod (2),
> > .BR chown (2),
> > .BR close (2),
> > diff --git a/man2/openat2.2 b/man2/openat2.2
> > new file mode 100644
> > index 000000000000..c43c76046243
> > --- /dev/null
> > +++ b/man2/openat2.2
> > @@ -0,0 +1,381 @@
> > +.\" Copyright (C) 2019 Aleksa Sarai <cyphar@xxxxxxxxxx>
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein. The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.TH OPENAT2 2 2019-10-03 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +openat2 \- open and possibly create a file (extended)
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/types.h>
> > +.B #include <sys/stat.h>
> > +.B #include <fcntl.h>
> > +.PP
> > +.BI "int openat2(int " dirfd ", const char *" pathname ", \
> > +const struct open_how *" how ", size_t " size ");
> > +.fi
> > +.PP
> > +.IR Note :
> > +There is no glibc wrapper for this system call; see NOTES.
> > +.SH DESCRIPTION
> > +The
> > +.BR openat2 ()
> > +system call is an extension of
> > +.BR openat (2)
> > +and provides a superset of its functionality. Rather than taking a single
>
> Please start new sentences on new source lines. I recently added this
> text in man-pages(7):
>
> Use semantic newlines
> In the source of a manual page, new sentences should be started on
> new lines, and long sentences should be split into lines at clause
> breaks (commas, semicolons, colons, and so on). This convention,
> sometimes known as "semantic newlines", makes it easier to see the
> effect of patches, which often operate at the level of individual
> sentences or sentence clauses.
>
> > +.I flag
> > +argument, an extensible structure (\fIhow\fP) is passed instead to allow for
> > +seamless future extensions.
>
> s/seamless//
>
> > +.PP
> > +.I size
> > +must be set to
> > +.IR "sizeof(struct open_how)" ,
> > +to facilitate future extensions (see the "Extensibility" section of the
> > +\fBNOTES\fP for more detail on how extensions are handled.)
> > +
> > +.SS The open_how structure
> > +The following structure indicates how
> > +.I pathname
> > +should be opened, and acts as a superset of the
> > +.IR flag " and " mode
> > +arguments to
> > +.BR openat (2).
> > +.PP
> > +.in +4n
> > +.EX
> > +struct open_how {
> > + uint32_t flags; /* open(2)-style O_* flags. */
> > + union {
> > + uint16_t mode; /* File mode bits for new file creation. */
> > + uint16_t upgrade_mask; /* Restrict how O_PATHs may be re-opened. */
> > + };
> > + uint32_t resolve; /* RESOLVE_* path-resolution flags. */
> > +};
> > +.EE
> > +.in
> > +.PP
> > +Any future extensions to
> > +.BR openat2 ()
> > +will be implemented as new fields appended to the above structure, with the
> > +zero value of the new fields acting as though the extension were not present.
> > +.PP
> > +The meaning of each field is as follows:
> > +.RS
> > +
> > +.I flags
> > +.RS
> > +The file creation and status flags to use for this operation. All of the
> > +.B O_*
> > +flags defined for
> > +.BR openat (2)
> > +are valid
> > +.BR openat2 ()
> > +flag values.
> > +.RE
> > +
> > +.I upgrade_mask
> > +.RS
> > +Restrict with which
> > +.I access modes
> > +the returned
> > +.B O_PATH
> > +descriptor may be re-opened (either through
> > +.B O_EMPTYPATH
> > +or
> > +.IR /proc/self/fd/ .)
> > +This field may only be set to a non-zero value if
> > +.I flags
> > +contains
> > +.BR O_PATH .
> > +By default, an
> > +.B O_PATH
> > +file descriptor of an ordinary file may be re-opened with any access mode (but an
> > +.B O_PATH
> > +file descriptor of a magic-link may only be re-opened with access modes that
> > +the original magic-link possessed). The full list of
>
> magic link (throughout the page)
>
> > +.I upgrade_mask
> > +flags is given below.
> > +.TP
> > +.B UPGRADE_NOREAD
> > +Do not permit the
> > +.B O_PATH
> > +file descriptor to be re-opened for reading (i.e.
> > +.BR O_RDONLY " or " O_RDWR .)
> > +.TP
> > +.B UPGRADE_NOWRITE
> > +Do not permit the
> > +.B O_PATH
> > +file descriptor to be re-opened for writing (i.e.
> > +.BR O_WRONLY ", " O_RDWR ", or " O_APPEND .)
> > +.RE
> > +.I resolve
> > +.RS
> > +Change how the components of
> > +.I pathname
> > +will be resolved (see
> > +.BR path_resolution (7)
> > +for background information.) The primary use-case for these flags is to allow
>
> use case
>
> > +trusted programs to restrict how un-trusted paths (or paths inside un-trusted
>
> untrusted
>
> > +directories) are resolved. The full list of
> > +.I resolve
> > +flags is given below.
> > +.TP
> > +.B RESOLVE_NO_XDEV
> > +Disallow all mount-point crossings during path resolution (including
>
> I think better would be: "Disallow traversal of mount points". Do you
> agree?
Yes, that sounds better.
> > +all bind-mounts).
>
> bind mounts
>
> > +
> > +Users of this flag are encouraged to make its use configurable (unless it is
> > +used for a specific security purpose), as bind-mounts are very widely used by
> > +end-users and thus enabling this flag globally may result in spurious errors on
> > +some systems.
> > +.TP
> > +.B RESOLVE_NO_SYMLINKS
> > +Disallow all symlink resolution during path resolution. If the trailing
>
> Disallow resolution of symbolic links during path resolution
>
> > +component is a symlink, and
>
> symbolic link (throughout the page)
>
> > +.I flags
> > +contains both
> > +.BR O_PATH " and " O_NOFOLLOW ","
> > +then an
> > +.B O_PATH
> > +file descriptor referencing the symlink will be returned. This option implies
> > +.BR RESOLVE_NO_MAGICLINKS .
> > +
> > +Users of this flag are encouraged to make its use configurable (unless it is
> > +used for a specific security purpose), as symlinks are very widely used by
> > +end-users and thus enabling this flag globally may result in spurious errors on
> > +some systems.
>
> It's not really clear what you mean by "enabling this flag globally".
> Could you reword, or explain in a bit more detail?
A better word might be "indiscriminately" -- the point being that if
a program uses it for every openat2() call (and users cannot disable
it), then the program will break on all sorts of systems.
> > +.TP
> > +.B RESOLVE_NO_MAGICLINKS
> > +Disallow all magic-link resolution during path resolution. If the trailing
> > +component is a magic-link, and
> > +.I flags
> > +contains both
> > +.BR O_PATH " and " O_NOFOLLOW ","
> > +then an
> > +.B O_PATH
> > +file descriptor referencing the magic-link will be returned.
> > +
> > +Magic-links are symlink-like objects that are most notably found in
> > +.BR proc (5)
> > +(examples include
> > +.IR /proc/[pid]/exe " and " /proc/[pid]/fd/* .)
> > +Due to the potential danger of unknowingly opening these magic-links, it may be
> > +preferable for users to disable their resolution entirely (see
> > +.BR symlink (7)
> > +for more details.)
> > +.TP
> > +.B RESOLVE_BENEATH
> > +Do not permit the path resolution to succeed if any component of the resolution
> > +is not a descendant of the directory indicated by
> > +.IR dirfd .
> > +This results in absolute symlinks (and absolute values of
> > +.IR pathname )
> > +to be rejected. Magic-link resolution is also not permitted.
>
> So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
> it would be good to state that more explicitly.
It does, though this might change in the future (some magic-link
resolutions might be safe -- but it's unclear what the semantics should
be). Users should explicitly set RESOLVE_NO_MAGICLINKS if they really
don't want to resolve them.
> > +
> > +.TP
> > +.B RESOLVE_IN_ROOT
> > +Temporarily treat
> > +.I dirfd
> > +as the root of the filesystem (as though the user called
>
> Perhaps better:
>
> Treat
> .I dirfd
> as the root directory while resolving
> .I pathname
> (as though...)
Yeah that sounds better.
> > +.BR chroot (2)
> > +with
> > +.IR dirfd
> > +as the argument.) Absolute symlinks and ".." path components will be scoped to
> > +.IR dirfd . Magic-link resolution is also not permitted.
>
> Insert a newline before "Magic" to fix a formatting problem.
>
> So, this flag implies RESOLVE_NO_MAGICLINKS? If yes,
> it would be good to state that more explicitly.
Same reply as above.
> > +
> > +However, unlike
> > +.BR chroot (2)
> > +(which changes the filesystem root persistently for an entire thread-group),
>
> s/persistently for an entire thread-group/
> /permanently for a process/
>
> > +.B RESOLVE_IN_ROOT
> > +allows a program to efficiently restrict path resolution for only certain
> > +operations. It also has several hardening features (such as not permitting
> > +magic-link resolution) which
> > +.BR chroot (2)
> > +does not.
> > +.RE
> > +
> > +.RE
> > +
> > +.PP
> > +Unlike
> > +.BR openat (2),
> > +any unknown flags set in fields of
> > +.I how
> > +will result in an error, rather than being ignored.
>
> Thank you, thank you, thank you. It was sad
> that openat() never fixed that antifeature.
No problem, it's bothered me for a long time as well. :D
> > In addition, an error will
> > +be returned if the value of the
> > +.IR mode " and " upgrade_mask
> > +union is non-zero unless:
> > +.RS
> > +.IP * 3
> > +.I flags
> > +indicates that a new file will be created (it contains
> > +.BR O_CREAT " or " O_TMPFILE ),
> > +in which case
> > +.I mode
> > +may be any valid file mode.
> > +.IP *
> > +.I flags
> > +contains
> > +.BR O_PATH ,
> > +in which case
> > +.I upgrade_mask
> > +must only contain valid
> > +.B UPGRADE_*
> > +flags.
> > +.RE
> > +
> > +.SH RETURN VALUE
> > +On success, a new file descriptor is returned. On error, -1 is returned, and
> > +.I errno
> > +is set appropriately.
> > +
> > +.SH ERRORS
> > +The set of errors returned by
> > +.BR openat2 ()
> > +includes all of the errors returned by
> > +.BR openat (2),
> > +as well as the following additional errors:
> > +.TP
> > +.B EINVAL
> > +An unknown flag or invalid value was specified in
> > +.IR how .
> > +.TP
> > +.B EINVAL
> > +.I size
> > +was smaller than any known version of
> > +.IR "struct open_how" .
> > +.TP
> > +.B E2BIG
> > +An extension was specified in
> > +.IR how ,
> > +which the current kernel does not support (see the "Extensibility" section of
> > +the \fBNOTES\fP for more detail on how extensions are handled.)
> > +.TP
> > +.B EAGAIN
> > +.I resolve
> > +contains either
> > +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> > +and the kernel could not ensure that a ".." component didn't escape (due to a
> > +race condition or potential attack). Callers may choose to retry the
> > +.BR openat2 ()
> > +call.
> > +.TP
> > +.B EXDEV
> > +.I resolve
> > +contains either
> > +.BR RESOLVE_IN_ROOT " or " RESOLVE_BENEATH ,
> > +and a path component attempted to escape the root of the resolution.
> > +
> > +.TP
> > +.B EXDEV
> > +.I resolve
> > +contains
> > +.BR RESOLVE_NO_XDEV ,
> > +and a path component attempted to cross a mount-point.
>
> mount point
>
> > +
> > +.TP
> > +.B ELOOP
> > +.I resolve
> > +contains
> > +.BR RESOLVE_NO_SYMLINKS ,
> > +and one of the path components was a symlink.
> > +.TP
> > +.B ELOOP
> > +.I resolve
> > +contains
> > +.BR RESOLVE_NO_MAGICLINKS ,
> > +and one of the path components was a magic-link.
> > +
> > +.SH VERSIONS
> > +.BR openat2 ()
> > +was added to Linux in kernel 5.FOO.
> > +
> > +.SH CONFORMING TO
> > +This system call is Linux-specific.
> > +
> > +The semantics of
> > +.B RESOLVE_BENEATH
> > +were modelled after FreeBSD's
> > +.BR O_BENEATH .
> > +
> > +.SH NOTES
> > +Glibc does not provide a wrapper for this system call; call it using
> > +.BR syscall (2).
> > +
> > +.SS Extensibility
> > +In order to allow for
> > +.I struct open_how
> > +to be extended in future kernel revisions,
> > +.BR openat2 ()
> > +requires userspace to specify what sized
>
> s/what sized/the size of/
>
> > +.I struct open_how
> > +structure they are passing. By providing this information, it is possible for
> > +.BR openat2 ()
> > +to provide both forwards- and backwards-compatibility \(em with
> > +.I size
> > +acting as an implicit version number (because new extension fields will always
> > +be appended, the size will always increase.) This extensibility design is very
> > +similar to other system calls such as
> > +.BR perf_setattr "(2), " perf_event_open "(2), and " clone (3).
>
> The following explanation of usize and ksize is great. Thanks for that.
Glad to hear you don't think it's too much fluff. :D
> > +If we let
> > +.I usize
> > +be the size of the structure according to userspace and
> > +.I ksize
> > +be the size of the structure which the kernel supports, then there are only
> > +three cases to consider:
> > +
> > +.RS
> > +.IP * 3
> > +If
> > +.IR ksize " equals " usize ,
> > +then there is no version mismatch and
> > +.I how
> > +can be used verbatim.
> > +.IP *
> > +If
> > +.IR ksize " is larger than " usize ,
> > +then there are some extensions the kernel supports which the userspace program
> > +is unaware of. Because all extensions must have their zero values be a no-op,
> > +the kernel treats all of the extension fields not set by userspace to have zero
> > +values. This provides backwards-compatibility.
> > +.IP *
> > +If
> > +.IR ksize " is smaller than " usize ,
> > +then there are some extensions which the userspace program is aware of but the
> > +kernel does not support. Because all extensions must have their zero values be
> > +a no-op, the kernel can safely ignore the unsupported extension fields if they
> > +are all-zero. If any unsupported extension fields are non-zero, then an error
> > +is returned. This provides forwards-compatibility.
> > +.RE
> > +
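To make the three cases above concrete, here is a minimal userspace sketch of the size handling. It is only an illustration under stated assumptions, not the kernel implementation: the function name copy_extensible_struct() is hypothetical, and the separate EINVAL check for a size smaller than any known version of the structure is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative model of the usize/ksize rules (hypothetical helper,
 * not kernel code).
 *
 * dst/ksize: the structure as the "kernel" side knows it.
 * src/usize: the structure as the caller passed it.
 *
 * Returns 0 on success, or -E2BIG if the caller set a field that
 * the "kernel" side does not know about.
 */
static int copy_extensible_struct(void *dst, size_t ksize,
                                  const void *src, size_t usize)
{
    size_t common = usize < ksize ? usize : ksize;
    const unsigned char *bytes = (const unsigned char *)src;

    /* ksize < usize: unknown trailing fields must be all-zero. */
    for (size_t i = common; i < usize; i++) {
        if (bytes[i] != 0)
            return -E2BIG;
    }

    /* ksize > usize: fields the caller did not pass read as zero. */
    memset(dst, 0, ksize);

    /* ksize == usize reduces to a plain copy of the whole structure. */
    memcpy(dst, src, common);
    return 0;
}
```

An older caller passing a short structure gets zeroed extension fields, and a newer caller passing a longer one succeeds only if the extra bytes are zero.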
> > +Therefore, most userspace programs will not need to have any special handling
> > +of extensions. However, if a userspace program wishes to determine what
> > +extensions the running kernel supports, they may conduct a binary search on
> > +.IR size
> > +(to find the largest value which doesn't produce an error.)
> > +
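The binary search mentioned above could look roughly like the sketch below. The names size_probe_fn, largest_accepted_size() and fake_kernel_probe() are hypothetical; in a real program the probe would wrap an openat2() call, and since all-zero unsupported fields are ignored, it would presumably need to put a non-zero value in the tail of the structure for a too-large size to fail. The search also assumes that the probe succeeds for the lower bound and that acceptance is monotonic in size.

```c
#include <stddef.h>

/*
 * Hypothetical probe: returns nonzero if a structure of "size" bytes
 * is accepted. A real probe would issue the system call.
 */
typedef int (*size_probe_fn)(size_t size);

/*
 * Stand-in probe used only for illustration: pretends the running
 * kernel supports structures of up to 24 bytes.
 */
static int fake_kernel_probe(size_t size)
{
    return size <= 24;
}

/*
 * Find the largest size in [lo, hi] that the probe accepts.
 * Precondition (assumed): probe(lo) succeeds, and every size above
 * the largest accepted one is rejected.
 */
static size_t largest_accepted_size(size_probe_fn probe,
                                    size_t lo, size_t hi)
{
    while (lo < hi) {
        /* Round the midpoint up so the range always shrinks. */
        size_t mid = lo + (hi - lo + 1) / 2;

        if (probe(mid))
            lo = mid;       /* mid is accepted: search above it. */
        else
            hi = mid - 1;   /* mid is rejected: search below it. */
    }
    return lo;
}
```

With the stand-in probe, largest_accepted_size(fake_kernel_probe, 8, 64) yields 24.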
> > +.SH SEE ALSO
> > +.BR openat (2),
> > +.BR path_resolution (7),
> > +.BR symlink (7)
> > diff --git a/man7/path_resolution.7 b/man7/path_resolution.7
> > index 85dd354e9a93..3da3e5b614c8 100644
> > --- a/man7/path_resolution.7
> > +++ b/man7/path_resolution.7
> > @@ -29,17 +29,17 @@ path_resolution \- how a pathname is resolved to a file
> > Some UNIX/Linux system calls have as parameter one or more filenames.
> > A filename (or pathname) is resolved as follows.
> > .SS Step 1: start of the resolution process
> > -If the pathname starts with the \(aq/\(aq character,
> > -the starting lookup directory
> > -is the root directory of the calling process.
> > -(A process inherits its
> > -root directory from its parent.
> > -Usually this will be the root directory
> > -of the file hierarchy.
> > -A process may get a different root directory
> > -by use of the
> > +If the pathname starts with the \(aq/\(aq character, the starting lookup
> > +directory is the root directory of the calling process. (A process inherits its
> > +root directory from its parent. Usually this will be the root directory of the
> > +file hierarchy. A process may get a different root directory by use of the
> > .BR chroot (2)
> > -system call.
> > +system call, or may temporarily use a different root directory by using
> > +.BR openat2 (2)
> > +with the
> > +.B RESOLVE_IN_ROOT
> > +flag set.
> > +.PP
> > A process may get an entirely private mount namespace in case
> > it\(emor one of its ancestors\(emwas started by an invocation of the
> > .BR clone (2)
> > @@ -48,16 +48,24 @@ system call that had the
> > flag set.)
> > This handles the \(aq/\(aq part of the pathname.
> > .PP
> > -If the pathname does not start with the \(aq/\(aq character, the
> > -starting lookup directory of the resolution process is the current working
> > -directory of the process.
> > -(This is also inherited from the parent.
> > -It can be changed by use of the
> > +If the pathname does not start with the \(aq/\(aq character, the starting
> > +lookup directory of the resolution process is the current working directory of
> > +the process \(em or in the case of
> > +.BR openat (2)-style
> > +syscalls, the
>
> system calls
>
> > +.I dfd
> > +argument (or the current working directory if
> > +.B AT_FDCWD
> > +is passed as the
> > +.I dfd
> > +argumnet). The current working directory is inherited from the parent, and can
>
> argument
>
> > +be changed by use of the
> > .BR chdir (2)
> > -system call.)
> > +syscall.
>
> "system call" please.
>
> > .PP
> > Pathnames starting with a \(aq/\(aq character are called absolute pathnames.
> > Pathnames not starting with a \(aq/\(aq are called relative pathnames.
> > +
>
> No blank line here.
>
> > .SS Step 2: walk along the path
> > Set the current lookup directory to the starting lookup directory.
> > Now, for each nonfinal component of the pathname, where a component
> > @@ -124,6 +132,13 @@ the kernel's pathname-resolution code
> > was reworked to eliminate the use of recursion,
> > so that the only limit that remains is the maximum of 40
> > resolutions for the entire pathname.
> > +.PP
> > +The resolution of syscalls during this stage can be blocked by using
>
> "resolution of syscall" seems wrong? "syscall" should be something
> else?
Yeah, should be "resolution of symlinks". ;)
> > +.BR openat2 (2),
> > +with the
> > +.B RESOLVE_NO_SYMLINKS
> > +flag set.
> > +
> > .SS Step 3: find the final entry
> > The lookup of the final component of the pathname goes just like
> > that of all other components, as described in the previous step,
> > @@ -160,7 +175,8 @@ The path resolution process will assume that these entries have
> > their conventional meanings, regardless of whether they are
> > actually present in the physical filesystem.
> > .PP
> > -One cannot walk down past the root: "/.." is the same as "/".
> > +One cannot walk up past the root: "/.." is the same as "/".
> > +
>
> No blank line please.
>
> > .SS Mount points
> > After a "mount dev path" command, the pathname "path" refers to
> > the root of the filesystem hierarchy on the device "dev", and no
> > @@ -169,6 +185,13 @@ longer to whatever it referred to earlier.
> > One can walk out of a mounted filesystem: "path/.." refers to
> > the parent directory of "path",
> > outside of the filesystem hierarchy on "dev".
> > +.PP
> > +Mount-point crossings can be blocked by using
>
> Traversal of mount points can be disallowed by...
>
> > +.BR openat2 (2),
> > +with the
> > +.B RESOLVE_NO_XDEV
> > +flag set (though note that this also restricts bind-mount crossings).
> > +
>
> No blank line please.
>
> > .SS Trailing slashes
> > If a pathname ends in a \(aq/\(aq, that forces resolution of the preceding
> > component as in Step 2: it has to exist and resolve to a directory.
> >
Thanks so much, and I'll clean up your nits.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>