Re: [PATCH] readfile.2: new page describing readfile(2)

From: Heinrich Schuchardt
Date: Sat Jul 04 2020 - 22:54:44 EST


On 7/4/20 4:02 PM, Greg Kroah-Hartman wrote:
> readfile(2) is a new syscall to remove the need to do the
> open/read/close dance for small virtual files in places like procfs or
> sysfs.
>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> ---
>
> This patch is for the man-pages project, not the kernel source tree
>
> man2/readfile.2 | 159 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 159 insertions(+)
> create mode 100644 man2/readfile.2
>
> diff --git a/man2/readfile.2 b/man2/readfile.2
> new file mode 100644
> index 000000000000..449e722c3442
> --- /dev/null
> +++ b/man2/readfile.2
> @@ -0,0 +1,159 @@
> +.\" This manpage is Copyright (C) 2020 Greg Kroah-Hartman;
> +.\" and Copyright (C) 2020 The Linux Foundation
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH READFILE 2 2020-07-04 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +readfile \- read a file into a buffer
> +.SH SYNOPSIS
> +.nf
> +.B #include <unistd.h>
> +.PP
> +.BI "ssize_t readfile(int " dirfd ", const char *" pathname ", void *" buf \
> +", size_t " count ", int " flags );
> +.fi
> +.SH DESCRIPTION
> +.BR readfile ()
> +attempts to open the file specified by
> +.IR pathname
> +and to read up to
> +.I count
> +bytes from the file into the buffer starting at
> +.IR buf .
> +It is to be a shortcut of doing the sequence of

Just my personal preference for concise language:

It replaces the sequence of

> +.BR open ()
> +and then

%s/and then /, /

> +.BR read ()
> +and then

%s/and then/, and/

> +.BR close ()
> +for small files that are read frequently, such as those in

". It reduces system call overheads especially for small files, like
those in"

readfile() makes sense even if each individual file is only read once,
not frequently.

Below you describe that file up to 2GiB can be read. So readfile() seems
to be a shortcut for larger files too.

> +.B procfs
> +or
> +.BR sysfs .

Executing readfile() generates the same file notification events as said
individual calls (cf. fanotify.7, inotify.7).

> +.PP
> +If the size of file is smaller than the value provided in
> +.I count
> +then the whole file will be copied into
> +.IR buf .
> +.PP
> +If the file is larger than the value provided in
> +.I count
> +then only
> +.I count
> +number of bytes will be copied into
> +.IR buf .
> +.PP
> +The argument
> +.I flags
> +may contain one of the following
> +.IR "access modes" :
> +.BR O_NOFOLLOW ", or " O_NOATIME .
> +.PP
> +If the pathname given in
> +.I pathname
> +is relative, then it is interpreted relative to the directory
> +referred to by the file descriptor
> +.IR dirfd .
> +.PP
> +If
> +.I pathname
> +is relative and
> +.I dirfd
> +is the special value
> +.BR AT_FDCWD ,
> +then
> +.I pathname
> +is interpreted relative to the current working
> +directory of the calling process (like
> +.BR openat ()).
> +.PP
> +If
> +.I pathname
> +is absolute, then
> +.I dirfd
> +is ignored.

readfile() blocks until either the whole file has been read, the buffer
is completely filled, or the system specific limit (see below) has been
reached.

> +.SH RETURN VALUE
> +On success, the number of bytes read is returned.
> +It is not an error if this number is smaller than the number of bytes
> +requested; this can happen if the file is smaller than the number of
> +bytes requested.

"It is not an error ..." is very vague. Are there any other cases where
a file is only partially read and the number of bytes returned is less
then the minimum of buffer size and file size? How would I discover
truncation?

Or can I rely on the call returning either an error or said minimum
number of bytes? In the latter case:

"When reading from a block device this always equals the minimum of the
buffer size specified by 'count', the file size, and the system specific
limit for read.2 calls (see below)."

> +.PP
> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EFAULT
> +.I buf
> +is outside your accessible address space.
> +.TP
> +.B EINTR
> +The call was interrupted by a signal before any data was read; see
> +.BR signal (7).
> +.TP
> +.B EINVAL
> +.I flags
> +was set to a value that is not allowed.
> +.TP
> +.B EIO
> +I/O error.
> +This will happen for example when the process is in a
> +background process group, tries to read from its controlling terminal,
> +and either it is ignoring or blocking

Can we copy the description from read.2 which gives more information or
refer to it?

> +.B SIGTTIN
> +or its process group
> +is orphaned.
> +It may also occur when there is a low-level I/O error
> +while reading from a disk or tape.
> +A further possible cause of
> +.B EIO
> +on networked filesystems is when an advisory lock had been taken
> +out on the file descriptor and this lock has been lost.
> +See the
> +.I "Lost locks"
> +section of
> +.BR fcntl (2)
> +for further details.

EPERM is missing in this section. Cf. fanotify.7.

Best regards

Heinrich

> +.SH CONFORMING TO
> +None, this is a Linux-specific system call at this point in time.
> +.SH NOTES
> +The type
> +.I size_t
> +is an unsigned integer data type specified by POSIX.1.
> +.PP
> +On Linux,
> +.BR read ()
> +(and similar system calls) will transfer at most
> +0x7ffff000 (2,147,479,552) bytes,
> +returning the number of bytes actually transferred.
> +.\" commit e28cc71572da38a5a12c1cfe4d7032017adccf69
> +(This is true on both 32-bit and 64-bit systems.)
> +.SH BUGS
> +None yet!
> +.SH SEE ALSO
> +.BR close (2),
> +.BR open (2),
> +.BR openat (2),
> +.BR read (2),
> +.BR fread (3)
>