Re: [PATCH] fs: add fincore(2) (mincore(2) for file descriptors)

From: Chris Frost
Date: Tue Feb 16 2010 - 13:21:49 EST


Add the fincore() system call. fincore() is mincore() for file descriptors.

The functionality of fincore() can be emulated with an mmap(), mincore(),
and munmap(), but this emulation requires more system calls and requires
page table modifications. fincore() can provide a significant performance
improvement for non-sequential in-core queries.

CC: Andi Kleen <andi@xxxxxxxxxxxxxx>
CC: Wu Fengguang <fengguang.wu@xxxxxxxxx>
CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Chris Frost <frost@xxxxxxxxxxx>
Signed-off-by: Steve VanDeBogart <vandebo@xxxxxxxxxxx>
---

Andi, Wu, and Andrew,

Thanks for the feedback. I have incorporated most of your suggestions
into this patch. A man page for fincore(2) is attached.

A few questions about the suggestions:

* Early return when a signal is queued
Andi pointed out that it would be good to return early if a signal is
queued for the calling process. I see two options to do this:
1. Return -EINTR (the patch in this email takes this approach).
2. Return the results discovered so far.

And the tradeoffs:
1. Returning -EINTR abandons good work.
2. fincore needs to inform the caller of the number of valid vec entries.
Three approaches:
- Return the number of file bytes with valid page entries.
Issue: The return type is a long. For a 32 bit process, sizeof(long)
is significantly less than sizeof(loff_t).
- Return the number of pages with valid page entries (use type ssize_t).
Issue: I don't see any significant issues.
- Add an 'loff_t *result' parameter to the system call (a la _llseek).
Issue: does this push the number of arguments too high?

Given the above, I feel that returning the number of valid page entries is
the best approach. Feedback?

* Radix gang lookup
Andi and Andrew suggested using radix gang lookups for larger ranges.
My benchmarks show that my change to do this isn't a strict performance
win. Attached is my patch that changes the below code to do gang lookups.
Here are the microbenchmark results for querying a 1 GiB file (results
are times in seconds):
density | none in-core | many in-core |
pages/syscall | 1 | 8192 | 1 | 8192 |
--------------+------+-------+------+-------|
fincore | .055 | .0045 | .078 | .0250 |
fincore-batch | .059 | .0010 | .123 | .0170 |
"none in-core" means none of the file pages were in ram and "many in-core"
that the majority were in ram. the next row indicates how many pages
are queried in each fincore(2) call by the process (1 or 8192 pages).
These results show that fincore-batch is ~2-5x faster for large count
calls, but that it is slower for small count calls. Both classes of calls
are made by our macrobenchmarks. The macrobenchmarks (a SQLite query and
a GIMP image blur) show the batch version slows performance by 2-12%.

Perhaps my gang patch can be further optimized, or is incorrect? (Of note,
I did not avoid the gets and puts.) If not, fincore_pages() could be
changed to decide choose which of a single vs. gang lookup to do based on
a heuristic tradeoff value. Or it could stay simple and only make single
lookups.

* Kernel output buffer size
Andi pointed out that the optimization to avoid allocating a buffer page
for small queries could be changed to claim more than 64 bytes of on-stack
buffer space. I haven't seen increasing this space to significantly affect
my particular macrobenchmarks. I'm ok with leaving it at 64 bytes or
changing it; this number was picked fairly arbitrarily.

* Upper query size limit
Andi suggested imposing a reasonable upper limit on the number of pages
that can be queried in a single fincore call, because the process would
not be debuggable while the system call is executing. Andi, could you
cite an example system call for me to look at that takes this approach?

* System call argument feasibility across architectures
The fincore(2) parameters are now reordered so that they should fit in six
registers on all platforms. I intend for the libc wrapper to expose the
call as shown in the man page, rather than the system call's ordering.
I think this issue is now resolved, but wanted to mention it just in case
I missed anything; does this look good for all architectures?

include/linux/syscalls.h | 2
fs/fincore.c | 126 +++++++++++++++++++++++++++++++++++++
fs/Makefile | 2
include/asm-generic/unistd.h | 5 +
arch/x86/ia32/ia32entry.S | 1
arch/x86/include/asm/unistd_32.h | 3
arch/x86/include/asm/unistd_64.h | 2
arch/x86/kernel/syscall_table_32.S | 1
8 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a990ace..814e4f5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -534,6 +534,8 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
+asmlinkage long sys_fincore(unsigned int fd, unsigned char __user *vec,
+ loff_t start, loff_t len);

asmlinkage long sys_pivot_root(const char __user *new_root,
const char __user *put_old);
diff --git a/fs/fincore.c b/fs/fincore.c
new file mode 100644
index 0000000..f329fe4
--- /dev/null
+++ b/fs/fincore.c
@@ -0,0 +1,126 @@
+/*
+ * fs/fincore.c
+ *
+ * Copyright (C) 2009, 2010 Chris Frost, UC Regents
+ * Copyright (C) 2008 Steve VanDeBogart, UC Regents
+ */
+
+/*
+ * The fincore() system call.
+ */
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+
+static unsigned char fincore_page(struct address_space *mapping, pgoff_t pgoff)
+{
+ unsigned char present = 0;
+ struct page *page = find_get_page(mapping, pgoff);
+ if (page) {
+ present = PageUptodate(page);
+ page_cache_release(page);
+ }
+
+ return present;
+}
+
+/*
+ * The fincore(2) system call.
+ *
+ * fincore() returns the memory residency status of the pages backing
+ * a file range specified by fd and [start, start + len).
+ * The status is returned in a vector of bytes. The least significant
+ * bit of each byte is 1 if the referenced page is in memory, otherwise
+ * it is zero.
+ *
+ * Because the status of a page can change after fincore() checks it
+ * but before it returns to the application, the returned vector may
+ * contain stale information. Only locked pages are guaranteed to
+ * remain in memory.
+ *
+ * Note that the parameter order for this system calls differ from the order
+ * for the libc wrapper. This syscall order allows the parameters to fit
+ * in six registers on all architectures.
+ *
+ * return values:
+ * zero - success
+ * -EBADF - fd is an illegal file descriptor
+ * -EFAULT - vec points to an illegal address
+ * -EINVAL - start is not a multiple of PAGE_CACHE_SIZE
+ * -EAGAIN - A kernel resource was temporarily unavailable
+ * -EINTR - The call was interrupted by a signal
+ */
+SYSCALL_DEFINE4(fincore, unsigned int, fd, unsigned char __user *, vec,
+ loff_t, start, loff_t, len)
+{
+ long retval;
+ pgoff_t pgoff = start >> PAGE_SHIFT;
+ pgoff_t npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ pgoff_t pgend = pgoff + npages;
+ struct file *filp;
+ int fput_needed;
+ loff_t file_nbytes;
+ pgoff_t file_npages;
+ unsigned char *kernel_vec = NULL;
+ unsigned char kernel_vec_small[64];
+ unsigned kernel_vec_count;
+ int i;
+
+ /* Check the start address: needs to be page-aligned.. */
+ if (start & ~PAGE_CACHE_MASK)
+ return -EINVAL;
+
+ filp = fget_light(fd, &fput_needed);
+ if (!filp)
+ return -EBADF;
+
+ file_nbytes = i_size_read(filp->f_mapping->host);
+
+ file_npages = (file_nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+ /*
+ * Allocate buffer vector page.
+ * Optimize allocation for small values of npages because the
+ * __get_free_page() call doubles fincore(2) runtime when npages == 1.
+ */
+ if (npages <= sizeof(kernel_vec_small)) {
+ kernel_vec = kernel_vec_small;
+ kernel_vec_count = sizeof(kernel_vec_small);
+ } else {
+ kernel_vec = (void *) __get_free_page(GFP_USER);
+ if (!kernel_vec) {
+ retval = -EAGAIN;
+ goto done;
+ }
+ kernel_vec_count = PAGE_SIZE;
+ }
+
+ while (pgoff < pgend) {
+ /*
+ * Do at most kernel_vec_count entries per iteration, due to
+ * the limited buffer size.
+ */
+ for (i = 0; pgoff < pgend && i < kernel_vec_count; pgoff++, i++)
+ kernel_vec[i] = fincore_page(filp->f_mapping, pgoff);
+
+ if (copy_to_user(vec, kernel_vec, i)) {
+ retval = -EFAULT;
+ break;
+ }
+ vec += i;
+
+ if (signal_pending(current)) {
+ retval = -EINTR;
+ break;
+ }
+ cond_resched();
+ }
+ retval = 0;
+done:
+ if (kernel_vec && kernel_vec != kernel_vec_small)
+ free_page((unsigned long) kernel_vec);
+ fput_light(filp, fput_needed);
+ return retval;
+}
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..a3ccd6b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
pnode.o drop_caches.o splice.o sync.o utimes.o \
- stack.o fs_struct.o
+ stack.o fs_struct.o fincore.o

ifeq ($(CONFIG_BLOCK),y)
obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
index d76b66a..ce76dc8 100644
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -623,8 +623,11 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
#define __NR_perf_event_open 241
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)

+#define __NR_fincore 242
+__SYSCALL(__NR_fincore, sys_fincore)
+
#undef __NR_syscalls
-#define __NR_syscalls 242
+#define __NR_syscalls 243

/*
* All syscalls below here should go away really,
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 581b056..cbf96e6 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -841,4 +841,5 @@ ia32_sys_call_table:
.quad compat_sys_pwritev
.quad compat_sys_rt_tgsigqueueinfo /* 335 */
.quad sys_perf_event_open
+ .quad sys_fincore
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..088b235 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,10 +342,11 @@
#define __NR_pwritev 334
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
+#define __NR_fincore 337

#ifdef __KERNEL__

-#define NR_syscalls 337
+#define NR_syscalls 338

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 8d3ad0a..ebc04b5 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
__SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
#define __NR_perf_event_open 298
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_fincore 299
+__SYSCALL(__NR_fincore, sys_fincore)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..1fdc8bc 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
.long sys_pwritev
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
+ .long sys_fincore
--
Chris Frost
http://www.frostnet.net/chris/
.\" Hey Emacs! This file is -*- nroff -*- source.
.\"
.\" Copyright (C) 2001 Bert Hubert <ahu@xxxxxxx>
.\" and Copyright (C) 2007 Michael Kerrisk <mtk.manpages@xxxxxxxxx>
.\" and Copyright (C) 2010 Chris Frost <frost@xxxxxxxxxxx>, UC Regents
.\"
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date. The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein. The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\" Created Sun Jun 3 17:23:32 2001 by bert hubert <ahu@xxxxxxx>
.\" Slightly adapted, following comments by Hugh Dickins, aeb, 2001-06-04.
.\" Modified, 20 May 2003, Michael Kerrisk <mtk.manpages@xxxxxxxxx>
.\" Modified, 30 Apr 2004, Michael Kerrisk <mtk.manpages@xxxxxxxxx>
.\" 2005-04-05 mtk, Fixed error descriptions
.\" after message from <gordon.jin@xxxxxxxxx>
.\" 2007-01-08 mtk, rewrote various parts
.\" Adapted for fincore, 2010-02-13, Chris Frost <frost@xxxxxxxxxxx>
.\"
.TH FINCORE 2 2010-02-13 "Linux" "Linux Programmer's Manual"
.SH NAME
fincore \- determine whether buffer cache pages are resident in memory
.SH SYNOPSIS
.B #define _GNU_SOURCE
.br
.B #include <fnctl.h>
.sp
.BI "int fincore(int " fd ", loff_t " start ", loff_t " length ", unsigned char *" vec );
.SH DESCRIPTION
.BR fincore ()
returns a vector that indicates whether pages for a file descriptor
are resident in core (RAM),
and so will not cause a disk access if read.
The kernel returns residency information about the pages
starting at the offset
.IR start ,
and continuing for
.I length
bytes.

The
.I start
argument must be a multiple of the system page size.
The
.I length
argument need not be a multiple of the page size,
but since residency information is returned for whole pages,
.I length
is effectively rounded up to the next multiple of the page size.
One may obtain the page size
.RB ( PAGE_SIZE )
using
.IR sysconf(_SC_PAGESIZE) .

The
.I vec
argument must point to an array containing at least
.I "(length+PAGE_SIZE\-1) / PAGE_SIZE"
bytes.
On return,
the least significant bit of each byte will be set if
the corresponding page is currently resident in memory,
and be clear otherwise.
(The settings of the other bits in each byte are undefined;
these bits are reserved for possible later use.)
Of course the information returned in
.I vec
is only a snapshot: pages that are not
locked in memory can come and go at any moment, and the contents of
.I vec
may already be stale by the time this call returns.
.SH "RETURN VALUE"
On success,
.BR fincore ()
returns zero.
On error, \-1 is returned, and
.I errno
is set appropriately.
.SH ERRORS
.B EAGAIN
kernel is temporarily out of resources.
.TP
.B EBADF
.I fd
is an illegal file descriptor.
.TP
.B EFAULT
.I vec
points to an invalid address.
.TP
.B EINVAL
.I start
is not a multiple of the page size.
.TP
.B EINTR
The call was interrupted by a signal.
.SH VERSIONS
Available since Linux 2.6.? and glibc 2.?.
.SH "CONFORMING TO"
.BR fincore ()
is not specified in POSIX.1-2001,
and it is not available on all Unix implementations.
.\" Does it exist on other systems?
.SH "SEE ALSO"
.BR mincore (2)
.SH COLOPHON
This page is part of release 3.23 of the Linux
.I man-pages
project.
A description of the project,
and information about reporting bugs,
can be found at
http://www.kernel.org/doc/man-pages/.
diff --git a/fs/fincore.c b/fs/fincore.c
index f329fe4..d1a5d65 100644
--- a/fs/fincore.c
+++ b/fs/fincore.c
@@ -11,19 +11,34 @@
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/pagemap.h>
+#include <linux/pagevec.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>

-static unsigned char fincore_page(struct address_space *mapping, pgoff_t pgoff)
+static void fincore_pages(struct address_space *mapping, pgoff_t pgoff,
+ unsigned vec_count, unsigned char *vec)
{
- unsigned char present = 0;
- struct page *page = find_get_page(mapping, pgoff);
- if (page) {
- present = PageUptodate(page);
- page_cache_release(page);
- }
+ struct pagevec pvec;
+ unsigned vec_i = 0;
+
+ pagevec_init(&pvec, 0);
+
+ while (vec_i < vec_count) {
+ unsigned pvec_max = min(vec_count - vec_i, (unsigned) PAGEVEC_SIZE);
+ unsigned pvec_i = 0;
+
+ pagevec_lookup(&pvec, mapping, pgoff, pvec_max);

- return present;
+ for (; vec_i < vec_count && pvec_i < pvec_max; vec_i++, pgoff++) {
+ if (pvec_i >= pagevec_count(&pvec) ||
+ pvec.pages[pvec_i]->index != pgoff)
+ vec[vec_i] = 0;
+ else
+ vec[vec_i] = PageUptodate(pvec.pages[pvec_i++]);
+ }
+
+ pagevec_release(&pvec);
+ }
}

/*
@@ -64,9 +79,8 @@ SYSCALL_DEFINE4(fincore, unsigned int, fd, unsigned char __user *, vec,
loff_t file_nbytes;
pgoff_t file_npages;
unsigned char *kernel_vec = NULL;
- unsigned char kernel_vec_small[64];
+ unsigned char kernel_vec_small[64]; /* 64 is fairly arbitrary */
unsigned kernel_vec_count;
- int i;

/* Check the start address: needs to be page-aligned.. */
if (start & ~PAGE_CACHE_MASK)
@@ -100,16 +114,19 @@ SYSCALL_DEFINE4(fincore, unsigned int, fd, unsigned char __user *, vec,
while (pgoff < pgend) {
/*
* Do at most kernel_vec_count entries per iteration, due to
- * the limited buffer size.
+ * the limited buffer size and the possibility that an loff_t
+ * can store a larger value than an unsigned.
*/
- for (i = 0; pgoff < pgend && i < kernel_vec_count; pgoff++, i++)
- kernel_vec[i] = fincore_page(filp->f_mapping, pgoff);
+ unsigned len = min((uint64_t) (pgend - pgoff),
+ (uint64_t) kernel_vec_count);
+ fincore_pages(filp->f_mapping, pgoff, len, kernel_vec);

- if (copy_to_user(vec, kernel_vec, i)) {
+ if (copy_to_user(vec, kernel_vec, len)) {
retval = -EFAULT;
break;
}
- vec += i;
+ pgoff += len;
+ vec += len;

if (signal_pending(current)) {
retval = -EINTR;