Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

From: Linus Torvalds
Date: Sun Nov 26 2023 - 15:24:22 EST


On Sun, 19 Nov 2023 at 23:11, kernel test robot <oliver.sang@xxxxxxxxx> wrote:
>
> kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops on:
>
> commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to SLAB_TYPESAFE_BY_RCU")

Ok, so __fget_light() is one of our more performance-critical things,
and that commit ends up routing it through the rather more expensive
__get_file_rcu(), so we have:

>      30.90 ±  4%     -20.6       10.35 ±  2%  perf-profile.self.cycles-pp.__fget_light
>       0.00           +26.5       26.48        perf-profile.self.cycles-pp.__get_file_rcu

and that trade (20 points less in __fget_light(), but 26 points more
in __get_file_rcu()) is what directly causes the ~3% regression.

I took a look at the code generation, and honestly, I think we're
better off just making __fget_files_rcu() have special logic for all
of this, and not use __get_file_rcu() at all.

The 'fd' case really is special because we need to do that
non-speculative pointer access.

Because it turns out that we already have to use array_index_nospec()
to safely generate that pointer to the fd entry, and once you have to
do that "use non-speculative accesses to generate a safe pointer", you
might as well just go whole hog.
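For reference, as far as I remember the macro in include/linux/nospec.h,
array_index_nospec() already boils down to
"index & array_index_mask_nospec(index, size)", so the mask is being
computed anyway and then thrown away after a single AND. This sketch is
from memory and skips the BUILD_BUG_ON() type checks, so treat it as an
approximation rather than a quote:

        #define array_index_nospec(index, size)                              \
        ({                                                                    \
                typeof(index) _i = (index);                                   \
                unsigned long _mask = array_index_mask_nospec(_i, size);      \
                                                                              \
                /* Apply the 0 / ~0 mask to the index itself */               \
                (typeof(_i))(_i & _mask);                                     \
        })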

So this takes a different approach, and uses the
array_index_mask_nospec() helper, which exists exactly to generate
that non-speculative mask for these things:

        /* Mask is a 0 for invalid fd's, ~0 for valid ones */
        mask = array_index_mask_nospec(fd, fdt->max_fds);

and then it does something you can consider either horribly clever, or
horribly ugly (this first part is basically just
array_index_nospec()):

        /* fdentry points to the 'fd' offset, or fdt->fd[0] */
        fdentry = fdt->fd + (fd&mask);

and then we can just *unconditionally* do the load - but since we
might be loading fd[0] for an invalid case, we need to mask the result
too:

        /* Do the load, then mask any invalid result */
        file = rcu_dereference_raw(*fdentry);
        file = (void *)(mask & (unsigned long)file);

but now we have done the whole lookup without any conditionals, and
the only branch left is "did we load NULL" - which also covers the
case where we masked off an invalid fd.
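If you want to see the pattern in isolation, here is a stand-alone
user-space sketch of the same trick. The names are made up, and
nospec_mask() just mimics the array_index_mask_nospec() contract (all
ones for an in-range index, zero otherwise, computed without a branch):

        #include <stdio.h>

        /* Same contract as array_index_mask_nospec(): ~0UL if idx < size, else 0.
         * Relies on arithmetic right shift of a negative value, like the kernel does. */
        static unsigned long nospec_mask(unsigned long idx, unsigned long size)
        {
                return ~(long)(idx | (size - 1UL - idx)) >> (sizeof(long) * 8 - 1);
        }

        static void *lookup(void **table, unsigned long size, unsigned long idx)
        {
                unsigned long mask = nospec_mask(idx, size);

                /* Always load from a valid slot (slot 0 for a bad index)... */
                void *entry = table[idx & mask];

                /* ...then mask the loaded value, so a bad index yields NULL */
                return (void *)(mask & (unsigned long)entry);
        }

        int main(void)
        {
                int a = 1, b = 2;
                void *table[] = { &a, &b };

                /* Prints two real pointers, and then (nil) for the bad index */
                printf("%p %p %p\n", lookup(table, 2, 0),
                       lookup(table, 2, 1), lookup(table, 2, 99));
                return 0;
        }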

Then we just do that atomic_long_inc_not_zero() on the file, and
re-check the pointer chain we used.
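Spelled out, the tail of the lookup loop then looks like this. It's
just the revalidation part of the patch below, condensed - "the
pointer chain" being both the fd entry we loaded from and the fdtable
pointer itself:

        /* Try to get a reference; if the count already hit zero, retry */
        if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
                continue;

        /* Re-check the chain we walked: the fd entry and the fdtable.
         * If either changed under us, drop the ref and retry. */
        if (unlikely(file != rcu_dereference_raw(*fdentry)) ||
            unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
                fput(file);
                continue;
        }
        /* At this point we hold a stable reference to 'file' */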

I made files_lookup_fd_raw() do the same thing.

The end result is much nicer code generation, at least from my quick
check. And I assume the regression will be gone, and hopefully even
turned into an improvement, since this path is so hot.

Comments? I also looked at that odd OPTIMIZER_HIDE_VAR() that
__get_file_rcu() does, and I don't get it. Both values come from
volatile accesses, so I don't see the point of those games; but I also
didn't care, since it's no longer in a critical code path.

Christian?

NOTE! This patch is not well tested. I verified an earlier version of
this, but have been playing with it since, so caveat emptor.

IOW, I might have messed up some "trivial cleanup" when prepping for
sending it out...

Linus
 fs/file.c               | 48 ++++++++++++++++++++++++++++++------------------
 include/linux/fdtable.h | 15 ++++++++++-----
 2 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 5fb0b146e79e..c74a6e8687d9 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -959,31 +959,42 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		struct file *file;
 		struct fdtable *fdt = rcu_dereference_raw(files->fdt);
 		struct file __rcu **fdentry;
+		unsigned long mask;
 
-		if (unlikely(fd >= fdt->max_fds))
+		/* Mask is a 0 for invalid fd's, ~0 for valid ones */
+		mask = array_index_mask_nospec(fd, fdt->max_fds);
+
+		/* fdentry points to the 'fd' offset, or fdt->fd[0] */
+		fdentry = fdt->fd + (fd&mask);
+
+		/* Do the load, then mask any invalid result */
+		file = rcu_dereference_raw(*fdentry);
+		file = (void *)(mask & (unsigned long)file);
+
+		if (unlikely(!file))
 			return NULL;
 
-		fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
+		/*
+		 * Ok, we have a file pointer that was valid at
+		 * some point, but it might have become stale since.
+		 *
+		 * We need to confirm it by incrementing the refcount
+		 * and then check the lookup again.
+		 *
+		 * atomic_long_inc_not_zero() gives us a full memory
+		 * barrier. We only really need an 'acquire' one to
+		 * protect the loads below, but we don't have that.
+		 */
+		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+			continue;
 
 		/*
-		 * Ok, we have a file pointer. However, because we do
-		 * this all locklessly under RCU, we may be racing with
-		 * that file being closed.
-		 *
 		 * Such a race can take two forms:
 		 *
 		 * (a) the file ref already went down to zero and the
 		 *     file hasn't been reused yet or the file count
 		 *     isn't zero but the file has already been reused.
-		 */
-		file = __get_file_rcu(fdentry);
-		if (unlikely(!file))
-			return NULL;
-
-		if (unlikely(IS_ERR(file)))
-			continue;
-
-		/*
+		 *
 		 * (b) the file table entry has changed under us.
 		 *     Note that we don't need to re-check the 'fdt->fd'
 		 *     pointer having changed, because it always goes
@@ -991,7 +1002,8 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 *
 		 * If so, we need to put our ref and try again.
 		 */
-		if (unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
+		if (unlikely(file != rcu_dereference_raw(*fdentry)) ||
+		    unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
 			fput(file);
 			continue;
 		}
@@ -1128,13 +1140,13 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 	 * atomic_read_acquire() pairs with atomic_dec_and_test() in
 	 * put_files_struct().
 	 */
-	if (atomic_read_acquire(&files->count) == 1) {
+	if (likely(atomic_read_acquire(&files->count) == 1)) {
 		file = files_lookup_fd_raw(files, fd);
 		if (!file || unlikely(file->f_mode & mask))
 			return 0;
 		return (unsigned long)file;
 	} else {
-		file = __fget(fd, mask);
+		file = __fget_files(files, fd, mask);
 		if (!file)
 			return 0;
 		return FDPUT_FPUT | (unsigned long)file;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index bc4c3287a65e..a8a8b4d24619 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -83,12 +83,17 @@ struct dentry;
 static inline struct file *files_lookup_fd_raw(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+	unsigned long mask = array_index_mask_nospec(fd, fdt->max_fds);
+	struct file *needs_masking;
 
-	if (fd < fdt->max_fds) {
-		fd = array_index_nospec(fd, fdt->max_fds);
-		return rcu_dereference_raw(fdt->fd[fd]);
-	}
-	return NULL;
+	/*
+	 * 'mask' is zero for an out-of-bounds fd, all ones for ok.
+	 * 'fd&mask' is 'fd' for ok, or 0 for out of bounds.
+	 *
+	 * Accessing fdt->fd[0] is ok, but needs masking of the result.
+	 */
+	needs_masking = rcu_dereference_raw(fdt->fd[fd&mask]);
+	return (struct file *)(mask & (unsigned long)needs_masking);
 }
 
 static inline struct file *files_lookup_fd_locked(struct files_struct *files, unsigned int fd)