Re: [PATCH v3 1/2] vfs: add rcu-based find_inode variants for iget ops

From: Jan Kara
Date: Tue Jun 11 2024 - 06:50:29 EST


On Tue 11-06-24 12:16:31, Mateusz Guzik wrote:
> Instantiating a new inode normally takes the global inode hash lock
> twice:
> 1. once to check if it happens to already be present
> 2. once to add it to the hash
>
> The back-to-back lock/unlock pattern is known to degrade performance
> significantly, which is further exacerbated if the hash is heavily
> populated (long chains to walk, extending hold time). Arguably hash
> sizing and hashing algo need to be revisited, but that's beyond the
> scope of this patch.
>
> A long term fix would introduce finer-grained locking. An attempt was
> made several times, most recently in [1], but the effort appears
> stalled.
>
> A simpler idea which solves majority of the problem and which may be
> good enough for the time being is to use RCU for the initial lookup.
> Basic RCU support is already present in the hash. This being a temporary
> measure I tried to keep the change as small as possible.
>
> iget_locked consumers (notably ext4) get away without any changes
> because inode comparison method is built-in.
>
> iget5_locked and ilookup5_nowait consumers pass a custom callback. Since
> removal of locking adds more problems (inode can be changing) it's not
> safe to assume all filesystems happen to cope. Thus iget5_locked_rcu,
> ilookup5_rcu and ilookup5_nowait_rcu get added, requiring manual
> conversion.
>
> In order to reduce code duplication find_inode and find_inode_fast grow
> an argument indicating whether inode hash lock is held, which is passed
> down should sleeping be necessary. They always rcu_read_lock, which is
> redundant but harmless. Doing it conditionally reduces readability for
> no real gain that I can see. RCU-alike restrictions were already put on
> callbacks due to the hash spinlock being held.
>
> There is a real cache-busting workload scanning millions of files in
> parallel (it's a backup server thing), where the initial lookup is
> guaranteed to fail resulting in the 2 lock acquires.
>
> Implemented below is a synthehic benchmark which provides the same
> behavior. [I shall note the workload is not running on Linux, instead it
> was causing trouble elsewhere. Benchmark below was used while addressing
> said problems and was found to adequately represent the real workload.]
>
> Total real time fluctuates by 1-2s.
>
> With 20 threads each walking a dedicated 1000 dirs * 1000 files
> directory tree to stat(2) on a 32 core + 24GB RAM vm:
>
> ext4 (needed mkfs.ext4 -N 24000000):
> before: 3.77s user 890.90s system 1939% cpu 46.118 total
> after: 3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
>
> Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
>
> [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@xxxxxxxxxxxxx/
>
> Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx>

Nice speedups and the patch looks good to me. It would be lovely to get
Dave's speedups finished but this is already nice. I've found just two nits:

> +/**
> + * ilookup5 - search for an inode in the inode cache
^^^ ilookup5_rcu

> + * @sb: super block of file system to search
> + * @hashval: hash value (usually inode number) to search for
> + * @test: callback used for comparisons between inodes
> + * @data: opaque data pointer to pass to @test
> + *
> + * This is equivalent to ilookup5, except the @test callback must
> + * tolerate the inode not being stable, including being mid-teardown.
> + */
...
> +struct inode *ilookup5_nowait_rcu(struct super_block *sb, unsigned long hashval,
> + int (*test)(struct inode *, void *), void *data);

I'd prefer wrapping the above so that it fits into 80 columns.

Otherwise feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR