Re: [RFC PATCH] vfs: add rcu-based find_inode variants for iget ops
From: Mateusz Guzik
Date: Fri Jun 07 2024 - 03:52:51 EST
On Fri, Jun 07, 2024 at 12:04:58PM +1000, Dave Chinner wrote:
> On Thu, Jun 06, 2024 at 04:05:15PM +0200, Mateusz Guzik wrote:
> > Instantiating a new inode normally takes the global inode hash lock
> > twice:
> > 1. once to check if it happens to already be present
> > 2. once to add it to the hash
> >
> > The back-to-back lock/unlock pattern is known to degrade performance
> > significantly, which is further exacerbated if the hash is heavily
> > populated (long chains to walk, extending hold time). Arguably hash
> > sizing and hashing algo need to be revisited, but that's beyond the
> > scope of this patch.
> >
> > A long term fix would introduce fine-grained locking, this was attempted
> > in [1], but that patchset was already posted several times and appears
> > stalled.
>
> Why not just pick up those patches and drive them to completion?
>
Time constraints on my end aside, per your own e-mail [1] from last year the problems are:
> - A lack of recent validation against ext4, btrfs and other
> filesystems.
> - the loss of lockdep coverage by moving to bit locks
> - it breaks CONFIG_PREEMPT_RT=y because we nest other spinlocks
> inside the inode_hash_lock and we can't do that if we convert the
> inode hash to bit locks because RT makes spinlocks sleeping locks.
> - There's been additions for lockless RCU inode hash lookups from
> AFS and ext4 in weird, uncommon corner cases and I have no idea
> how to validate they still work correctly with hash-bl. I suspect
> they should just go away with hash-bl, but....
> There's more, but these are the big ones.
I did see that the lockdep and preempt_rt problems were patched up in a
later iteration ([2]).
What we both agree on is that the patchset adds enough complexity that
it needs solid justification. I assumed one was there on your end when
you posted it.
For that entire patchset I don't have one. I can however justify the
comparatively trivial thing I posted in this thread.
That aside, if I had to make the entire thing scale I would approach
things differently, most notably in terms of locking granularity. Per
your own statement things can be made to look great in microbenchmarks,
but that does not necessarily mean they help. A lot of it is a tradeoff,
and making everything per-cpu for this particular problem may be taking
it too far.
For the inode hash, the old hack of having a lock array may well do the
job more than well enough -- say 32 locks (or some other number
dependent on the hash size), each covering a dedicated subset of
buckets. This very plausibly would scale well enough even in a
microbenchmark with a high core count. Also note that having regular
spinlocks there would retain all lockdep and whatnot coverage, while
having a smaller memory footprint than the variant coming in [2], which
adds a regular spinlock for each bucket.
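Not from any posted patch, just a minimal sketch of the lock-array idea,
assuming it lives in fs/inode.c next to inode_hashtable; the names
INODE_HASH_LOCK_CNT and inode_hash_locks are made up, and the count
could just as well be derived from the hash size:

#define INODE_HASH_LOCK_CNT	32

/* spin_lock_init() for each entry at hash setup time, omitted here */
static spinlock_t inode_hash_locks[INODE_HASH_LOCK_CNT];

/* Map a hash bucket to the spinlock covering its subset of chains. */
static inline spinlock_t *inode_hash_lock_of(struct hlist_head *head)
{
	return &inode_hash_locks[(head - inode_hashtable) % INODE_HASH_LOCK_CNT];
}

Lookup/insert would then take inode_hash_lock_of(head) instead of the
global inode_hash_lock, so contention splits across the array while
lockdep keeps seeing ordinary spinlocks.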
In a similar spirit I'm not convinced per-cpu lists for super blocks are
warranted either. Instead, some number per numa domain or scaled with
the core count, with the backing area allocated from the respective
domain, would probably do the trick well enough. Even if per-cpu
granularity is needed, the flat array allocation seen in the dlist patch
warrants adjustment:
+ dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
+ GFP_KERNEL);
that is, it should allocate separate arrays per domain, but then
everything else has to be adjusted to know which array to index into.
Maybe it should use the alloc_percpu machinery instead.
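Roughly, what I have in mind (a hypothetical sketch only;
dlock_list_head comes from the dlist patch, everything else here is made
up, and all users would still have to learn which array to index into):

static struct dlock_list_head *node_heads[MAX_NUMNODES];

static int alloc_node_heads(void)
{
	int node;

	for_each_online_node(node) {
		int ncpus = cpumask_weight(cpumask_of_node(node));

		/* back each domain's array with memory from that domain */
		node_heads[node] = kcalloc_node(ncpus,
						sizeof(struct dlock_list_head),
						GFP_KERNEL, node);
		if (!node_heads[node])
			return -ENOMEM;	/* error unwinding omitted */
	}
	return 0;
}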
All that said, if someone(tm) wants to pick up your patchset and even
commit it the way it is right now, I'm not going to protest.
I have neither the time nor the justification to do the full work My Way(tm).
I did have time to hack up the initial rcu lookup, which imo does not
require much justification for inclusion and does help the case I'm
interested in.
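For reference, the shape of the idea (not the posted patch verbatim; the
function name and the revalidation details are simplified for
illustration):

/* caller holds rcu_read_lock(); returns a candidate or NULL */
static struct inode *find_inode_rcu_sketch(struct super_block *sb,
					   struct hlist_head *head,
					   unsigned long ino)
{
	struct inode *inode;

	hlist_for_each_entry_rcu(inode, head, i_hash) {
		if (inode->i_ino != ino || inode->i_sb != sb)
			continue;
		if (READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE))
			continue;
		return inode;
	}
	/* nothing usable found; fall back to the inode_hash_lock path */
	return NULL;
}

On a hit the caller still has to validate and grab a reference under
i_lock; the point is merely that the common "already hashed" case avoids
the global lock entirely.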
[1] https://lore.kernel.org/linux-fsdevel/ZUautmLUcRyUqZZ+@xxxxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-11-david@xxxxxxxxxxxxx/