Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper
From: Ian Kent
Date: Wed May 20 2026 - 03:16:44 EST
On 19/5/26 17:12, Jan Kara wrote:
On Mon 18-05-26 21:39:13, Ian Kent wrote:
On 18/5/26 16:19, Jan Kara wrote:OK, but that's not necessarily true. I have seen these complaints from the
Hi Ian,Didn't quantify that.
On Mon 18-05-26 10:55:43, Ian Kent wrote:
On 18/5/26 07:55, NeilBrown wrote:Do you mean there are very many positive children of a directory?
On Fri, 15 May 2026, Horst Birthelmer wrote:But the notion of dropping the dentry in ->d_delete() on last dput() is
According to the email you linked, a problem arises when a directory has
a great many negative children. Code which walks the list of children
(such as fsnotify) while holding a lock can suffer unpredictable delays
and result in long lock-hold times. So maybe a limit on negative
dentries for any parent is what we really want. That would be clumsy to
implement I imagine.
simple enough but did see regressions (the only other place in the VFS
besides dentry_kill() that the inode is unlinked from the dentry on
dput()). I wonder if the regression was related to the test itself
deliberately recreating deleted files and if that really is normal
behaviour. By itself that should prevent almost all negative dentries
being retained. Although file systems could do this as well (think XFS
inode recycling) it should be reasonable to require it be left to the
VFS.
But even that's not enough given that, in my case, there would still be
around 4 million dentries in the LRU cache and in fsnotify there are
directory child traversals holding the parent i_lock "spinlock" that are
going to cause problems.
The symptom is the "Spinlock held for more than ... seconds" occurring in
the log. So there are certainly a lot of children in the list, but it's
an assumption the ratio of positive to negative entries is roughly the
same as the overall ratio in the dcache.
kernel but in all the cases I remember it was due to negative dentries
accumultating in a particular directory. There are certain apps such as
ElasticSearch which really do like creating huge amounts of negative
dentries in one directory - they use hashes as filenames and use directory
lookup instead of a DB table lookup and lookup lots of non-existent keys...
Umm ... that's a good point, I hadn't paid much attention to ENOENT result
lookups, I'll need to check on the like cycle of those, I think they do get
hashed. That has to be the other source of negative dentries that I've
neglected ...
OK, thanks.Yes, that traversal is what I'm questioning ... again thanks.so why is this traversal even retained in fsnotify?Not sure which traversal you mean but if you set watch on a parent, you
have to walk all children to set PARENT_WATCHED flag so that you don't miss
events on children...
I think the function name is still fsnotify_set_children_dentry_flags()
in recent kernels, the subject of commit 172e422ffea2 I mentioned above.
When you say miss events are you saying that accessing the parent dentry toClose but not quite. The cost is the overhead of dget_parent() in
work out if the child needs to respond to an event is quite expensive in the
overall event processing context, that might make more sense to me ... or do
I completely not yet understand the reasoning behind the need for the flag?
fsnotify_parent() which is often a couple of cache cold loads and atomic
instructions to find out we don't need to send any event for the current
write(2) or read(2) call. It gets worse if there are many IOs happening to
dentries in the same directory from multiple CPUs because instead of
cache-cold loads you get a cacheline contention on the parent.
Parent dentry is of course in memory but often cache cold - you don't needThat sounds like a lot for what should be a memory access of an already inThe parent access is actually more expensive than you might think. Based onBut what if we move dentries to the end of the list when they becomeAnother good question.
negative, and to the start of the list when they become positive? Then
code which walks the child list could simply abort on the first
negative.
I doubt that would be quite as easy as it sounds, but it would at least
be more focused on the observed symptom rather than some whole-system
number which only vaguely correlates with the observed symptom.
Maybe a completely different approach: change children-walking code to
drop and retake the lock (with appropriate validation) periodically.
What too would address the specific symptom.
I have assumed that dropping and re-taking the lock cannot be done but
this is a question I would like answered as well. Dropping and re-taking
lock would require, as Miklos pointed out to me off-list, recording the
list position with say a cursor, introducing unwanted complexity when it
would be better to accept the cost of a single extra access to the parent
flags (which I assume is one reason to set the flag in the child).
experience with past fsnotify related performance regression I expect some
20% performance hit for small tmpfs writes if you add unconditional parent
access to the write path.
memory structure since the parent must be accessed to traverse the list of
child entries. I clearly don't fully understand the implications of what
I'm saying but there has been mention of another context ...
the parent to do e.g. write(2) to an already open file. You seem to be
somewhat confused about the child dentry list traversal (or maybe I'm
misunderstanding) - that happens only when placing the notification mark
but definitely not for each IO operation.
LOL, confusion is a pretty common state of mind for me!
I do get your point though and I am confusing the traversal with other
operations. I think this answers the question I've been asking (maybe
that wasn't obvious) about the reason for the traversal (ie. the reason
to maintain a flag in the child).
While I have looked at the code here I haven't absorbed it and I
definitely don't understand it, your continued patience is appreciated
and will be beneficial when I get time to look at it a bit closer. I
do still need to use a notifications mechanism to match up with Miklos's
statmount implementation to get the full benefit of that in user space,
if I ever get a chance to work on that again.
So it sounds like it would be worth while considering a traversal that's
based on taking a reference on each dentry rather than a spinlock for
the duration. It would be tricky though, for obvious reasons, like
children added during the traversal, added overhead of getting the next
entry reference, etc.
Ian