Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper

From: Jan Kara

Date: Mon May 18 2026 - 04:24:31 EST

Hi Ian,

On Mon 18-05-26 10:55:43, Ian Kent wrote:
> On 18/5/26 07:55, NeilBrown wrote:
> > On Fri, 15 May 2026, Horst Birthelmer wrote:
> > According to the email you linked, a problem arises when a directory has
> > a great many negative children. Code which walks the list of children
> > (such as fsnotify) while holding a lock can suffer unpredictable delays
> > and result in long lock-hold times. So maybe a limit on negative
> > dentries for any parent is what we really want. That would be clumsy to
> > implement I imagine.
>
> But the notion of dropping the dentry in ->d_delete() on last dput() is
> simple enough but did see regressions (the only other place in the VFS
> besides dentry_kill() that the inode is unlinked from the dentry on
> dput()). I wonder if the regression was related to the test itself
> deliberately recreating deleted files and if that really is normal
> behaviour. By itself that should prevent almost all negative dentries
> being retained. Although file systems could do this as well (think XFS
> inode recycling) it should be reasonable to require it be left to the
> VFS.
>
> But even that's not enough given that, in my case, there would still be
> around 4 million dentries in the LRU cache and in fsnotify there are
> directory child traversals holding the parent i_lock "spinlock" that are
> going to cause problems.

Do you mean there are very many positive children of a directory?

> That's all that much more puzzling when I see things like commit
> 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily") which looks
> like it implies the child flag depends entirely on the parent state (what
> am I missing Amir?)

PARENT_WATCHED dentry flags (as the name suggests) are only caching the
information whether the parent has notification marks receiving events from
the child. So yes, the flag fully depends on the parent state.

> so why is this traversal even retained in fsnotify?

Not sure which traversal you mean but if you set watch on a parent, you
have to walk all children to set PARENT_WATCHED flag so that you don't miss
events on children...

> > But what if we move dentries to the end of the list when they become
> > negative, and to the start of the list when they become positive? Then
> > code which walks the child list could simply abort on the first
> > negative.
> >
> > I doubt that would be quite as easy as it sounds, but it would at least
> > be more focused on the observed symptom rather than some whole-system
> > number which only vaguely correlates with the observed symptom.
> >
> > Maybe a completely different approach: change children-walking code to
> > drop and retake the lock (with appropriate validation) periodically.
> > What too would address the specific symptom.
>
> Another good question.
>
> I have assumed that dropping and re-taking the lock cannot be done but
> this is a question I would like answered as well. Dropping and re-taking
> lock would require, as Miklos pointed out to me off-list, recording the
> list position with say a cursor, introducing unwanted complexity when it
> would be better to accept the cost of a single extra access to the parent
> flags (which I assume is one reason to set the flag in the child).

The parent access is actually more expensive than you might think. Based on
experience with past fsnotify related performance regression I expect some
20% performance hit for small tmpfs writes if you add unconditional parent
access to the write path.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR