Re: [PATCH] proc/sysctl: drop unregistered stale dentries as soon as possible

From: Al Viro
Date: Wed Feb 08 2017 - 22:54:09 EST

On Wed, Feb 08, 2017 at 01:48:04PM -0800, Andrew Morton wrote:

> > This patch detects stale dentry in proc_sys_compare and pretends that
> > it has matching name - revalidation will kill it and lookup restarts.
> > As a result each stale dentry will be seen only once and will not
> > contaminate hash endlessly.
> >
> What are "stale" dentries? Unused dentries? If so, why doesn't the
> creation of a new dentry immediately invalidate the old dentry with a
> matching path? What do other filesystems do to prevent this issue?

The whole point is that it's *NOT* a matching path. Currently ->d_compare()
for /proc/sys tries to make sure that sysctl getting unregistered means
that no extra references will be added to dentries of the stuff we are
trying to kick out. If it's getting unregistered, ->d_compare() won't be
seeing them and from that point on dentry refcount can only go down -
no new lookups will increase it.

This kludge tries to have _any_ lookup in the same hash chain pick the
first dentry of such stuff, no matter what name/parent it has. Then
it relies upon ->d_revalidate() refusing to accept that sucker, so that
it gets unhashed and we (hopefully) repeat the lookup.

This is complete garbage. Lookups won't be repeated indefinitely -
if there are several such dentries in the hash chain we search, syscall
will end up failing with ESTALE on thus buggered ->d_compare(), even though
none of those dentries are anywhere near the path we are trying to resolve.
No other filesystem attempts that kind of insanity, and for a good reason.

The problem it tries to address is that sysctl unregistration doesn't
unhash the now-stale dentries. Before the unregistration we kept them
even with refcount 0, until memory pressure evicts the suckers. After
unregistration we make sure that refcount reaching 0 will cause the
instant eviction. The problem is with the case when they had refcount 0
to start with. Then the eviction rule does not get triggered - it would have
happened when dropping the last reference, but we don't have any.

The kludge proposed in that patch is nowhere near being a sane way to deal
with that. Having ->d_compare() notice such dentries and quietly kick
them out would be borderline saner, but
* it's a potentially blocking operation and ->d_compare() is called
in non-blocking contexts, including deep under rcu_read_lock().
* it's done when walking a hash chain; having that chain modified
by ->d_compare() itself would require some modifications of callers and
those are very hot codepaths.

I agree that the problem is real, but this is no way to deal with it.
What we want is something along the lines of d_prune_aliases() done for all
inodes corresponding to given sysctl. Done just before erase_header()
in start_unregistering(). That would require maintaining the list of
inodes in question (e.g. anchored in ctl_table_header) and a bit of care
in traversing it (use of igrab(), etc.)

In the current form - NAK. Sorry.