Re: sysfs: tagged directories not merged completely yet

From: Daniel Lezcano
Date: Tue Oct 07 2008 - 05:01:52 EST


Al Viro wrote:
On Tue, Sep 23, 2008 at 11:23:57AM -0700, Eric W. Biederman wrote:
Benjamin Thery <benjamin.thery@xxxxxxxx> writes:
Oh.
It's a pity Al couldn't re-review them before. We've already lost a lot
of time with this patchset and it's blocking easier testing of network
namespaces (right now, with a mainline kernel, we have to disable sysfs
to build network namespaces).
I am confident that we have a good base with these patches and the rest of
the work can be done incrementally on top of them if any issues turn up.

Al recent rework of sysctl has a very similar structure.

No, it does not. My apologies for delay, but here are more printable parts
of review.

First of all, this stuff breaks just about every damn integrity constraint VFS
knows of. It tries to tiptoe through the resulting minefield - without
success. So the first group of comments will be of "you *really* don't
do $FOO" variety. I'm very far from being convinced that we want to
special-case in VFS every kind of weirdness sysfs happens to do; in effect,
that would require adding a FS_IS_SYSFS_SO_BEND_OVER fs type flag and making
a lot of locking conditional on that.

a) You do *not* share struct inode between the superblocks, for fsck sake!
b) You do *not* grab i_mutex on ancestors after having grabbed it on
file, as sysfs_chmod_file() does.
c) You do *not* change dentry tree topology without s_vfs_rename_mutex on
affected superblock. That, BTW, is broken in mainline sysfs as well.
d) You REALLY, REALLY do not unhash busy dentries of directories.

In addition to that, there are interesting internal problems:
* inumbers are released by final sysfs_put(); that can happen before the
final iput() on corresponding inode. Guess what happens if new entry is
created in the meanwhile, reuses the same inumber and lookup gets to
sysfs_get_inode() on it?
* may I politely suggest that
again:
mutex_lock(&A);
if (!mutex_trylock(&B)) {
mutex_unlock(&A);
goto again;
}
is somewhat, er, deficient way to deal with buggered locking hierarchy?
Not to mention anything else, that's obviously FUBAR on UP box - if we
have B contended, we've just killed the box dead since we'll never give
the CPU back to whoever happens to hold B. See sysfs_mv_dir() for a lovely
example.
* sysfs_count_nlink() is called from sysfs_fill_super() without sysfs_mutex;
now this sucker can get called at any moment.
* just what is protecting ->s_tag?
* __sysfs_remove_dir() appears to assume that subdirectories are possible;
AFAICS, if we *do* get them, we get very screwed after remove_dir().
* everything else aside, the internal locking is extremely heavy. For
fsck sake, guys, a single system-wide mutex that can be grabbed for the
duration of readdir on any directory and block just about anything
in the filesystem? Just mmap() something over NFS on a slow link and
do getdents() to such buffer. Watch a *lot* of stuff getting buggered.
Hell, you can't even do ifconfig up while that sucker is held...

Seriously, people, it's getting worse than devfs had ever been ;-/

Thank you Al for reviewing the patchset.

-- Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/