Re: sysfs: tagged directories not merged completely yet
From: Eric W. Biederman
Date: Mon Oct 13 2008 - 21:20:29 EST
Tejun Heo <tj@xxxxxxxxxx> writes:
> Hello, Greg.
>
> Greg KH wrote:
>> On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
>>> Unless someone will give an example of how having multiple superblocks
>>> sharing inodes is a problem in practice for sysfs and call it good
>>> for 2.6.28. Certainly it shouldn't be an issue if the network namespace
>>> code is compiled out. And it should greatly improve testing of the
>>> network namespace to at least have access to sysfs.
>>
>> But if the network namespace code is in? Then we have problems, right?
>> And that's the whole point here.
>>
>> The fact that you are trying to limit userspace view of in-kernel data
>> structures, based on that specific user, is, in my opinion, crazy.
>
> Well, that's the whole point of all the namespace stuff. If we're
> gonna do namespaces, view of in-kernel data structures need to be
> limited and modified one way or the other.
>
>> Why not just keep all users from seeing sysfs, and then have a user
>> daemon doing something on top of FUSE if you really want to see this
>> kind of stuff.
>
> That sounds nice. Out of ignorance, how is the /proc dealt with?
> Maybe we can have some unified approach for this multiple views of the
> system stuff.
/proc uses just about every trick in the book to make this work.
- /proc/sys uses a magic d_compare method.
- /proc/net becomes a symlink to /proc/<pid>/net, and we get
  completely different directory trees below that. Shortly that code
  will use automounts of a proc_net filesystem, with a different
  super block for each network namespace.
- /proc/sysvipc/* simply returns different values from its files
  depending upon which process is reading them.
- /proc itself has multiple super blocks, one for each pid namespace.
The long term direction is to be able to see everything at once, if
you mount all of the filesystems multiple times in the proper way.
That allows monitoring software to watch what is going on inside of a
container without difficulty, and it makes how much an individual
container sees a user space policy decision.
For sysfs we don't have the option of putting things under
/proc/<pid>: the directories I am interested in (at least for network
devices) are scattered all over sysfs and come and go with device
hotplug events, so I don't see a realistic way of splitting those
directories out into their own filesystem.
From a user interface design perspective I don't see a good
alternative to having /sys/class/net/, /sys/virtual/net/, and all of
the other directories differ based on network namespace, with the
network namespace specified by the super block. Looking at current
and doing the magic d_compare trick almost works, but it runs into
problems with sysfs_get_dentry.
From the perspective of the internal sysfs data structures, tagged
dirents are clean and simple, so I don't see a reason to re-architect
that.
I have spent the last several days looking deeply at what the vfs
can do, and how similar situations are handled. My observations
are:
1) exportfs from nfsd is similar to our kobject to sysfs_dirent layer,
and solves that set of problems cleanly, including remote rename.
So there is no fundamental reason we need inverted, twisted locking
in sysfs, or to otherwise violate existing vfs rules.
2) i_mutex seems to protect very little, if anything, that we care about.
The dcache has its own set of locks, so we may be able to completely
avoid taking i_mutex in sysfs and simplify things enormously.
Currently I believe we are very similar to ocfs2 in terms of locking
requirements.
3) i_notify and d_notify seem to require pinning the inode
or the dentry in question, so I see no reason why a d_revalidate
style of update would have problems.
4) For finer locking granularity of readdir, all we need to do is
the semi-expensive restart for each dirent, and the problem is
trivially solved.
5) Large directories are a potential performance problem in sysfs.
So it appears that the path forward is:
- Cleanup sysfs locking and other issues.
- Return to the network namespace code.
Possibly with an intermediate step of only showing the network
devices in the initial network namespace in sysfs.
> Can somebody hammer the big picture regarding namespaces into my
> small head?
100,000 foot view. A namespace introduces a scope so multiple
objects can have the same name. Like network devices.
10,000 foot view. The network namespace looks to user space
as if the kernel has multiple independent network stacks.
1000 foot view. I have two network devices named lo, and sysfs
does not currently have a place for me to put them.
Leakage, and being able to fool an application into thinking it has
the entire kernel to itself, are not concerns. The goal is simply to
cover the entire object-name-to-object translation boundary, and then
the namespace work is done. We have largely achieved that, and once
complete the code is reasonable enough that maintaining it should be
no worse than dealing with any other kernel bug.
Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/