Hi there all.
I have been watching the devfs debate for the last year or so with
interest and amusement ... and occasional boredom:-)
I must admit that I have sympathies for both sides. I think something
is definitely needed, and devfs certainly solves some problems, but
is it the "best" solution...?...
Or, put another way, I'm sure that it can be made to work, but I
suspect there could be a better way of doing it.
Possibly more informative than the content of the argument is the
fact of the argument. The fact that (apparently) intelligent and
rational people cannot agree - or even agree to differ - about a
technical issue seems very significant.
Based on some years of observing and experiencing human behaviour, my
view is that when two people cannot agree on something, it is almost
always because they are failing to communicate. Either they don't
correctly interpret the words that the other is using or (more
commonly) they are working from different sets of basic assumptions -
they have different axioms. If you can pinpoint this failure to
communicate - if you can identify the item of information that one is
assuming and the other is not (or is assuming to be different), then
you can usually resolve the issue.
Obviously I believe there is a communication failure going on -- a
common understanding that is missing. In this mail item I hope to
expose an important issue that seems to be being glossed over, and
then to elaborate the implications of this issue. You may still not
like the resulting proposal, but I will have succeeded if people have
a clearer idea of what the divisive issues really are.
Like most important truths, the "important issue" can be stated in a
simple sentence (cf John 3:16) but can benefit from substantial
elaboration (cf the rest of the Bible).
My one sentence summary:
Device special files are *not* devices, they are gateways to
devices.
Before I embark on the elaboration, it might help to identify some
particular issues that seem to have caused particular disagreement. I
believe that the approach discussed below answers all of these issues
to some degree. I'll let you be the judge.
1/ persistence of permissions on device files - not trivial when
device files are not persistent. Several solutions have been
discussed with no clear agreement.
2/ /dev in a chroot gaol. This requires a /dev which is the same
as, but different to, the "real" /dev.
3/ 16 bit device numbers are too small. Do we enlarge them?
Deprecate them? If so, how?
4/ Where and how is devfs mounted? /dev? /devices? at the same time
as /? at the same time as /proc?
5/ The choice of names of things in devfs - the Linus imposed scheme
vs the original scheme.
The rest of the memo comprises:
A: A discussion of what device special files really are.
B: A brief outline of what (I think) I would like a device
filetree to look like.
C: A new construct to carry device special files into the next
century.
D: Some notes on backward compatibility.
E: Some closing comments.
F: My signature :-)
A: A discussion of what device special files really are.
You probably mostly know all this, but it needs to be said up front
to make sure that we are communicating.
In *traditional* Unix (I'm thinking Edition 7 Unix from Bell Labs
specifically, though not that much has changed), devices were
addressed by a static, 3.5 level numeric hierarchy.
The three levels of this hierarchy are:
1/ Block or Character devices (== fronted by buffer cache or not)
2/ Major device number - Each number identified a driver
3/ Minor device number - identified a particular device
managed (driven?) by that driver.
The extra 0.5 level of hierarchy came because some devices had
sub-components: disc drives had partitions. Tape drives had
rewind-on-close behaviour and no-rewind-on-close behaviour. This
resulted in some drivers splitting the bits in the minor number up
into a device identifier and some extra bits to indicate how to
interact with the device.
Clearly, the three level hierarchy was already limiting back then.
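To make the numeric hierarchy concrete, here is a minimal Python sketch of the classic 16-bit device number, with the major number in the high byte and the minor number in the low byte. The tape layout (top bit of the minor meaning "no rewind on close") and the names split_dev, make_dev and split_tape_minor are assumptions made for illustration, not the layout of any particular driver.

```python
# Classic 16-bit device number: high byte = major, low byte = minor.

def split_dev(dev):
    """Split a 16-bit device number into (major, minor)."""
    return (dev >> 8) & 0xFF, dev & 0xFF


def make_dev(major, minor):
    """Pack a major and a minor number into a 16-bit device number."""
    return ((major & 0xFF) << 8) | (minor & 0xFF)


# Hypothetical "extra bit" carved out of the minor number by a tape driver.
NO_REWIND = 0x80


def split_tape_minor(minor):
    """Return (unit, rewind_on_close) for the hypothetical tape layout."""
    return minor & 0x7F, not (minor & NO_REWIND)


if __name__ == "__main__":
    major, minor = split_dev(make_dev(9, 0x81))
    print(major, minor, split_tape_minor(minor))
```

With only eight bits of minor number, every bit spent on "how to interact" halves the number of units a driver can address, which is the limitation referred to above.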
That is the device hierarchy. Unix wanted to (and this is one of
the great strengths of Unix) access the device hierarchy from the
filesystem. Hence the device-special files.
A device-special file has ownership and Access Control*, and
identifies a particular device. (note that devices in general have
no ownership and no access control, though some drivers might
restrict some ioctls (e.g. format-disc) to root).
The semantic of the device-special file is that if the ACL gives you
some particular access, you have that access to the referenced
device, rather than access to the device special file
itself. (i.e. write access doesn't mean that you can change the
device-special file, only that you can write to the device).
Only root can create device-special files.
Hopefully this explains what I mean when I say that device special
files are only gateways to devices, not the devices themselves.
It is worth noting that it is quite possible to create two different
device special files which refer to the same device, but have
different owners and access control. I haven't ever come across a
case where this is useful, but it does help highlight the
difference.
* When I refer to Access Control or ACL, I am primarily talking about
standard Unix ugo+rwx access control, but any other ACL scheme could
be used equally well.
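The "gateway" idea can be modelled in a few lines of Python. This is only a toy model, assuming a simplified owner/other permission check (no groups); the DeviceNode class, the may_access function and the two path names are hypothetical. It shows two device special files that refer to the same device but grant access to different users.

```python
from dataclasses import dataclass


@dataclass
class DeviceNode:
    """A device special file: an owner, mode bits, and a pointer to a device."""
    owner: str
    mode: int       # only the "owner" and "other" rwx bits are modelled
    device: tuple   # (kind, major, minor)


def may_access(node, user, want):
    """Check access through the node. want is 4 (read) or 2 (write)."""
    if user == "root":
        return True
    bits = (node.mode >> 6) & 7 if user == node.owner else node.mode & 7
    return bool(bits & want)


# One device, two gateways with different ownership (illustrative names).
disc = ("block", 8, 0)
nodes = {
    "/dev/sda": DeviceNode("root", 0o600, disc),
    "/dev/alice-disc": DeviceNode("alice", 0o600, disc),
}

if __name__ == "__main__":
    for path, node in nodes.items():
        print(path, may_access(node, "alice", 2))
```

The permission bits live on the gateway, not on the device: the same disc is writable by alice through one node and closed to her through the other.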
B: A brief outline of what (I think) I would like a device
filetree to look like.
The traditional Unix device tree is clearly limiting. There are two
particular aspects that are limiting.
1/ 3.5 level hierarchy is too rigid.
2/ numeric identifiers are hard to manage, and not human-friendly.
The "obvious" response to this is to have a hierarchy that looks
like a filesystem - with textual names for elements and arbitrarily
many levels as suit particular types of devices - and this is what
devfs does.
My reason for proposing something different to the current devfs
structure is that I am coming to the problem with different
priorities. devfs seems to want to copy the traditional layout of
/dev, and with good reason. I have no desire to mimic that, but
instead a desire to mimic the 3-level hierarchy of device numbers -
but take it a bit further.
This is far from a complete proposal and is intended to be largely
indicative of what is possible rather than prescriptive of what
should be done.
Also, my knowledge of device technologies and nomenclature is
fairly superficial, so please excuse me if I say something silly.
The approach I would take to the device hierarchy is to have a
controller/instance/function hierarchy where a "function" may be a
"controller" of a different sort and so would have a symlink back up to
the top.
Taking for example my PCI based AHA2950 dual port SCSI controller
with several discs on one port and a DLT library on the other, it
might look something like:
  pci/                  directory containing all pci busses
  pci/0/                directory containing pci buss 0.
                        Presumably PCI controllers have some sort
                        of address (IO port? memory?) so this might
                        be pci/0x880000000/ or some such.
  pci/0/2/              directory of information about device 2 on
                        pci buss 0. This '2' has a physical
                        meaning, it is not a sequential number.
                        Possibly this could be pci/0/device/2 so
                        that other information could go in pci/0
                        and not get confused with devices.
  pci/0/2/vendor        file containing vendor id. Yes, procfs-like
                        stuff goes here too.
  pci/0/2/deviceid      file containing deviceid. You might prefer
                        vendor and deviceid to be one file.. so might I.
  pci/0/2/function/     directory containing the different
                        functions of the device.
  pci/0/2/function/0 -> ?scsi/0
                        a symlink saying that function 0 is the
                        first scsi buss found.
                        The '?' means that something should go here
                        to say "go back to top of device
                        hierarchy". It could be a sequence of "../"s,
                        or possibly something else. See later.
  pci/0/2/function/1 -> ?scsi/1
                        a symlink identifying that this is the
                        second scsi buss.
  scsi/                 directory containing all scsi busses
  scsi/0                first scsi buss. The '0' is a purely
                        sequential number; it has no external
                        meaning.
  scsi/0/master -> ?pci/0/2/function/0
                        A symlink so that you can find out where
                        this scsi controller exists. "master"
                        probably isn't a good name.
  scsi/0/2/             A directory with information about device 2
                        on scsi buss 0. In this case a disc drive.
                        Again, possibly scsi/0/device/2
  scsi/0/2/function/0 -> ?disc/1
                        This is a disc drive (though yours might
                        be a disk drive:-). It is drive 1. (drive 0
                        is probably ide/0/0 which is pci/0/1 ...)
  scsi/0/3/function/0 -> ?disc/2
  scsi/0/4/function/0 -> ?disc/3
                        more discs.
  scsi/1/0/function/0 -> ?tape/0
                        LUN0 of device 0 on this scsi buss is a
                        tape drive.
  scsi/1/0/function/1 -> ?changer/0
                        LUN1 is an auto-changer for changing tapes.
  disc/                 directory for all disc drives. I think I
                        would include all ide and scsi
                        non-removable magnetic drives here. There
                        would be separate cdrom/ floppy/. I'm not
                        sure where ZIP drives would go.
                        Possibly the interesting distinction is
                        removable/non-removable....
  disc/1/               Information about disc 1
  disc/1/master -> ?scsi/0/2/function/0
  disc/1/device         This *is* the device. You read/write this
                        to access the disc drive. It might look (to stat())
                        like a device special file, or it might
                        look like a named pipe or a socket.
  disc/1/partition -> ?partition/1
                        Where to find the partitions.
                        I might be taking the levels of indirection
                        too far here. A disc/1/partition/ directory
                        might be better.
  partition/1/          Partitions on the second partitioned device
  partition/1/style     file containing the word "msdos\n"
  partition/1/master -> ?disc/1
  partition/1/table     raw partition table in format according to
                        "style"
  partition/1/1/part    partition 1. This is the real partition.
                        You can open/read/write/mount this. It
                        might look (to stat) like a device special
                        file or a socket or ...
  partition/1/2/part    partition 2
  partition/1/2/partition -> ?partition/2
                        Partition two has "extended" partitions in
                        it! Again, I might be taking indirection
                        too far.
I think (hope) that you get the idea. The device tree reflects the
physical organisation of devices where possible, and allows for
"virtual" devices to help flatten the hierarchy. The tree contains
not only devices, but also information about devices such as is
often found in /proc.
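The cross-links in the tree above can be exercised with a small Python model. This is a sketch under stated assumptions: the tree is a flat dictionary keyed by path from the top of the device hierarchy, a value beginning with "?" is a link resolved from the top again, and the TREE table and resolve function are hypothetical names, not part of any existing interface.

```python
# A toy model of the proposed tree. Only a few entries from the example
# are included; "<device>" stands for an actual device object.
TREE = {
    "pci/0/2/function/0": "?scsi/0",
    "scsi/0/master": "?pci/0/2/function/0",
    "scsi/0/2/function/0": "?disc/1",
    "disc/1/master": "?scsi/0/2/function/0",
    "disc/1/device": "<device>",
}


def resolve(path, max_hops=8):
    """Follow "?" links, component by component, until none remain."""
    for _ in range(max_hops):
        parts = path.split("/")
        for i in range(1, len(parts) + 1):
            prefix = "/".join(parts[:i])
            target = TREE.get(prefix)
            if isinstance(target, str) and target.startswith("?"):
                # Restart from the top of the hierarchy with the link target.
                path = "/".join([target[1:]] + parts[i:])
                break
        else:
            return path  # no link found anywhere along the path
    raise RuntimeError("too many levels of links")


if __name__ == "__main__":
    print(resolve("pci/0/2/function/0/2/function/0/device"))
```

Starting from the physical address of the controller, the walk passes through scsi/0 and ends at disc/1/device, so the same device is reachable both by where it is plugged in and by what it does.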
The ownership/access control on things within the tree is minimal
and (mostly) not changeable. Almost everything is owned by root. An
exception might be slave ptys: those created by a particular user
are owned by that user and can have their permissions changed.
Access control is mostly wide open (you will see why later). Some
things might only be writable by root. Directories are read-only.
We already have some things in the tree that are not physical
devices - partitions. They are really a layer of interpretation
on top of the data in the device. There are other interpretations
that we put on top of data, and they should be reflected in the
tree.
  filesystem/ (maybe fs/)
                        directory of all filesystems
  filesystem/ext2       sub-dir of ext2 file systems
  filesystem/ext2/some-long-hex-uuid/
                        a particular filesystem. If there are uuid
                        conflicts, or if the filesystem doesn't
                        support uuids, some sequential number would
                        be used.
  filesystem/ext2/some-long-hex-uuid/dev -> ?disc/1/device
  filesystem/ext2/some-long-hex-uuid/fs/
                        The actual filesystem appears to be mounted
                        here, as well as at /usr (or wherever)
                        thanks to Al Viro's new vfs mounting stuff.
  filesystem/ext2/some-long-hex-uuid/mountpoints
                        a file listing current mount points, or
                        maybe a directory full of symlinks to the
                        mount points.
  md/other-hex-uuid/
                        directory with assorted stuff about an md
                        device. There would be a link to the
                        actual device in disc/ and a superblock and
                        more.
  etc.
I understand that Richard is already planning stuff like this for
devfs - /dev/volumes I believe.
Just to bring you back to where we are up to, this hierarchy is NOT
meant to replace /dev. It replaces the block-or-char/major/minor
hierarchy. Like that hierarchy, it has little in the way of access
control.
Though the bc/major/minor hierarchy is not directly accessible from
the filesystem, it would be nice if this hierarchy were. We could
mount it somewhere like /devices. However I would prefer it to go
somewhere like //devices or //linux/devices. It wouldn't get
mounted there. It would simply always be there, much as / is always
there and the bc/major/minor hierarchy is always ... wherever it
is.
It is true that linux doesn't currently differentiate // from /, but
POSIX allows us to, and there has been talk about going that way. We
can keep that as a long term goal, and mount it in /proc/devices or
similar for now.
Also, as the devices in this tree have wide open permissions, if we
mount it, we must make sure that the root directory is closed -
owner root, permission 700. Probably access through the root should
require CAP_MKNOD as well so that it can safely be visible in chroot
gaols (presumably chroot changes / but not //).
C: A new construct to carry device special files into the next century.
We have a new device tree, but what good is it if no-one can access
it? That is equally true of the old bc/major/minor device
hierarchy. We need gateways into it.
To follow the pattern of device special files as outlined above, we
need some sort of filesystem object which has ownership and access
control, and contains a pointer to a device. However in this case
the pointer to the device is not a pair of numbers but is a textual
name. Sounds like a job for symlinks to me.
Obviously symlinks as they are don't cut it, but there are three
bits (setuid, setgid, sticky) that we can use to enhance symlinks -
and Unix has a (murky?) history of using these bits ... creatively.
Let me propose that a symlink with, say, the setuid bit gets treated
differently to symlinks, and somewhat like device special files.
- chmod/chown on such a symlink (let's call it a devlink) applies to
the symlink itself, and not to the target of the link.
- If we decide that the device tree doesn't get mounted, then we
could assert that such a devlink always points into the device
tree instead of the filesystem, but I would prefer to mount the
device tree at //devices, and have all the devlinks start with
//devices.
- accessing a devlink (open) provides access to the referenced
object, and has access permissions checked based on the ACL of the
devlink.
- only root (CAP_MKNOD) can create devlinks, or set the setuid bit
on a symlink.
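The rules in the list above can be sketched in Python on plain mode values, without touching a real filesystem. This is a model only: the functions is_devlink and open_devlink are hypothetical, the permission check is simplified to owner/other, and uid 0 stands in for a holder of CAP_MKNOD-style privilege.

```python
import stat


def is_devlink(mode):
    """A devlink, as proposed here, is a symlink with the setuid bit set."""
    return stat.S_ISLNK(mode) and bool(mode & stat.S_ISUID)


def open_devlink(link_mode, link_uid, target, uid, want):
    """Return the target if the devlink's own owner/mode allow the access.

    want is 4 (read) or 2 (write). The check uses the permissions of the
    link itself, never those of the object it points to.
    """
    if not is_devlink(link_mode):
        raise ValueError("not a devlink")
    if uid != 0:
        shift = 6 if uid == link_uid else 0
        if not (link_mode >> shift) & want:
            raise PermissionError(target)
    return target


if __name__ == "__main__":
    mode = stat.S_IFLNK | stat.S_ISUID | 0o600
    print(open_devlink(mode, 1000, "//devices/camera/xxx", 1000, 4))
```

An ordinary symlink (no setuid bit) is rejected by this model, which reflects the intent that only devlinks act as gateways.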
This provides essentially the functionality of device special
files, but with a more flexible hierarchy for devices. It allows us
to give away access to specific devices to specific users, and to
have this access stored in a normal filesystem and to be persistent
in the way that /dev normally is.
As devices appear in multiple places in the device tree under
multiple identities (by phys address, by uuid, by function) we can
(hopefully) give away the access that we really want to give away.
e.g. //devices/camera/xxx gives you access to digital camera with
uuid xxx no matter which buss it gets connected on.
However, this structure may be a bit too limiting. Suppose that
rather than giving away access to a specific device, I want to give
away access to a directory full of devices. e.g. You can have
access to any digital camera that gets plugged in. I really want to
be able to have a devlink that points to a directory. What does
that mean? In particular, how is the ACL carried along if I chdir
through a devlink, and how am I prevented from using ".." to walk
all over the device tree?
The abstraction that seems to work best here is a "mount". When I
access a devlink, particularly one to a directory, I want the
directory to effectively be mounted on the symlink ... and with Al's
new mount stuff there is no problem mounting different bits in
different places, possibly mounting the one directory in several
places. This provides control of "..", but what does it do for
preserving the ACL?
Here I think we need one more bit of magic.
Every object in the device tree will have ownership/ACL, though the
ACLs will be chosen from a fairly limited set and almost everything
will be owned by root. However, some objects, particularly devices,
will have a sticky bit set. In the device filesystem, the sticky bit
will have a special meaning. It means "use the owner/acl of the
mountpoint". The mountpoint is always available through the
vfsmount structure so getting hold of this should be quite easy.
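The sticky-bit rule can be stated as a single function. The following Python sketch assumes a simplified permission record of just (uid, mode); the names Perm and effective_perm are hypothetical and nothing here touches a real vfsmount.

```python
import stat
from collections import namedtuple

# Simplified ownership/permission record for an object or a mountpoint.
Perm = namedtuple("Perm", "uid mode")


def effective_perm(obj, mountpoint):
    """Sticky objects in the device tree borrow owner/mode from the mountpoint."""
    if obj.mode & stat.S_ISVTX:
        return Perm(mountpoint.uid, mountpoint.mode & 0o777)
    return Perm(obj.uid, obj.mode & 0o777)


if __name__ == "__main__":
    device = Perm(0, stat.S_ISVTX | 0o666)   # a device: sticky, wide open
    info = Perm(0, 0o444)                    # an information file: not sticky
    devlink = Perm(1000, 0o600)              # the devlink acting as mountpoint
    print(effective_perm(device, devlink), effective_perm(info, devlink))
```

In this model the device inherits the owner and mode of the devlink it was reached through, while the non-sticky information file keeps its own root-owned, read-only permissions.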
One thing that this doesn't answer is how symlinks inside the device
filesystem get treated when you have only mounted part of it.
Possibly these symlinks need to be devlinks as well. I haven't
completely resolved this issue for myself, but I don't think it is
insurmountable.
So, in summary, we have
- devlinks which are symlinks with setuid bit set.
- chown/chmod affect devlinks directly, not the target like
with symlinks.
- accessing a devlink does some sort of magic mount
- in the device filesystem, "sticky" objects get their permissions
from the mountpoint.
- only root(CAP_MKNOD) can make devlinks.
It might actually be nice to use devlinks more generally:
/usr -> //devices/filesystem/ext2/long-hex-uuid/fs
obviates some of the need for /etc/fstab.
D: Some notes on backward compatibility.
But what about major/minor numbers? Some applications (tar?) still
require them. And we need a clean transition from the old to the
new. My old /dev must still work while I am transitioning from the
old style to the new.
Making old device special files still work simply means providing a
mapping from bc/major/minor to //device/path. This could be
encoded into the kernel, or could be provided by making a directory
tree:
/oldevices/char/1,1 -> //proc/mem
/oldevices/char/5,0 -> //proc/self/controlling-tty
/oldevices/block/3,0 -> //devices/disc/0/device
....
and telling the kernel to look up this directory to resolve device
special files. Possibly a combination - the kernel "knows" about
many common things and goes to /oldevices for what it doesn't
understand - would be best. As it is a transition, speed shouldn't
be too important and caching could make up for any lack.
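The combined lookup (a table the kernel "knows", then the /oldevices directory, with a cache in front) might look like the following Python sketch. The directory is modelled as a dictionary holding the example entries from above; BUILTIN, OLDEVICES, parse_entry and lookup are hypothetical names.

```python
# Mappings the kernel is assumed to know without consulting the directory.
BUILTIN = {("block", 3, 0): "//devices/disc/0/device"}

# Model of the /oldevices directory: entry name -> link target.
OLDEVICES = {
    "char/1,1": "//proc/mem",
    "char/5,0": "//proc/self/controlling-tty",
}

_cache = {}


def parse_entry(name):
    """Turn an entry name such as "char/1,1" into ("char", 1, 1)."""
    kind, numbers = name.split("/")
    major, minor = numbers.split(",")
    return kind, int(major), int(minor)


def lookup(kind, major, minor):
    """Resolve an old-style device number to a path, or None if unknown."""
    key = (kind, major, minor)
    if key in BUILTIN:
        return BUILTIN[key]
    if key not in _cache:
        # Slow path: consult the directory once, then remember the answer.
        _cache[key] = OLDEVICES.get(f"{kind}/{major},{minor}")
    return _cache[key]


if __name__ == "__main__":
    print(lookup("char", 5, 0), lookup("block", 3, 0))
```

Negative answers are cached here too; whether that is desirable is an open design choice, since a later change to the directory would not be seen without invalidating the cache.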
Providing major/minor numbers for programs that really want them is
not so straightforward. The best solution probably depends on what
real problems there turn out to be.
One idea would be that "mknod" on an existing "device" object would
set the major/minor numbers of that device, and some boot-time script
does the mknod for those devices that really need it.
Possibly devices which haven't been 'mknod'ed just appear to have
some automatically allocated unique major/minor number from an
unused number space. The number gets allocated the first time the
device is used.
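Both ideas (explicit numbers set by mknod, and numbers allocated on first use) fit in one small allocator. The Python sketch below is illustrative only: the DevNumbers class is hypothetical, the reserved major 240 is an arbitrary placeholder for "an unused number space", and collisions with mknod'ed numbers and minor overflow are not handled.

```python
class DevNumbers:
    """Hand out (major, minor) pairs: explicit via mknod, otherwise on first use."""

    def __init__(self, auto_major=240):  # 240 is an arbitrary placeholder
        self.auto_major = auto_major
        self.next_minor = 0
        self.assigned = {}

    def mknod(self, path, major, minor):
        """Pin a device object to specific numbers, e.g. from a boot script."""
        self.assigned[path] = (major, minor)

    def numbers_for(self, path):
        """Return the numbers for a device, allocating them the first time."""
        if path not in self.assigned:
            self.assigned[path] = (self.auto_major, self.next_minor)
            self.next_minor += 1
        return self.assigned[path]


if __name__ == "__main__":
    numbers = DevNumbers()
    numbers.mknod("disc/0/device", 3, 0)
    print(numbers.numbers_for("disc/0/device"))
    print(numbers.numbers_for("camera/xxx/device"))
```

Automatically allocated numbers are stable only for as long as the allocator's state lives, so a program that stores them across reboots would still need the mknod route.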
E: Some closing comments.
As said, this proposal is very incomplete.
- The naming structure needs lots of thought by someone with lots
of relevant experience.
- Maybe devices should appear as device special files with
sequentially assigned numbers, or maybe they should appear to be
sockets.
- are the links in the device filesystem devlinks or symlinks, and
how costly would it be to have literally hundreds of little
mounts of the device filesystem?
- complete semantics of devlinks need to be worked out - e.g. when
something gets mounted on a devlink, can you still see the link,
and how do you chown/chmod it.
- there is no code
- what happens when a name is accessed in the device filesystem
that doesn't exist? Do we do a kmod callback or an autofs style
upcall? Do we cache negative responses?
- more..
There would still be a place for a devfsd like program, particularly
for taking arbitrary action on hot-plug/unplug events, though
probably for other things too. I would hope that the system could
still work reasonably well without devfsd running though.
This change would not be optional in the way that devfs is, as it
substantially changes the way devices are handled. This would imply
that we would
want to be able to have a stripped down device filesystem for use in
embedded systems. How much stripping down would depend on how much
code it actually took to implement.
One issue that this proposal doesn't directly resolve is allowing
changes to permissions in /dev on a read-only root filesystem.
However it doesn't provide for file creation in /tmp on a read-only
root filesystem either:-) I like the 'copy /dev to a tmpfs, and then mount
that on /dev' approach. I'm sure that other approaches are possible
and could be "better", but I think the issue is separate from the
issue of how to represent devices.
F: My signature :-)
NeilBrown
This archive was generated by hypermail 2b29 : Mon May 15 2000 - 21:00:17 EST