2nd draft README for devfs (long)

Richard Gooch (rgooch@atnf.CSIRO.AU)
Fri, 9 Jan 1998 17:02:46 +1100


Hi, all. After reading the various messages flying around on this
topic and doing some more thinking, I've updated the justification
section of my README for devfs.

Basically, I think the days of major and minor numbers are numbered
(pun unintended). I don't mean 8 bit major&minors are doomed. I mean
the whole concept is doomed. Increasing these to 16 bits each is merely
a kludge and, what's worse, it doesn't scale. Either you chew heaps of
RAM or you scan lists. See below for more detail.

Also, for those interested in seeing what the interface currently
looks like, I've appended an extract from the source. Soon I will be
posting a kernel patch that people can start playing with.

Regards,

Richard....
===============================================================================
void *dev_register (unsigned int major, unsigned int minor,
                    umode_t mode, uid_t uid, gid_t gid,
                    const char *name, unsigned int namelen,
                    struct file_operations *fops, int auto_owner)
/* [SUMMARY] Register a device entry.
<major> The major number. Not needed for regular files.
<minor> The minor number. Not needed for regular files.
<mode> The default file mode.
<uid> The default UID of the file.
<gid> The default GID of the file.
<name> The name of the entry.
<namelen> The number of characters in <<name>>.
<fops> The file_operations structure. This must not be externally
deallocated.
<auto_owner> If TRUE, then when a closed inode is opened the ownerships are
set to those of the opening process and the protection is set to that given
in <<mode>>. When the inode is closed, ownership reverts back to <<uid>> and
<<gid>> and the protection is set to read-write for all.
[RETURNS] A handle which may later be used in a call to [<dev_unregister>].
On failure NULL is returned.
*/

void dev_unregister (void *handle, const char *name, unsigned int namelen,
                     unsigned int major, unsigned int minor)
/* [SUMMARY] Unregister a device entry.
<handle> A handle previously created by [<dev_register>]. If this is NULL
then the list of devices must be searched.
<name> The name of the entry. This is ignored if <<handle>> is not NULL.
<namelen> The number of characters in <<name>>.
<major> The major number. This is used if <<handle>> and <<name>> are NULL.
<minor> The minor number. This is used if <<handle>> and <<name>> are NULL.
[RETURNS] Nothing.
*/
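
To make the lookup order in dev_unregister concrete, here is a minimal
user-space sketch. It is not the kernel code: the mock_* names, the fixed-size
table and the stand-in struct file_operations are all assumptions made for
illustration, and the mode/uid/gid/auto_owner arguments are left out for
brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel type; an assumption for this sketch only. */
struct file_operations { int unused; };

#define MAX_ENTRIES 16   /* hypothetical table size */
#define MAX_NAME    32   /* hypothetical name limit */

struct entry {
    int in_use;
    unsigned int major, minor;
    char name[MAX_NAME];
    unsigned int namelen;
    struct file_operations *fops;
};

struct entry table[MAX_ENTRIES];

/* Mock registration: store the entry, return it as an opaque handle. */
void *mock_dev_register(unsigned int major, unsigned int minor,
                        const char *name, unsigned int namelen,
                        struct file_operations *fops)
{
    if (name == NULL || namelen == 0 || namelen >= MAX_NAME)
        return NULL;
    for (int i = 0; i < MAX_ENTRIES; i++) {
        if (!table[i].in_use) {
            table[i].in_use = 1;
            table[i].major = major;
            table[i].minor = minor;
            memcpy(table[i].name, name, namelen);
            table[i].name[namelen] = '\0';
            table[i].namelen = namelen;
            table[i].fops = fops;
            return &table[i];
        }
    }
    return NULL;  /* table full */
}

/* Mock unregistration: use the handle if given, else search by name,
   else (handle and name both NULL) search by major and minor. */
void mock_dev_unregister(void *handle, const char *name, unsigned int namelen,
                         unsigned int major, unsigned int minor)
{
    struct entry *e = handle;

    for (int i = 0; i < MAX_ENTRIES && e == NULL; i++) {
        struct entry *c = &table[i];
        if (!c->in_use)
            continue;
        if (name != NULL) {
            if (c->namelen == namelen && memcmp(c->name, name, namelen) == 0)
                e = c;
        } else if (c->major == major && c->minor == minor) {
            e = c;
        }
    }
    if (e != NULL)
        e->in_use = 0;
}

/* Count live entries; only here so the effect of the calls can be observed. */
int mock_dev_count(void)
{
    int n = 0;
    for (int i = 0; i < MAX_ENTRIES; i++)
        n += table[i].in_use;
    return n;
}
```

Passing the handle avoids any search at all, which is presumably why
dev_register hands one back; the name and major/minor paths are fallbacks.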
===============================================================================
Device File System (devfs) Overview

Richard Gooch <rgooch@atnf.csiro.au>

9-JAN-1998

What is it?
===========

Devfs is an alternative to "real" character and block special devices
on your root filesystem. Kernel device drivers can register devices by
name rather than major and minor numbers. These devices will appear in
the devfs automatically, with whatever default ownership and
protection the driver specified.

Why do it?
==========

There are several problems that devfs addresses. Some of these
problems are more serious than others (depending on your point of
view), and some can be solved without devfs. However, the totality of
these problems really calls out for devfs.

Major&minor allocation
----------------------
The existing scheme requires the allocation of major and minor device
numbers for each and every device. This means that a central
co-ordinating authority is required to issue these device numbers
(unless you're developing a "private" device driver), in order to
preserve uniqueness. Devfs shifts the burden to a namespace. This may
not seem like a huge benefit, but actually it is. Since driver authors
will naturally choose a device name which reflects the functionality
of the device, there is far less potential for namespace conflict.
Solving this requires a kernel change.

/dev management
---------------
Because you currently access devices through device nodes, these must
be created by the system administrator. For standard devices you can
usually find a MAKEDEV programme which creates all (hundreds!) of
these nodes. This means that changes in the kernel must be reflected
by changes in the MAKEDEV programme, or else the system administrator
creates device nodes by hand.
The basic problem is that there are two separate databases of
major and minor numbers. One is in the kernel and one is in /dev (or
in a MAKEDEV programme, if you want to look at it that way).
Solving this requires a kernel change.

/dev growth
-----------
I maintain a subset of the common /dev nodes, and I have nearly 600!
Others have twice this number. Most of these devices simply don't
exist because the hardware is not available. A huge /dev increases the
time to access devices (I'm just referring to the dentry lookup times
here: the next section shows some more horrors).
An example of how big /dev can grow is if we consider SCSI devices:
    bus          4 bits
    unit         8 bits
    LUN          8 bits
    partition    6 bits
    TOTAL       26 bits
This requires 64 Mega (64*1024*1024 = 2^26) inodes if we want to store
all possible device nodes. Even if we scrap different units and LUNs,
the bus and partition fields are still 10 bits or 1024 inodes. Each VFS
inode takes around 256 bytes (kernel 2.1.78), so that's 256 kBytes of
inode storage!
This could be solved in user-space using a clever programme which
scans the kernel logs, deletes /dev entries for devices which are not
available and creates them when they become available. This programme
would need to be run every time a new module is loaded, which would
slow things down a lot. Devfs is much cleaner.
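
The SCSI arithmetic above can be checked with a few lines of C. This is
only a sketch: the function names are made up for illustration, and the
256 bytes per inode is the rough figure quoted above, not a measured value.

```c
#include <assert.h>

/* Number of distinct device nodes addressable with `bits` address bits. */
unsigned long long node_count(unsigned int bits)
{
    assert(bits < 64);
    return 1ULL << bits;
}

/* Inode storage in bytes, assuming roughly 256 bytes per VFS inode. */
unsigned long long inode_bytes(unsigned long long nodes)
{
    return nodes * 256ULL;
}
```

node_count(4 + 8 + 8 + 6) gives 67108864 nodes (64 Mega), and
inode_bytes(node_count(4 + 6)) gives 262144 bytes (256 kBytes) for the
bus and partition bits alone.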

Node to driver file_operations translation
------------------------------------------
There is an important difference between the way disc-based character
and block (c&b) nodes and devfs make the connection between an entry
in /dev and the actual device driver.

With the current 8 bit major and minor numbers the connection between
disc-based c&b nodes and per-major drivers is done through a
fixed-length table of 128 entries. The various filesystem types set
the inode operations for c&b nodes to {chr,blk}dev_inode_operations,
so when a device is opened a few quick levels of indirection bring us
to the driver file_operations.

For miscellaneous character devices a second step is required: there
is a scan for the driver entry with the same minor number as the file
that was opened, and the appropriate minor open method is called. This
scanning is done *every time* you open a device node. Potentially, you
may be searching through dozens of misc. entries before you find your
open method.

Linux *must* move beyond the 8 bit major and minor barrier,
somehow. If we simply increase each to 16 bits, then the indexing
scheme used for major driver lookup becomes untenable, because the
major tables (one each for character and block devices) would need to
be 64 k entries long (512 kBytes on x86, 1 Mbyte for 64 bit
systems). So we would have to use a scheme like that used for
miscellaneous character devices, which means the search time goes up
linearly with the number of registered major device drivers.
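
The trade-off can be sketched as follows. This is an illustration, not
kernel code: it assumes each slot of a directly indexed table holds two
pointers (which is consistent with the 512 kBytes and 1 Mbyte figures
above), and the list node and function names are hypothetical.

```c
#include <stddef.h>

/* Bytes needed for a directly indexed table with one slot per possible
   major number, assuming two pointers per slot. */
size_t table_bytes(unsigned int major_bits, size_t pointer_size)
{
    return ((size_t)1 << major_bits) * 2 * pointer_size;
}

/* List node for the scan-based alternative. */
struct drv {
    unsigned int major;
    struct drv *next;
};

/* Walk the list looking for `major`. *steps receives the number of
   nodes visited, which grows with the number of registered drivers. */
struct drv *scan_lookup(struct drv *head, unsigned int major,
                        unsigned int *steps)
{
    *steps = 0;
    for (struct drv *d = head; d != NULL; d = d->next) {
        (*steps)++;
        if (d->major == major)
            return d;
    }
    return NULL;
}
```

table_bytes(16, 4) is 524288 and table_bytes(16, 8) is 1048576, while a
lookup that misses in scan_lookup visits every node in the list.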

Note that devfs doesn't use the major&minor system. For devfs
entries, the connection is made when you look up the /dev entry. When
dev_register() is called, an entry holding the name and the
file_operations is appended to an internal table. If the dentry cache
doesn't have the /dev entry already, this internal table is scanned to
get the file_operations, and an inode is created. If the dentry cache already
has the entry, there is *no lookup time* (other than the dentry scan
itself, but we can't avoid that anyway, and besides Linux dentries
cream other OS'es which don't have them:-). Furthermore, the number of
node entries in a devfs is only the number of available device
entries, not the number of *conceivable* entries. Even if you remove
unnecessary entries in a disc-based /dev, the number of conceivable
entries remains the same.
Devfs provides a fast connection between a VFS node and the device
driver, in a scalable way.
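
A rough user-space sketch of that lookup path follows. The real dentry
cache is far more sophisticated; here it is faked with a small array, and
every name in the code (lookup, cache, table_scans, the stand-in struct
file_operations) is an assumption made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel type; an assumption for this sketch only. */
struct file_operations { int unused; };

/* One row of the internal table filled in at registration time. */
struct devfs_entry {
    const char *name;
    struct file_operations *fops;
};

#define CACHE_SLOTS 8   /* hypothetical; stands in for the dentry cache */

struct devfs_entry cache[CACHE_SLOTS];
unsigned int table_scans;   /* how often the internal table was scanned */

/* Resolve a name to its file_operations. A cache hit never touches the
   internal table; a miss scans the table once and caches the result. */
struct file_operations *lookup(const struct devfs_entry *tab, size_t n,
                               const char *name)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].name != NULL && strcmp(cache[i].name, name) == 0)
            return cache[i].fops;

    table_scans++;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tab[i].name, name) != 0)
            continue;
        for (size_t j = 0; j < CACHE_SLOTS; j++) {
            if (cache[j].name == NULL) {
                cache[j] = tab[i];
                break;
            }
        }
        return tab[i].fops;
    }
    return NULL;   /* no such device registered */
}
```

Note that the table only ever holds devices that were actually
registered, so even the slow path scans available entries rather than
conceivable ones.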

/dev as a system administration tool
------------------------------------
Right now /dev contains a list of conceivable devices, most of which I
don't have. A devfs would only show those devices available on my
system. This means that listing /dev would be a handy way of checking
what devices were available.

Major&minor size
----------------
Existing major and minor numbers are limited to 8 bits each. This is
now a limiting factor for some drivers, particularly the SCSI disc
driver, which consumes a single major number. Only 16 discs are
supported, and each disc may have only 15 partitions. Maybe this isn't
a problem for you, but some of us are building huge Linux systems with
disc arrays.
Solving this requires a kernel change.
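
The 16 discs and 15 partitions fall straight out of splitting an 8 bit
minor into two 4 bit fields, with partition 0 standing for the whole
disc. A small sketch, with helper names invented for illustration:

```c
#include <assert.h>

#define PARTITION_BITS 4   /* low 4 bits: 0 = whole disc, 1..15 = partitions */

/* Pack a disc index and partition number into an 8 bit minor. */
unsigned int sd_minor(unsigned int disc, unsigned int partition)
{
    assert(disc < 16 && partition < 16);
    return (disc << PARTITION_BITS) | partition;
}

/* Recover the disc index from a minor. */
unsigned int sd_disc(unsigned int minor)
{
    return (minor & 0xffu) >> PARTITION_BITS;
}

/* Recover the partition number from a minor. */
unsigned int sd_partition(unsigned int minor)
{
    return minor & 0x0fu;
}
```

The last usable value is sd_minor(15, 15), which is 255; a 17th disc
simply has nowhere to go within the same major.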

Readonly root filesystem
------------------------
Having your device nodes on the root filesystem means that you can't
operate properly with a read-only root filesystem. This is because you
want to change ownerships and protections of tty devices. Existing
practice prevents you using a CD-ROM as your root filesystem for a
*real* system. Sure, you can boot off a CD-ROM, but you can't change
tty ownerships, so it's only good for installing.
Also, you can't use a shared NFS root filesystem for a cluster of
discless Linux machines (having tty ownerships changed on a common
/dev is not good). Nor can you embed your root filesystem in a
ROM-FS.
You can get around this by creating a RAMDISC at boot time, making
an ext2 filesystem in it, mounting it somewhere and copying the
contents of /dev into it, then unmounting it and mounting it over
/dev. A devfs is a cleaner way of solving this.

Non-Unix root filesystem
------------------------
Non-Unix filesystems (such as NTFS) can't be used for a root
filesystem because they variously don't support character and block
special files or symbolic links. You can't have a separate disc-based
or RAMDISC-based filesystem mounted on /dev because you need device
nodes before you can mount. Devfs can be mounted without any device
nodes.
Solving this requires devfs.

PTY security
------------
Current pseudo-tty (pty) devices are owned by root and read-writable
by everyone. The user of a pty-pair cannot change
ownership/protections without being suid-root.
This could be solved with a secure user-space daemon which runs as
root and does the actual creation of pty-pairs. Such a daemon would
require modification to *every* programme that wants to use this new
mechanism. It also slows down creation of pty-pairs.
An alternative is to create a new open_pty() syscall which does much
the same thing as the user-space daemon. Once again, this requires
modifications to pty-handling programmes.
The devfs solution would allow a device driver to "tag" certain device
files so that when an unopened device is opened, the ownerships are
changed to the current euid and egid of the opening process, and the
protections are changed to the default registered by the driver. When
the device is closed ownership is set back to root and protections are
set back to read-write for everybody. No programme need be changed.
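
The tagging behaviour can be sketched in user space like this. It is an
illustration only: struct node and the node_* functions are hypothetical,
and it assumes that "closed" means the last close of the device.

```c
/* Hypothetical per-device state; not the kernel's inode. */
struct node {
    unsigned int uid, gid, mode;            /* current ownership/protection */
    unsigned int default_uid, default_gid;  /* registered by the driver */
    unsigned int default_mode;              /* registered by the driver */
    unsigned int open_count;
    int auto_owner;                         /* the "tag" set by the driver */
};

/* First open of a tagged device: hand it to the opening process. */
void node_open(struct node *n, unsigned int euid, unsigned int egid)
{
    if (n->auto_owner && n->open_count == 0) {
        n->uid = euid;
        n->gid = egid;
        n->mode = n->default_mode;
    }
    n->open_count++;
}

/* Last close of a tagged device: revert to the registered owner and
   make it read-write for everybody again. */
void node_close(struct node *n)
{
    if (n->open_count == 0)
        return;
    n->open_count--;
    if (n->auto_owner && n->open_count == 0) {
        n->uid = n->default_uid;
        n->gid = n->default_gid;
        n->mode = 0666;
    }
}
```

For a pty the registered owner would be root (uid 0, gid 0), so a second
process opening an already open device does not take it over, and the
device goes back to root when the last user closes it.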