Re: [ANNOUNCE] block device interfaces changes

From: Alexander Viro (viro@math.psu.edu)
Date: Sat Jan 08 2000 - 15:10:08 EST


On Sat, 8 Jan 2000 Andries.Brouwer@cwi.nl wrote:

> From viro@math.psu.edu Sat Jan 8 04:49:23 2000
>
> Folks, there are changes underway in block device interface and
> some of them made it into 2.3.38.
>
> * New type (struct block_device) is defined. We have a cache of
> such objects, indexed by dev_t. struct block_device * is going to replace
> kdev_t for block devices. Handling of the cache is done in fs/block_dev.c
>
> Good!
> What do you mean by dev_t?

        That cache is populated upon foo_read_inode() (by
init_special_inode()) and the search key in this cache is (major+minor).
Notice that such searches should be done _only_ upon the interaction with
user / filesystem. Internally kernel should just use the pointer. Period.

> Don't forget to include the bdev/cdev bit in devno.]

        No, thanks. Character devices _are_ different; by coincidence we
are using 16bit integers to refer both to block and character devices, but
that's about it. Moreover, interface between block device and the kernel
is not a file one - fs/block_dev.c provides the glue that allows accessing
the block device as file, but that's none of device's business. It's main
interface is the request queue. Moreover, there is a notion of partitions
for block devices, and that notion doesn't belong to drivers. IOW, it's
VFS that makes character and block devices look similar. They have totally
different natures.

> * They have methods (struct block_device_operations). Currently
> the set is { open, release, ioctl, revalidate, check_media_change }. For
> now (and it's going to change) types are the same as in file_operations.
> However, in the near future they are going to become
> int (*open)(struct block_device *bdev, mode_t mode, unsigned flags);
> int (*release)(struct block_device *bdev);
> int (*ioctl)(struct block_device *bdev, unsigned cmd, unsigned
> long arg);
> int (*revalidate)(struct block_device *bdev);
> int (*check_media_change)(struct block_device *bdev);
>
> I needed a little more stuff.
> One thing is that the kernel does not need names for files,
> but it needs names for devices since it uses device names in
> printk. Thus you need something like
> char (*devname)(struct block_device *bdev);

        Umm... I'ld rather had that done statically, in register_...()
time. No need to breed complexity...

> Another thing is that I used two types, let us call them here
> major_struct and minor_struct, where a kdev_t was a pointer to
> a minor_struct, and the major_struct had a method
> kdev_t getdev(major_struct *dev, int minor);
> to retrieve a minor if it existed or create it if not.
> Probably this is part of your open, but it is not clear to me
> how you handle the difference between major_struct and minor_struct.
> They are rather different.

        major_struct is completely bogus. We don't have 1-1 correspondence
between majors and drivers. We don't have 1-1 correspondence between
majors and physical units (disks). So I think that correct way looks so:

        a) generic support for sets of embedded ranges. I.e. the node is a
range + set of non-overlapping subranges(==subnodes) + function + pointer
to data. We
have 4 functions:
range_lookup: node x value -> node
range_add: node x from x to x function x data -> subnode
range_delete: node ->
range_fit: node x range_size x function x data -> subnode.
        range_add() tries to add a subnode to a node. If the requested
range overlaps with any of existing subranges - too bad, -EBUSY. Otherwise
we are getting a pointer to new node.
        range_fit() does almost the same, except that instead of defining
range by its ends we just say what size we want.
        range_delete() - well, obvious.
        range_lookup() goes for the smallest range containing value (i.e.
goes down the tree), then if the node doesn't have associated function -
return the node, otherwise call the function and if it had lead to
creation of smaller subrange containing our value - go there and repeat
the process.
        Typically function associated with the node will do required
request_module(). E.g. root of the character device tree will have 'ask
for char-major-<upper 8 bits of dev_t>' here. And its subnode with range
0xa00--0xaff will have 'ask for char-major-10-<lower 8 bits>'.
        'Data' is opaque wrt to this beast, but the callers of
range_lookup() will know what to expect there. E.g. block device may
require a range for itself + subranges for disks. Said subranges will have
pointer to struct gendisk as a 'data'. OTOH per-driver range may have
block_device_operations + other per-driver stuff kept that way.

        Notice that this structure _is_ common for block and character
devices - it's simply a generic parser for dev_t-like data (i.e. a Shannon
tree...).

> A bug in Linux at present is that there are alias problems
> between block m of a partition and N+m of the entire disk.
> Thus one gets strange failures of fdisk and mkswap.
> (And the loop device is worse in this respect.)

        You are not going to get rid of them. Sorry. Page alignment can be
totally incompatible. Just Don't Do It (tm).
 
> Character devices are not affected at all - IMO using the same
> type both for block and character device was a mistake. So their handling
> remains as-is. Probably something should be done for them too, but that's
> completely different story.
>
> You follow Linus here, but I do not really agree.
> There is the interface with userspace, and that is completely
> identical in both cases. And kernel-internally you also want
> open(), release(), ioctl() (and devname()) for character devices.

First of all, I _don't_ _want_ ioctl(). The less of those we have, the
happier I will be. It was the worst idea in v7. Hell, they did it right in
_mid-80s_ in Plan 9. 15 years later we keep adding this mess. Ironic,
since we also have Plan 9-ish solution for that - pseudofs (in our case -
/proc)... One of the greatest ideas in UNIX is that everything is a stream
of characters, so WTF messing with the special interfaces for passing
commands to drivers? write() and read() are there for purpose... I would
consider ioctl() as a backwards compatibility kludge, in some cases
mandated by POSIX. And POSIX has _nothing_ with taste or elegance - it's a
codification of the worst Missed'em'V misfeatures. We may need to support
it, but going further than that...<shudder>

> My point of view would be more: there is this great unstructured
> mass of devices. The userspace interface is uniform, and part of
> the kernel stuff is uniform. But there is a subclass of that of

        Uniform? Let's see: open() and release() are common (under various
names) for next to any object in the kernel. ioctl() is, excuse me, a
barf-bag for all irregular stuff. So yes, both classes have irregular
pieces of interface. And they happen to have names. Great.
        About all uniformity that is there is that we have an external
representation of those with numbers and we have modules bound to
subranges of dev_t space. That's it. We need to be able to say 'I want
42, take care of modules' and we need to be able to say 'I'm responsible
for range 13--69, here's the structure you should give to anyone who'll
ask for values in that range'. _That_ is common, the rest... No.

-- 
If UNIX would be committee-driven it would combine conciseness of OS/360
with convenience of VMS, effectivity of Ada and clean syntax of COBOL.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Jan 15 2000 - 21:00:13 EST