[patch] block device stacking support, raid-2.3.47-B6

From: Ingo Molnar (mingo@redhat.com)
Date: Wed Feb 23 2000 - 12:08:13 EST


Heinz, Andrea, Linus,

various ideas/patches regarding block device stacking support were
floating around in the last couple of days; here is a patch against
vanilla 2.3.47 that addresses both RAID's and LVM's needs sufficiently:

        http://www.redhat.com/~mingo/raid-patches/raid-2.3.47-B6

(also attached) Andrea's patch from yesterday touches some of these
issues, but RAID has different needs wrt. ->make_request():

- RAID1 and RAID5 need truly recursive ->make_request() stacking because
  the relationship between the request-bh and the IO-bh is not 1:1. In the
  case of RAID0/linear and LVM the mapping is 1:1, so no on-stack
  recursion is necessary.

- re-grabbing the device queue in generic_make_request() is necessary,
  just think of RAID0+LVM stacking.

- IO-errors have to be initiated in the layer that notices them.

- I don't agree with moving the ->make_request() function to be
  a per-major thing; in the (near) future I'd like to implement RAID
  personalities via several sub-queues of a single RAID blockdevice,
  avoiding the current md_make_request internal step completely.

- renaming ->make_request_fn() to ->logical_volume_fn is both misleading
  and unnecessary.

I've added the good bits (I hope I found all of them) from Andrea's patch
as well: the end_io() fix in md.c, the ->make_request() change returning
IO errors, and avoiding an unnecessary get_queue() in the fast path.

the patch changes blkdev->make_request_fn() semantics, but the new
semantics work pretty well for RAID0, LVM and RAID1/RAID5 alike:

  (bh->b_dev, bh->b_blocknr) => just like today, never modified, this is
                                the 'physical index' of the buffer-cache.

  internally, any special ->make_request() function is forbidden to access
  b_dev and b_blocknr as well; b_rdev and b_rsector have to be used.
  ll_rw_block() correctly installs an identity mapping first, and all
  stacked devices just iterate one more step.

  bh->b_rdev: the 'current target device'
  bh->b_rsector: the 'current target sector'

  the return values of ->make_request_fn():
        ret == 0: don't continue iterating and don't submit IO
        ret > 0: continue iterating
        ret < 0: IO error (already handled by the layer which noticed it)

  we explicitly rely on ll_rw_blk getting the BH_Lock and not calling
  ->make_request() on this bh more than once.

with these semantics all the variations are possible, it's up to the
device to use the one it likes best:

 - device resolves one mapping step and returns 1 (RAID0, LVM)

 - device calls generic_make_request() and returns 1 (RAID1, RAID5)

 - device resolves recursion internally and returns 0 (future RAID0), or
          returns 1 if the recursion cannot be resolved internally.

generic_make_request() returns 0 if it has submitted IO - thus
generic_make_request() can also be used as a queue's ->make_request_fn()
function - it's completely symmetric. (not that anyone would want to do
this)

NOTE: a device might still resolve stacking internally, if it can. E.g. the
next version of raid0.c will do a while loop internally if we map
RAID0->RAID0. The performance advantage is obvious: no indirect function
calls and no get_queue(). LVM could do the same as well.

(the patch modifies lvm.c to reflect these new semantics, i.e. to not
rely on b_dev and b_blocknr and to not call generic_make_request(), and
fixes the lvm.c hack that prevented MD<->LVM stacking. These changes are
untested.)

with this method it was pretty straightforward to add stacked RAID0 and
linear device support, here is a sample RAID0+RAID0 => RAID0 stacking:

        [root@moon /root]# cat /proc/mdstat
        Personalities : [linear] [raid0]
        read_ahead 1024 sectors
        md2 : active raid0 mdb[1] mda[0]
              1661472 blocks 4k chunks

        md1 : active raid0 sdf1[1] sde1[0]
              830736 blocks 4k chunks

        md0 : active raid0 sdd1[1] sdc1[0]
              830736 blocks 4k chunks

        unused devices: <none>
        [root@moon /root]# df /mnt
        Filesystem  1k-blocks  Used  Available  Use%  Mounted on
        /dev/md2      1607473    13    1524387    0%  /mnt

The LVM changes are not tested. The RAID0/linear changes compile/boot/work
just fine and are reasonably well-tested and understood.

any objections?

        Ingo


