Re: [RFC] Multi-path IO in 2.5/2.6 ?

From: James Bottomley (James.Bottomley@steeleye.com)
Date: Wed Sep 11 2002 - 14:37:40 EST


lmb@suse.de said:
> and if the device mapper could have the notion of "to what topology
> group does this device belong to", or even "distance metric (without
> going into further detail on what this is, as long as it is consistent
> to the physical layer) to the current CPU" (so that the shortest path
> in NUMA could be selected), that would be kinda cool ;-) And doesn't
> seem too intrusive.

I think I see driverfs as the solution here. Topology is deduced by examining
certain device and HBA parameters. As long as these parameters can be exposed
as nodes in the device directory for driverfs, a user level daemon map the
topology and connect the paths at the top. It should even be possible to
weight the constructed multi-paths.

This solution appeals because the kernel doesn't have to dictate policy, all
it needs to be told is what information it should be exposing and lets user
level get on with policy determination (this is a mini version of why we
shouldn't have network routing policy deduced and implemented by the kernel).

> But one of the issues with the md layer right now for example is the fact that
> an error on a backup path will only be noticed as soon as it actually tries to
> use it, even if the cpqfc (to name the culprit, worst code I've seen in a
> while) notices a link-down event. It isn't communicated anywhere, and how
> should it be?

I've been think about this separately. FC in particular needs some type of
event notification API (something like "I've just seen this disk" or "my loop
just went down"). I'd like to leverage a mid-layer api into hot plug for some
of this, but I don't have the details worked out.

> If the path was somehow exposed to user-space (as it is with md right now),
> the reprobing could even be done outside the kernel. This seems to make sense,
> because a potential extensive device-specific diagnostic doesn't have to be
> folded into it then.

The probing issue is an interesting one. At least SCSI has the ability to
probe with no IO (something like a TEST UNIT READY) and I assume other block
devices have something similar. Would it make sense to tie this to a single
well known ioctl so that you can probe any device that supports it without
having to send real I/O?

> The other features on my list - prioritizing paths, a useful, consistent user
> interface via driverfs/devfs/procfs - are more or less policy I guess.

Mike Sullivan (of IBM) is working with Patrick Mochel on this.

James

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Sep 15 2002 - 22:00:26 EST