Re: [PATCH 0/3] Provide more fine grained control over multipathing
From: Martin K. Petersen
Date: Fri Jun 01 2018 - 10:10:28 EST
Good morning Mike,
> This notion that only native NVMe multipath can be successful is utter
> bullshit. And the mere fact that I've gotten such a reaction from a
> select few speaks to some serious control issues.
Please stop making this personal.
> Imagine if XFS developers just one day imposed that it is the _only_
> filesystem that can be used on persistent memory.
It's not about project X vs. project Y at all. This is about how we got
to where we are today. And whether we are making the right decisions that
will benefit our users in the long run.
20 years ago there were several device-specific SCSI multipath drivers
available for Linux. All of them out-of-tree because there was no good
way to consolidate them. They all worked in very different ways because
the devices themselves were implemented in very different ways. It was a
nightmare.
At the time we were very proud of our block layer, an abstraction none
of the other operating systems really had. And along came Ingo and
Miguel and did a PoC MD multipath implementation for devices that didn't
have special needs. It was small, beautiful, and fit well into our shiny
block layer abstraction. And therefore everyone working on Linux storage
at the time was convinced that the block layer multipath model was the
right way to go. Including, I must emphasize, yours truly.
There were several reasons why the block + userland model was especially
compelling:
1. There were no device serial numbers, UUIDs, or VPD pages. So short
of disk labels, there was no way to automatically establish that block
device sda was in fact the same LUN as sdb. MD and DM were existing
vehicles for describing block device relationships. Either via on-disk
metadata or config files and device mapper tables. And system
configurations were simple and static enough then that manually
maintaining a config file wasn't much of a burden.
2. There was lots of talk in the industry about devices supporting
heterogeneous multipathing. As in ATA on one port and SCSI on the
other. So we deliberately did not want to put multipathing in SCSI,
anticipating that these hybrid devices might show up (this was in the
IDE days, obviously, predating libata sitting under SCSI). We made
several design compromises wrt. SCSI devices to accommodate future
coexistence with ATA. Then iSCSI came along and provided a "cheaper
than FC" solution and everybody instantly lost interest in ATA
multipath.
3. The devices at the time needed all sorts of custom knobs to
function. Path checkers, load balancing algorithms, explicit failover,
etc. We needed a way to run arbitrary, potentially proprietary,
commands to initiate failover and failback. Absolute no-go for the
kernel so userland it was.
Those are some of the considerations that went into the original MD/DM
multipath approach. Everything made lots of sense at the time. But
obviously the industry constantly changes, and things that were once
important no longer matter. Some design decisions were made based on
incorrect assumptions or lack of experience and we ended up with major
ad-hoc workarounds to the originally envisioned approach. SCSI device
handlers are the prime examples of how the original transport-agnostic
model didn't quite cut it. Anyway. So here we are. Current DM multipath
is a result of a whole string of design decisions, many of which are
based on assumptions that were valid at the time but which are no longer
relevant today.
ALUA came along in an attempt to standardize all the proprietary device
interactions, thus obsoleting the userland plugin requirement. It also
solved the ID/discovery aspect as well as provided a way to express
fault domains. The main problem with ALUA was that it was too
permissive, letting storage vendors get away with very suboptimal, yet
compliant, implementations based on their older, proprietary multipath
architectures. So we got the knobs standardized, but device behavior was
still all over the place.
Now enter NVMe. The industry had a chance to clean things up. No legacy
architectures to accommodate, no need for explicit failover, twiddling
mode pages, reading sector 0, etc. The rationale behind ANA is for
multipathing to work without any of the explicit configuration and
management hassles which riddle SCSI devices for hysterical raisins.
My objection to DM vs. NVMe enablement is that I think that the two
models are a very poor fit (manually configured individual block device
mapping vs. automatic grouping/failover above and below subsystem
level). On top of that, no compelling technical reason has been offered
for why DM multipath is actually a benefit. Nobody enjoys pasting WWNs
or IQNs into multipath.conf to get things working. And there is no flag
day/transition path requirement for devices that (with very few
exceptions) don't actually exist yet.
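To make the "automatic grouping/failover" side of that contrast concrete, here is a rough sketch. It is not kernel code and not how the NVMe driver is implemented. The path names, namespace identifiers, and helper functions are all made up for illustration; only the ANA state names come from the NVMe spec. The point is simply that when every path reports a namespace identifier and an access state, grouping and path selection need no hand-maintained table.

```python
from collections import defaultdict

# Hypothetical path records: (path name, namespace identifier, ANA state).
# All values are invented for illustration only.
EXAMPLE_PATHS = [
    ("nvme0c0n1", "ns-A", "non-optimized"),
    ("nvme0c1n1", "ns-A", "optimized"),
    ("nvme1c2n1", "ns-B", "inaccessible"),
    ("nvme1c3n1", "ns-B", "non-optimized"),
]

# Lower number means more preferred. States missing from this table
# (e.g. "inaccessible") are treated as unusable.
ANA_PREFERENCE = {"optimized": 0, "non-optimized": 1}


def group_paths(paths):
    """Group paths by the identifier the device itself reports."""
    groups = defaultdict(list)
    for name, ident, state in paths:
        groups[ident].append((name, state))
    return dict(groups)


def pick_path(group):
    """Return the most preferred usable path name, or None if none is usable."""
    usable = [
        (ANA_PREFERENCE[state], name)
        for name, state in group
        if state in ANA_PREFERENCE
    ]
    return min(usable)[1] if usable else None


groups = group_paths(EXAMPLE_PATHS)
for ident, group in sorted(groups.items()):
    print(ident, "->", pick_path(group))
```

With the DM approach, the equivalent of `group_paths` is whatever somebody typed into a config file.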
So I really don't understand why we must pound a square peg into a round
hole. NVMe is a different protocol. It is based on several decades of
storage vendor experience delivering products. And the protocol tries to
avoid the most annoying pitfalls and deficiencies from the SCSI past. DM
multipath made a ton of sense when it was conceived, and it continues to
serve its purpose well for many classes of devices. That does not
automatically imply that it is an appropriate model for *all* types of
devices, now and in the future. ANA is a deliberate industry departure
from the pre-ALUA SCSI universe that begat DM multipath.
So let's have a rational, technical discussion about what the use cases
are that would require deviating from the "hands off" aspect of ANA.
What is it DM can offer that isn't or can't be handled by the ANA code
in NVMe? What is it that must go against the grain of what the storage
vendors are trying to achieve with ANA?
--
Martin K. Petersen Oracle Linux Engineering