Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

From: Dan Williams
Date: Tue Apr 26 2016 - 13:16:30 EST


On Tue, Apr 26, 2016 at 8:31 AM, Jan Kara <jack@xxxxxxx> wrote:
> On Tue 26-04-16 07:59:10, Dan Williams wrote:
>> On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> > On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
>> [..]
>> > It seems to me you are focussing on code/technologies that exist
>> > today instead of trying to define an architecture that is more
>> > optimal for pmem storage systems. Yes, working code is great, but if
>> > you can't tell people how things like robust error handling and
>> > redundancy are going to work in future then it's going to take
>> > forever for everyone else to handle such errors robustly through the
>> > storage stack...
>>
>> Precisely because higher order redundancy is built on top of this baseline.
>>
>> MD-RAID can't do its error recovery if we don't have -EIO and
>> clear-error-on-write. On the other hand, you're absolutely right that
>> we have a gaping hole on top of the SIGBUS recovery model, and don't
>> have a kernel layer we can interpose on top of DAX to provide some
>> semblance of redundancy.
>>
>> In the meantime, a handful of applications with a team of full-time
>> site-reliability-engineers may be able to plug in external redundancy
>> infrastructure on top of what is defined in these patches. For
>> everyone else, which is the hard problem, we need to do a lot more
>> thinking about a trap-and-recover solution.
>
> So we could actually implement some kind of redundancy with DAX with
> reasonable effort. We already do track dirty storage PFNs in the radix
> tree. After DAX locking patches get merged we also have a reliable way to
> write-protect them when we decide to do 'writeback' (translates to flushing
> CPU caches) for them. When we do that, we have all the infrastructure in
> place to provide 'stable pages' while some mirroring or other redundancy
> mechanism in kernel works with the data.
>
> But as Dave said, we should do some writeup of how this is all supposed to
> work and e.g. which layer is going to be responsible for the redundancy. Do
> we want to have that in DAX code? Or just provide stable page guarantees
> from DAX and do the redundancy from device mapper? This needs more
> thought...
>

[ adding Mike, since his ears are likely burning by this point ]

If we had the ability to specify a range or list of ranges to
blkdev_issue_flush(), that would allow the driver level to implement
redundancy at sync time. And no, before someone flies off the handle,
this isn't rehashing the same argument I lost about where to track
dirty pfns. Rather, this relies on the radix tree to track dirty pfns,
but asks the driver to do the flush operation. In the nominal case
this is a clflush / clwb loop or wbinvd in the pmem driver; in the
redundancy case the pmem driver is swapped out for a driver that uses
the flush request as a trigger point to synchronize redundant data.
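
To make the idea concrete, here is a rough user-space sketch, not
kernel code. Everything in it is hypothetical: blkdev_issue_flush() has
no range argument, and struct flush_range, struct flush_ops, and the two
*_flush_ranges() functions are names invented purely for illustration.
It assumes the caller hands the driver a list of dirty byte ranges and
that the "mirror" is a plain memory copy. The cache-line flush is
modelled as a counter rather than real clwb instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

/* Hypothetical: one dirty byte range, as reported by the dirty tracking. */
struct flush_range {
	size_t offset;
	size_t len;
};

/* Hypothetical driver-level hook invoked at sync time. */
struct flush_ops {
	int (*flush_ranges)(void *dev, const struct flush_range *r, size_t nr);
};

struct pmem_dev {
	uint8_t *media;
	size_t size;
	size_t lines_flushed;	/* stand-in for issuing clflush / clwb */
};

struct mirror_dev {
	struct pmem_dev primary;
	uint8_t *mirror;	/* redundant copy, same size as primary */
	size_t bytes_synced;
};

/* Overflow-safe check that a range lies inside the device. */
static int range_ok(size_t size, const struct flush_range *r)
{
	return r->len <= size && r->offset <= size - r->len;
}

/* Nominal case: walk each range one cache line at a time. */
static int pmem_flush_ranges(void *dev, const struct flush_range *r, size_t nr)
{
	struct pmem_dev *p = dev;

	for (size_t i = 0; i < nr; i++) {
		if (!range_ok(p->size, &r[i]))
			return -1;
		if (r[i].len == 0)
			continue;
		size_t first = r[i].offset / CACHELINE;
		size_t last = (r[i].offset + r[i].len - 1) / CACHELINE;
		/* A real driver would flush each line here. */
		p->lines_flushed += last - first + 1;
	}
	return 0;
}

/* Redundancy case: the flush request triggers a sync of the mirror. */
static int mirror_flush_ranges(void *dev, const struct flush_range *r, size_t nr)
{
	struct mirror_dev *m = dev;
	int rc;

	assert(m->mirror != NULL);

	/* Make the primary durable first; this also validates the ranges. */
	rc = pmem_flush_ranges(&m->primary, r, nr);
	if (rc)
		return rc;

	for (size_t i = 0; i < nr; i++) {
		memcpy(m->mirror + r[i].offset,
		       m->primary.media + r[i].offset, r[i].len);
		m->bytes_synced += r[i].len;
	}
	return 0;
}

static const struct flush_ops pmem_ops = { .flush_ranges = pmem_flush_ranges };
static const struct flush_ops mirror_ops = { .flush_ranges = mirror_flush_ranges };
```

The caller only ever sees struct flush_ops, so swapping pmem_ops for
mirror_ops changes what happens at sync time without touching the code
that tracks dirty ranges.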

We want this at the driver level to take advantage of standard
asynchronous completions, and make it administratively equivalent to
the dm/md layering people are used to using.