On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> The call trace is like this:
>  memory_failure()
>   pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
>    gendisk->fops->block_lost()   => pmem_block_lost() or
>                                      md_blk_block_lost()
>     sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
>      xfs_rmap_query_range()
>       xfs_storage_lost_helper()
>        mf_recover_controller->recover_fn => \
>            memory_failure_dev_pagemap_kill_procs()
>
> The collect_procs() and kill_procs() are moved into a callback which
> is passed from memory_failure() to xfs_storage_lost_helper(), so we
> can call it as soon as an associated file is found, instead of
> building a file list and iterating over it.
>
> The fsdax & reflink support for XFS is not contained in this patchset.
This looks promising - the overall architecture is a lot more
generic and less dependent on knowing about memory, dax or memory
failures. A few comments that I think would further improve
understanding of the patchset and the implementation:
- the order of the patches is inverted. It should start with a
single patch introducing the mf_recover_controller structure for
callbacks, then introduce pgmap->ops->memory_failure, then
->block_lost, then the pmem and md implementations of ->block
list, then ->storage_lost and the XFS implementations of
->storage_lost.
- I think the names "block_lost" and "storage_lost" are misleading.
It's more like a "media failure" or a general "data corruption"
event at a specific physical location. The data may not be "lost"
but only damaged, so we might be able to recover from it without
"losing" anything. Hence I think they could be better named,
perhaps just "->corrupt_range"
- need to pass an {offset,len} pair through the chain, not just a
single offset. This will allow other types of devices to report
different ranges of failures, from a single sector to an entire
device. (A rough sketch of such an interface follows below this
list.)
- I'm not sure that passing the mf_recover_controller structure
through the corruption event chain is the right thing to do here.
A block device could generate this storage failure callback if it
detects an unrecoverable error (e.g. during a MD media scrub or
rebuild/resilver failure), and in that case we have no PFNs and no
memory device failure handling to perform.
IOWs, I think the action that is taken needs to be independent of
the source that generated the error. Even for a pmem device, we
can be using the page cache, so it may be possible to recover the
pmem error by writing the cached page (if it exists) back over the
pmem.
Hence I think that the recover function probably needs to be moved
to the address space ops, because what we do to recover from the
error is going to be dependent on the type of mapping the filesystem
is using. If it's a DAX mapping, we call back into a generic DAX
function that does the vma walk and process kill. If it is a page
cache mapping and the page is cached, we can try to re-write it to
disk to fix the bad data; otherwise we treat it like a writeback
error and report it on the next write/fsync/close operation done on
that file. (A rough sketch of this dispatch is below.)
This gets rid of the mf_recover_controller altogether and allows
the interface to be used by any sort of block device for any sort
of bottom-up reporting of media/device failures.
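
To make the renaming and the {offset,len} range suggestion a bit more
concrete, here's a rough sketch of what the per-layer methods could
look like. None of these structures or names exist - they are purely
illustrative, not taken from the posted patches:

/*
 * Illustrative sketch only: each layer reports a byte range
 * {offset, len} relative to itself, so anything from a single
 * sector to an entire device can be described, and no
 * memory-failure specific state is carried along.
 */
#include <linux/types.h>

struct block_device;
struct super_block;

/* device driver (pmem, md, ...) reports a bad range on the device */
struct example_bdev_ops {
	int (*corrupt_range)(struct block_device *bdev,
			     loff_t offset, size_t len);
};

/* block layer passes the translated range up to the filesystem */
struct example_super_ops {
	int (*corrupt_range)(struct super_block *sb,
			     loff_t offset, size_t len);
};

Because the only payload is the range, an MD scrub failure and a pmem
media error can report corruption through exactly the same entry point.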
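
And a rough sketch of the address space op dispatch described above.
The op name and the three helpers are hypothetical stand-ins, just to
show the recovery action being selected by the mapping type rather
than by the source of the error:

/*
 * Rough sketch only - this op does not exist; the helpers stand in
 * for "walk the vmas and kill the processes", "rewrite the cached
 * page over the bad media" and "record a writeback error".
 */
#include <linux/errno.h>
#include <linux/fs.h>

int example_dax_kill_procs(struct address_space *mapping,
			   loff_t off, size_t len);
int example_rewrite_cached_range(struct address_space *mapping,
				 loff_t off, size_t len);
void example_record_wb_error(struct address_space *mapping,
			     loff_t off, size_t len);

/* hypothetical ->corrupt_range() address space operation */
int example_corrupt_range(struct address_space *mapping,
			  loff_t off, size_t len)
{
	if (IS_DAX(mapping->host)) {
		/* DAX: no cached copy, kill the processes mapping the range */
		return example_dax_kill_procs(mapping, off, len);
	}

	/* Page cache: try to fix the media by rewriting the cached copy */
	if (example_rewrite_cached_range(mapping, off, len) == 0)
		return 0;

	/* No cached copy (or rewrite failed): treat as a writeback error */
	example_record_wb_error(mapping, off, len);
	return -EIO;
}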
Cheers,
Dave.