Re: [PATCH v11 1/8] dax: Introduce holder for dax_device

From: Dan Williams
Date: Thu Apr 07 2022 - 21:38:24 EST


[ add Mauro and Tony for RAS discussion ]

On Wed, Apr 6, 2022 at 1:39 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Tue, Apr 05, 2022 at 06:22:48PM -0700, Dan Williams wrote:
> > On Tue, Apr 5, 2022 at 5:55 PM Jane Chu <jane.chu@xxxxxxxxxx> wrote:
> > >
> > > On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
> > > > On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> > > >> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> > > >>> As the code I pasted before, pmem driver will subtract its ->data_offset,
> > > >>> which is byte-based. And the filesystem who implements ->notify_failure()
> > > >>> will calculate the offset in unit of byte again.
> > > >>>
> > > >>> So, leave its function signature byte-based, to avoid repeated conversions.
> > > >>
> > > >> I'm actually fine either way, so I'll wait for Dan to comment.
> > > >
> > > > FWIW I'd convinced myself that the reason for using byte units is to
> > > > make it possible to reduce the pmem failure blast radius to subpage
> > > > units... but then I've also been distracted for months. :/
> > > >
> > >
> > > Yes, thanks Darrick! I recall that.
> > > Maybe just add a comment about why byte unit is used?
> >
> > I think we start with page failure notification and then figure out
> > how to get finer grained through the dax interface in follow-on
> > changes. Otherwise, for finer grained error handling support,
> > memory_failure() would also need to be converted to stop upcasting
> > cache-line granularity to page granularity failures. The native MCE
> > notification communicates a 'struct mce' that can be in terms of
> > sub-page bytes, but the memory management implications are all page
> > based. I assume the FS implications are all FS-block-size based?
>
> I wouldn't necessarily make that assumption -- for regular files, the
> user program is in a better position to figure out how to reset the file
> contents.
>
> For fs metadata, it really depends. In principle, if (say) we could get
> byte granularity poison info, we could look up the space usage within
> the block to decide if the poisoned part was actually free space, in
> which case we can correct the problem by (re)zeroing the affected bytes
> to clear the poison.
>
> Obviously, if the blast radius hits the internal space info or something
> that was storing useful data, then you'd have to rebuild the whole block
> (or the whole data structure), but that's not necessarily a given.

tl;dr: dax_holder_notify_failure() != fs->notify_failure()

So I think I see some confusion between what DAX->notify_failure()
needs, memory_failure() needs, the raw information provided by the
hardware, and the failure granularity the filesystem can make use of.

DAX and memory_failure() need to make immediate page granularity
decisions. They both need to map out whole pages (in the direct map
and userspace respectively) to prevent future poison consumption, at
least until the poison is repaired.

The event that leads to a page being failed can be triggered by a
hardware error as small as an individual cacheline. While that is
interesting to a filesystem it isn't information that memory_failure()
and DAX can utilize.

The reason DAX needs to have a callback into filesystem code is to map
the page failure back to all the processes that might have that page
mapped because reflink means that page->mapping is not sufficient to
find all the affected 'struct address_space' instances. So it's more
of an address-translation / "help me kill processes" service than a
general failure notification service.

Currently when raw hardware event happens there are mechanisms like
arch-specific notifier chains, like powerpc::mce_register_notifier()
and x86::mce_register_decode_chain(), or other platform firmware code
like ghes_edac_report_mem_error() that uplevel the error to a coarse
page granularity failure, while emitting the fine granularity error
event to userspace.

All of this to say that the interface to ask the fs to do the bottom
half of memory_failure() (walking affected 'struct address_space'
instances and killing processes (mf_dax_kill_procs())) is different
than the general interface to tell the filesystem that memory has gone
bad relative to a device. So if the only caller of
fs->notify_failure() handler is this code:

+ if (pgmap->ops->memory_failure) {
+ rc = pgmap->ops->memory_failure(pgmap, PFN_PHYS(pfn), PAGE_SIZE,
+ flags);

...then you'll never get fine-grained reports. So, I still think the
DAX, pgmap and memory_failure() interface should be pfn based. The
interface to the *filesystem* ->notify_failure() can still be
byte-based, but the trigger for that byte based interface will likely
need to be something driven by another agent. Perhaps like rasdaemon
in userspace translating all the arch specific physical address events
back into device-relative offsets and then calling a new ABI that is
serviced by fs->notify_failure() on the backend.