Re: [PATCH 0/3] dax: clear poison on the fly along pwrite
From: Dan Williams
Date: Fri Sep 17 2021 - 16:21:42 EST
On Fri, Sep 17, 2021 at 8:27 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Fri, Sep 17, 2021 at 01:53:33PM +0100, Christoph Hellwig wrote:
> > On Thu, Sep 16, 2021 at 11:40:28AM -0700, Dan Williams wrote:
> > > > That was my gut feeling. If everyone feels 100% comfortable with
> > > > zeroing as the mechanism to clear poisoning I'll cave in. The most
> > > > important bit is that we do that through a dedicated DAX path instead
> > > > of abusing the block layer even more.
> > >
> > > ...or just rename dax_zero_page_range() to dax_reset_page_range()?
> > > Where reset == "zero + clear-poison"?
> >
> > I'd say that naming is more confusing than overloading zero.
>
> How about dax_zeroinit_range() ?

Works for me.

>
> To go with its fallocate flag (yeah I've been too busy sorting out -rc1
> regressions to repost this) FALLOC_FL_ZEROINIT_RANGE that will reset the
> hardware (whatever that means) and set the contents to the known value
> zero.
>
> Userspace usage model:
>
> void handle_media_error(int fd, loff_t pos, size_t len)
> {
> /* yell about this for posterior's sake */
>
> ret = fallocate(fd, FALLOC_FL_ZEROINIT_RANGE, pos, len);
>
> /* yay our disk drive / pmem / stone table engraver is online */

The fallocate mode can still be error-aware though, right? When the FS
has knowledge of the error locations, the fallocate mode could be
fallocate(fd, FALLOC_FL_OVERWRITE_ERRORS, pos, len), with the semantics
of attempting to zero out any known poison extents in the given file
range. At the risk of going overboard on new fallocate modes, there
could also be (or there could instead be) FALLOC_FL_PUNCH_ERRORS, which
skips trying to clear them and just asks the FS to throw the error
extents away.

> }
>
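
To make the usage model above concrete, here is a minimal userspace
sketch. It rests on assumptions: FALLOC_FL_ZEROINIT_RANGE is only a
proposal in this thread, so the sketch substitutes the existing
FALLOC_FL_ZERO_RANGE, which zeroes the range but makes no promise about
clearing poison. SKETCH_ZEROINIT_MODE and zero_by_pwrite() are
hypothetical names, and the pwrite() fallback is an illustration, not
something anyone in the thread specified.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Assumption: stand-in for the proposed FALLOC_FL_ZEROINIT_RANGE, which
 * does not exist. FALLOC_FL_ZERO_RANGE is real but only guarantees zeroes.
 */
#define SKETCH_ZEROINIT_MODE FALLOC_FL_ZERO_RANGE

/* Hypothetical fallback: overwrite [pos, pos + len) with zeroes. */
static int zero_by_pwrite(int fd, off_t pos, size_t len)
{
    char buf[4096];

    memset(buf, 0, sizeof(buf));
    while (len > 0) {
        size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t n = pwrite(fd, buf, chunk, pos);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            return -1;
        }
        if (n == 0) {
            errno = EIO;        /* no progress, give up */
            return -1;
        }
        pos += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Returns 0 on success, -1 with errno set on failure. */
int handle_media_error(int fd, off_t pos, size_t len)
{
    assert(fd >= 0);

    /* Ask the filesystem to zero the range in one call. */
    if (fallocate(fd, SKETCH_ZEROINIT_MODE, pos, (off_t)len) == 0)
        return 0;

    /* Mode unsupported or call failed: fall back to a plain overwrite. */
    return zero_by_pwrite(fd, pos, len);
}
```

A caller that has just seen a read fail on a range would call
handle_media_error(fd, pos, len) and then rewrite the range from
whatever redundant copy of the data it holds.
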
> > > > I'm really worried about both partitions on DAX and DM passing through
> > > > DAX because they deeply bind DAX to the block layer, which is just a bad
> > > > idea. I think we also need to sort that whole story out before removing
> > > > the EXPERIMENTAL tags.
> > >
> > > I do think it was a mistake to allow for DAX on partitions of a pmemX
> > > block-device.
> > >
> > > DAX-reflink support may be the opportunity to start deprecating that
> > > support. Only enable DAX-reflink for direct mounting on /dev/pmemX
> > > without partitions (later add dax-device direct mounting),
> >
> > I think we need to fully or almost fully sort this out.
> >
> > Here are my bold suggestions:
> >
> > 1) drop no drop the EXPERIMENTAL on the current block layer overload
> > at all
>
> I don't understand this.
>
> > 2) add direct mounting of the nvdimm namespaces ASAP. Because all
> > the filesystems currently also need the /dev/pmem0 device add a way
> > to open the block device by the dax_device instead of our current
> > way of doing the reverse
> > 3) deprecate DAX support through block layer mounts with, say, a 2 year
> > deprecation period
> > 4) add DAX remapping devices as needed
>
> What devices are needed? linear for lvm, and maybe error so we can
> actually test all this stuff?

The proposal would be zero lvm support. The nvdimm namespace
definition would need to grow support for concatenation + striping.
Soft error injection could be achieved by writing to the badblocks
interface.
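
As a rough illustration of soft error injection through badblocks, the
sketch below writes one range to a badblocks-style sysfs attribute. It
assumes the attribute accepts a single line of the form
"<first sector> <sector count>\n". The attribute path is left to the
caller because it depends on the device, and inject_soft_error() is a
hypothetical helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: mark `count` sectors starting at `sector` as bad
 * by writing "<sector> <count>\n" to attr_path (assumed input format).
 * Returns 0 on success, -1 on failure.
 */
int inject_soft_error(const char *attr_path, unsigned long long sector,
                      unsigned int count)
{
    char line[64];
    ssize_t n;
    int fd, len;

    len = snprintf(line, sizeof(line), "%llu %u\n", sector, count);
    assert(len > 0 && (size_t)len < sizeof(line));

    /* sysfs attributes already exist, so open without O_CREAT. */
    fd = open(attr_path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* Send the whole line in a single write(). */
    n = write(fd, line, (size_t)len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```

A test harness might call inject_soft_error("/sys/block/pmem0/badblocks",
4096, 8), where that path is an example and may differ per system, and
then check that reads of the affected range fail.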