Re: [RFC PATCH] dax: add badblocks check to Device DAX

From: Dan Williams
Date: Wed May 03 2017 - 18:52:28 EST


On Wed, May 3, 2017 at 3:41 PM, Kani, Toshimitsu <toshi.kani@xxxxxxx> wrote:
> On Wed, 2017-05-03 at 14:48 -0700, Dan Williams wrote:
>> On Wed, May 3, 2017 at 11:46 AM, Kani, Toshimitsu <toshi.kani@xxxxxxx
>> > wrote:
>> > On Wed, 2017-05-03 at 09:30 -0700, Dan Williams wrote:
>> > > On Wed, May 3, 2017 at 9:09 AM, Kani, Toshimitsu <toshi.kani@hpe.
>> > > com>
>> > > wrote:
>> > > > On Wed, 2017-05-03 at 08:52 -0700, Dan Williams wrote:
>> > > > > On Wed, May 3, 2017 at 8:31 AM, Toshi Kani <toshi.kani@xxxxxx
>> > > > > m>
>> > > > > wrote:
>> > > > > > This is a RFC patch for seeking suggestions. It adds
>> > > > > > support of badblocks check in Device DAX by using region-
>> > > > > > level badblocks list. This patch is only briefly tested.
>> > > > > >
>> > > > > > device_dax is a well-isolated self-contained module as it
>> > > > > > calls alloc_dax() with dev_dax, which is private to
>> > > > > > device_dax. For checking badblocks, it needs to call
>> > > > > > dax_pmem to check with region-level badblocks.
>> > > > > >
>> > > > > > This patch attempts to keep device_dax self-contained. It
>> > > > > > adds check_error() to dax_operations, and dax_check_error()
>> > > > > > as a stub with *dev_dax and *dev pointers to convey it to
>> > > > > > dax_pmem. I am wondering if this is the right direction,
>> > > > > > or we should change the modularity to let dax_pmem call
>> > > > > > alloc_dax() with its dax_pmem (or I completely missed
>> > > > > > something).
>> > > > >
>> > > > > The problem is that device-dax guarantees a given fault
>> > > > > granularity. To make that guarantee we can't fallback from 1G
>> > > > > or 2M mappings due to an error. We also can't reasonably go
>> > > > > the other way and fail mappings that contain a badblock
>> > > > > because that would change the blast radius of a media error
>> > > > > to the fault size.
>> > > >
>> > > > Does it mean we expect users to have CPUs with MCE recovery for
>> > > > Device DAX? Can we add an attributes like allow error-check &
>> > > > fall-back?
>> > >
>> > > Yes, without MCE recovery device-dax mappings that consume errors
>> > > will reboot. If an application needs the kernel protection it
>> > > should be using filesystem-dax.
>> >
>> > Understood. Are we going to provide sysfs "badblocks" for Device
>> > DAX as it is also needed for ndctl clear-error?
>>
>> No, I had started that way, but badblocks really needs write(2) or
>> fallocate(PUNCH_HOLE) support for clearing errors. Since we don't
>> want to support write(2) and were NAKd from supporting fallocate()
>> the only interface that was left was sending clear-error-DSM ioctls
>> directly to the nvdimm bus. Since that is a very libnvdimm specific
>> interface it made sense to then add badblocks at the libnvdimm-region
>> level. The "ndctl clear-error" command is there to do the translation
>> of error offsets in user space and supersedes the need for the kernel
>> to carry a badblocks file for device-dax.
>
> I am fine with using ndctl to clear errors. What I need is to allow an
> application to avoid accessing to bad blocks by reading a sysfs file
> and managing the bad blocks list by itself since the kernel does not
> protect it at page faults. At least, data offset of Device DAX should
> be provided for such application to do the translation by itself.

I believe we already have all the data needed to calculate the data
offset. Given the following sysfs path:

/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region1/dax1.1/dax/dax1.0

...we can find the associated namespace device from that dax1.1. From
there we have the base address of the namespace and the size
device-dax instance.

device_dax_data_offset == namespace_base + namespace_size - device_dax_size