Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

From: Mathieu Desnoyers
Date: Fri Feb 02 2024 - 14:29:22 EST

Next message: Lukas Wunner: "Re: [PATCH AUTOSEL 6.7 20/23] ARM: dts: Fix TPM schema violations"
Previous message: Pasha Tatashin: "[PATCH v2] iommu/iova: use named kmem_cache for iova magazines"
In reply to: Dan Williams: "Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime"
Next in thread: Dan Williams: "Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2024-02-02 12:37, Dan Williams wrote:

Mathieu Desnoyers wrote:

[...]

The alternative route I intend to take is to audit all callers
of alloc_dax() and make sure they all save the alloc_dax() return
value in a struct dax_device * local variable first for the sake
of checking for IS_ERR(). This will leave the xyz->dax_dev pointer
initialized to NULL in the error case and simplify the rest of
error checking.

I could maybe get on board with that, but it needs a comment somewhere
about the asymmetric subtlety.

Is this "somewhere" at every alloc_dax() call site, or do you have
something else in mind ?

return;
if (dax_dev->holder_data != NULL)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..b69c9e442cf4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
dax_dev = alloc_dax(pmem, &pmem_dax_ops);
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
- goto out;
+ if (rc != -EOPNOTSUPP)
+ goto out;

If I compare the before / after this change, if previously
pmem_attach_disk() was called in a configuration with FS_DAX=n, it would
result in a NULL pointer dereference.

No, alloc_dax() only returns NULL CONFIG_DAX=n case, not the
CONFIG_FS_DAX=n case.

Indeed, I was wrong there.

So that means that pmem devices on ARM have been
possible without FS_DAX. So, in order for alloc_dax() returning
ERR_PTR(-EOPNOTSUPP) to not regress pmem device availability this error
path needs to be changed.

Good point. We're moving the depends on !(ARM || MIPS |PARC) from FS_DAX
Kconfig to a runtime check in alloc_dax(), which is used whenever DAX=y,
which includes configurations that had FS_DAX=n previously.

I'll change the error path in pmem_attack_disk to treat -EOPNOTSUPP
alloc_dax() return value as non-fatal.

This would be an error handling fix all by itself. Do we really want
to return successfully if dax is unsupported, or should we return
an error here ?

Per above, there is no error handling fix, and pmem block device
available should not depend on alloc_dax() succeeding.

I agree on treating alloc_dax() failure as non-fatal. There is
however one error handling fix to nvdimm/pmem which I plan to
introduce as an initial patch before this change:

nvdimm/pmem: Fix leak on dax_add_host() failure
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:
out_cleanup_dax:
kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
set_dax_nomc(dax_dev);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
+ pmem->dax_dev = dax_dev;
rc = dax_add_host(dax_dev, disk);
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
- pmem->dax_dev = dax_dev;
-
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;

The real question is what to do about device-dax. I *think* it is not
affected by cpu_dcache aliasing because it never accesses user mappings
through a kernel alias. I doubt device-dax is in use on these platforms,
but we might need another fixup for that if someone screams about the
alloc_dax() behavior change making them lose device-dax access.

By "device-dax", I understand you mean drivers/dax/Kconfig:DEV_DAX.

Based on your analysis, is alloc_dax() still the right spot where
to place this runtime check ? Which call sites are responsible
for invoking alloc_dax() for device-dax ?

If we know which call sites do not intend to use the kernel linear
mapping, we could introduce a flag (or a new variant of the alloc_dax()
API) that would either enforce or skip the check.

[...]

Here what I'm seeing so far:

- devm_release_mem_region() is never called after devm_request_mem_region(). Not
on error, neither on teardown,

devm_release_mem_region() is called from virtio_fs_probe() context. That

I guess you mean "devm_request_mem_region()" here.

means that when virtio_fs_probe() returns an error the driver core will
automatically call devm_request_mem_region().

And "devm_release_mem_region()" here.

- pgmap is never freed on error after devm_kzalloc.

That is what the "devm_" in devm_kzalloc() does, free the memory on
driver-probe failure, or after the driver remove callback is invoked.

Got it.

{
+ struct dax_device *dax_dev __free(cleanup_dax) = NULL;
struct virtio_shm_region cache_reg;
struct dev_pagemap *pgmap;
bool have_cache;
@@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
if (!IS_ENABLED(CONFIG_FUSE_DAX))
return 0;
+ dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
+ if (IS_ERR(dax_dev)) {
+ int rc = PTR_ERR(dax_dev);
+
+ if (rc == -EOPNOTSUPP)
+ return 0;
+ return rc;
+ }

What is gained by moving this allocation here ?

The gain is to fail early in virtio_fs_setup_dax() since the fundamental
dependency of alloc_dax() success is not met. For example why let the
setup progress to devm_memremap_pages() when alloc_dax() is going to
return ERR_PTR(-EOPNOTSUPP).

What I don't know is whether there is a dependency requiring to do
devm_request_mem_region(), devm_kzalloc(), devm_memremap_pages()
before calling alloc_dax() ?

Those 3 calls are used to populate:

fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
fs->window_len = (phys_addr_t) cache_reg.len;

and then alloc_dax() takes "fs" as private data parameter. So it's
unclear to me whether we can swap the invocation order. I suspect
that it is not an issue because it is only used to populate
dax_dev->private, but I prefer to confirm this with you just to be
on the safe side.

[...]

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Next message: Lukas Wunner: "Re: [PATCH AUTOSEL 6.7 20/23] ARM: dts: Fix TPM schema violations"
Previous message: Pasha Tatashin: "[PATCH v2] iommu/iova: use named kmem_cache for iova magazines"
In reply to: Dan Williams: "Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime"
Next in thread: Dan Williams: "Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]