Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices

From: Mike Snitzer
Date: Tue Jun 14 2016 - 22:35:12 EST


On Tue, Jun 14 2016 at 10:07pm -0400,
Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

> On Tue, Jun 14, 2016 at 6:46 PM, Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
> > On Tue, Jun 14 2016 at 4:19pm -0400,
> > Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
> >
> >> Mike Snitzer <snitzer@xxxxxxxxxx> writes:
> >>
> >> > On Tue, Jun 14 2016 at 9:50am -0400,
> >> > Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
> >> >
> >> >> "Kani, Toshimitsu" <toshi.kani@xxxxxxx> writes:
> >> >>
> >> >> >> I had dm-linear and md-raid0 support on my list of things to look at,
> >> >> >> did you have raid0 in your plans?
> >> >> >
> >> >> > Yes, I hope to extend further and raid0 is a good candidate.
> >> >>
> >> >> dm-flakey would allow more xfstests test cases to run. I'd say that's
> >> >> more important than linear or raid0. ;-)
> >> >
> >> > Regardless of which target(s) grow DAX support the most pressing initial
> >> > concern is getting the DM device stacking correct. And verifying that
> >> > IO that crosses pmem device boundaries is being properly split by DM
> >> > core (via drivers/md/dm.c:__split_and_process_non_flush()'s call to
> >> > max_io_len).
> >>
> >> That was a tongue-in-cheek comment. You're reading way too much into
> >> it.
> >>
> >> >> Also, the next step in this work is to then decide how to determine on
> >> >> what numa node an LBA resides. We had discussed this at a prior
> >> >> plumbers conference, and I think the consensus was to use xattrs.
> >> >> Toshi, do you also plan to do that work?
> >> >
> >> > How does the associated NUMA node relate to this? Does the
> >> > DM request_queue need to be set up to only allocate from the NUMA node
> >> > the pmem device is attached to? I recently added support for this to
> >> > DM. But there will likely be some code needed to propagate the NUMA node
> >> > id accordingly.
> >>
> >> I assume you mean allocate memory (the volatile kind). That should work
> >> the same between pmem and regular block devices, no?
> >
> > This is the commit I made to train DM to be numa node aware:
> > 115485e83f497fdf9b4 ("dm: add 'dm_numa_node' module parameter")
>
> Hmm, but this is global for all DM device instances.

Right, only because I didn't have a convenient way to allow the user to
specify it on a per-device level. But I'll defer skinning that cat for
now since in this pmem case we'd inherit from the underlying device(s).

> > As is the DM code is focused on memory allocations. But I think blk-mq
> > may use the NUMA node via tag_set->numa_node. But that is moot
> > given pmem is bio-based right?
>
> Right.
>
> >
> > Steps could be taken to make all threads DM creates for a given device
> > get pinned to the specified NUMA node too.
>
> I think it would be useful if a DM instance inherited the numa node
> from the component devices by default (assuming they're all from the
> same node). A "dev_to_node(disk_to_dev(disk))" conversion works for
> pmem devices.

OK, I can look to make that happen.
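
As a rough userspace sketch of that default (not actual DM code;
common_numa_node() and NO_NODE are made-up names, and the node ids are
assumed to have been collected already, e.g. via the
dev_to_node(disk_to_dev(disk)) conversion mentioned above):

```c
#include <stddef.h>

#define NO_NODE (-1)	/* hypothetical stand-in for the kernel's NUMA_NO_NODE */

/*
 * Return the NUMA node shared by every component device, or NO_NODE if
 * the list is empty or the devices do not all report the same node.
 * An unknown node on the first device also yields NO_NODE, since any
 * mismatch or an all-unknown list falls through to that value.
 */
int common_numa_node(const int *nodes, size_t count)
{
	size_t i;
	int node;

	if (count == 0)
		return NO_NODE;

	node = nodes[0];
	for (i = 1; i < count; i++) {
		if (nodes[i] != node)
			return NO_NODE;	/* mixed nodes: nothing to inherit */
	}
	return node;
}
```

Falling back to "no node" on any mismatch keeps the default conservative:
the DM device only gets pinned when there is exactly one sensible answer.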

> As far as I understand, Jeff wants to go further and have a linear
> span across component devices from different nodes with an interface
> to do an LBA-to-numa-node conversion.

All that variability makes DM's ability to do anything sane with it
close to impossible considering memory pools, threads, etc. are all
pinned during the first activation of the DM device.
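
For reference, the LBA-to-node conversion itself is the easy part. A
minimal userspace sketch, assuming the linear span is described by a
table of (start, length, node) segments (struct span_segment and
lba_to_node() are hypothetical names, not an existing DM interface):

```c
#include <stddef.h>
#include <stdint.h>

#define LBA_NO_NODE (-1)	/* hypothetical "unknown node" value */

/* One component device's slice of the linear span (illustrative only). */
struct span_segment {
	uint64_t start;	/* first LBA covered by this segment */
	uint64_t len;	/* number of LBAs in this segment */
	int node;	/* NUMA node of the backing device */
};

/*
 * Return the NUMA node backing the given LBA, or LBA_NO_NODE if the LBA
 * falls outside every segment. A linear scan is enough for a sketch; a
 * real table would more likely be sorted and searched.
 */
int lba_to_node(const struct span_segment *segs, size_t count, uint64_t lba)
{
	size_t i;

	for (i = 0; i < count; i++) {
		/* lba - start cannot wrap because lba >= start is checked first */
		if (lba >= segs[i].start && lba - segs[i].start < segs[i].len)
			return segs[i].node;
	}
	return LBA_NO_NODE;
}
```

The lookup says nothing about the hard part, which is what DM should do
with per-LBA answers once its pools and threads are already pinned.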