Re: [PATCH 0/6] Support DAX for device-mapper dm-linear devices

From: Kani, Toshimitsu
Date: Tue Jun 14 2016 - 14:00:16 EST


On Tue, 2016-06-14 at 11:41 -0400, Mike Snitzer wrote:
> On Tue, Jun 14 2016 at 9:50am -0400,
> Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
> > "Kani, Toshimitsu" <toshi.kani@xxxxxxx> writes:
> > > > I had dm-linear and md-raid0 support on my list of things to look
> > > > at, did you have raid0 in your plans?
> > >
> > > Yes, I hope to extend further and raid0 is a good candidate.
> >
> > dm-flakey would allow more xfstests test cases to run. I'd say that's
> > more important than linear or raid0. ;-)
>
> Regardless of which target(s) grow DAX support, the most pressing initial
> concern is getting the DM device stacking correct. And verifying that
> IOs that cross pmem device boundaries are being properly split by DM
> core (via drivers/md/dm.c:__split_and_process_non_flush()'s call to
> max_io_len).

Agreed. I've briefly tested stacking and it seems to work fine. As for IO
crossing pmem device boundaries, __split_and_process_non_flush() is used
when the device is mounted without the DAX option. With DAX, this case is
handled by dm_blk_direct_access(), which limits the returned size. This
leads the caller to iterate (read/write) or to fall back to a smaller size
(mmap page fault).

> My hope is to nail down the DM core and its dependencies in block etc.
> Doing so in terms of dm-linear doesn't seem like wasted effort
> considering you told me it'd be useful to have for pmem devices.

Yes, I think dm-linear is useful as it gives more flexibility; e.g., it
allows creating a large device from multiple pmem devices.

> > Also, the next step in this work is to then decide how to determine on
> > what NUMA node an LBA resides. We had discussed this at a prior
> > Plumbers conference, and I think the consensus was to use xattrs.
> > Toshi, do you also plan to do that work?
>
> How does the associated NUMA node relate to this? Does the
> DM request_queue need to be set up to only allocate from the NUMA node
> the pmem device is attached to? I recently added support for this to
> DM. But there will likely be some code needed to propagate the NUMA node
> id accordingly.

Each pmem device has a sysfs "numa_node" attribute, so tools like numactl
can be used to bind an application to the same locality as the pmem device
(since the CPU accesses pmem directly). This won't work well with a mapped
device, since it can be composed of devices from multiple localities.
Locality info would need to be managed on a per-file basis, as Jeff
mentioned.

Thanks,
-Toshi