Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap

From: Dan Williams
Date: Wed May 18 2016 - 13:27:04 EST


On Wed, May 18, 2016 at 10:12 AM, Paul E. McKenney
<paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> On Wed, May 18, 2016 at 11:15:11AM +0200, Hannes Reinecke wrote:
>> On 05/18/2016 11:10 AM, Paul Mackerras wrote:
>> > On Wed, May 18, 2016 at 10:07:19AM +0200, Hannes Reinecke wrote:
>> >> On 05/18/2016 12:19 AM, Dan Williams wrote:
>> >>> On Tue, May 17, 2016 at 3:57 AM, Johannes Thumshirn <jthumshirn@xxxxxxx> wrote:
>> >>>> On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
>> >>>>> The "Device DAX" core enables dax mappings of performance / feature
>> >>>>> differentiated memory. An open mapping or file handle keeps the backing
>> >>>>> struct device live, but new mappings are only possible while the device
>> >>>>> is enabled. Faults are handled under rcu_read_lock to synchronize
>> >>>>> with the enabled state of the device.
>> >>>>>
>> >>>>> Similar to the filesystem-dax case the backing memory may optionally
>> >>>>> have struct page entries. However, unlike fs-dax there is no support
>> >>>>> for private mappings, or mappings that are not backed by media (see
>> >>>>> use of zero-page in fs-dax).
>> >>>>>
>> >>>>> Mappings are always guaranteed to match the alignment of the dax_region.
>> >>>>> If the dax_region is configured to have a 2MB alignment, all mappings
>> >>>>> are guaranteed to be backed by a pmd entry. Contrast this determinism
>> >>>>> with the fs-dax case where pmd mappings are opportunistic. If userspace
>> >>>>> attempts to force a misaligned mapping, the driver will fail the mmap
>> >>>>> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
>> >>>>> like MAP_PRIVATE mappings.
>> >>>>>
>> >>>>> Cc: Jeff Moyer <jmoyer@xxxxxxxxxx>
>> >>>>> Cc: Christoph Hellwig <hch@xxxxxx>
>> >>>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> >>>>> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>> >>>>> Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
>> >>>>> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
>> >>>>> ---
>> >>>>> drivers/dax/Kconfig | 1
>> >>>>> drivers/dax/dax.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>>>> mm/huge_memory.c | 1
>> >>>>> mm/hugetlb.c | 1
>> >>>>> 4 files changed, 319 insertions(+)
>> >>>>>
>> >>>>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
>> >>>>> index 86ffbaa891ad..cedab7572de3 100644
>> >>>>> --- a/drivers/dax/Kconfig
>> >>>>> +++ b/drivers/dax/Kconfig
>> >>>>> @@ -1,6 +1,7 @@
>> >>>>> menuconfig DEV_DAX
>> >>>>> tristate "DAX: direct access to differentiated memory"
>> >>>>> default m if NVDIMM_DAX
>> >>>>> + depends on TRANSPARENT_HUGEPAGE
>> >>>>> help
>> >>>>> Support raw access to differentiated (persistence, bandwidth,
>> >>>>> latency...) memory via an mmap(2) capable character
>> >>>>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
>> >>>>> index 8207fb33a992..b2fe8a0ce866 100644
>> >>>>> --- a/drivers/dax/dax.c
>> >>>>> +++ b/drivers/dax/dax.c
>> >>>>> @@ -49,6 +49,7 @@ struct dax_region {
>> >>>>> * @region - parent region
>> >>>>> * @dev - device backing the character device
>> >>>>> * @kref - enable this data to be tracked in filp->private_data
>> >>>>> + * @alive - !alive + rcu grace period == no new mappings can be established
>> >>>>> * @id - child id in the region
>> >>>>> * @num_resources - number of physical address extents in this device
>> >>>>> * @res - array of physical address ranges
>> >>>>> @@ -57,6 +58,7 @@ struct dax_dev {
>> >>>>> struct dax_region *region;
>> >>>>> struct device *dev;
>> >>>>> struct kref kref;
>> >>>>> + bool alive;
>> >>>>> int id;
>> >>>>> int num_resources;
>> >>>>> struct resource res[0];
>> >>>>> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
>> >>>>>
>> >>>>> dev_dbg(dev, "%s\n", __func__);
>> >>>>>
>> >>>>> + /* disable and flush fault handlers, TODO unmap inodes */
>> >>>>> + dax_dev->alive = false;
>> >>>>> + synchronize_rcu();
>
> If you need to wait for fault handlers, you need synchronize_sched()
> instead of synchronize_rcu(). Please note that synchronize_rcu() is
> guaranteed to wait only for tasks that have done rcu_read_lock() to reach
> the corresponding rcu_read_unlock(). In contrast, synchronize_sched()
> is guaranteed to wait for any non-idle preempt-disable region of code
> to complete, regardless of exactly what is disabling preemptiong.
>
> And the "non-idle" is not an idle qualifier. If you need to wait on fault
> handlers that somehow occur from an idle hardware thread, you will need
> those fault handlers to do rcu_irq_enter() on entry and rcu_irq_exit()
> on exit. (My guess is that you cannot take faults in the idle loop,
> but I have learned not to trust such guesses all that far.)
>
> And last, but definitely not least, synchronize_sched() waits only
> for pre-existing preempt-disable regions of code. So if you do
> synchronize_sched(), and immediately after a fault handler starts,
> synchronize_sched() won't necessarily wait on it. However, you -are-
> guaranteed that synchronize_shced() -will- wait for any fault handler
> that might possibly see dax_dev->alive with a non-false value.
>
> Are these the guarantees you are looking for? (Yes, I did recently
> watch "A New Hope". Why do you ask?)

Spoken like a true rcu-Jedi.

So in this case the fault handlers are indeed running under
rcu_read_lock(), and I can't fathom how these faults would trigger
from an idle thread...

>
>> >>>>> +
>> >>>>
>> >>>> IIRC RCU is only protecting a pointer, not the content of the pointer, so this
>> >>>> looks wrong to me.
>
> RCU can be, and usually is, used to protect pointers, but it can be and
> sometimes is used for other things as well. At its core, RCU is about
> waiting for pre-existing RCU readers to complete.
>
>> >>> The driver is using RCU to guarantee that all currently running fault
>> >>> handlers have either completed or will see the new state of ->alive
>> >>> when they start. Reference counts are protecting the actual dax_dev
>> >>> object.
>> >>>
>> >> Hmm.
>> >> This is the same 'creative' RCU usage Mike Snitzer has been trying
>> >> when trying to improve device-mapper performance.
>
> To repeat, unless all your fault handlers begin with rcu_read_lock()
> and end with rcu_read_unlock(), and as long as you don't care about not
> waiting for fault handlers that are currently executing just before
> the rcu_read_lock() and just after the rcu_read_unlock(), you need
> synchronize_sched() rather than synchronize_rcu() for this job.
>
>> >> >From my understanding RCU is protecting the _pointer_, not the
>> >> values of the structure pointed to.
>> >> IOW we are guaranteed to have a valid pointer at any time.
>> >> But at the same time _no_ guarantee is made about the _contents_ of
>> >> the structure.
>> >> It might well be that using 'synchronize_rcu' giving you similar
>> >> results (as synchronize_rcu() is essentially waiting a SMP grace
>> >> period, after which all CPUs should be seeing the update).
>> >> However, I haven't been able to find that this is a guaranteed
>> >> behaviour.
>> >> So from my understanding you have to use locking primitives
>> >> protecting the contents of the structure or exchange the _entire_
>> >> structure if you want to rely on RCU here.
>> >>
>> >> Can we get some clarification here?
>
> Maybe... What exactly is your synchronization design needing here?
>
>> >> Paul?
>> >
>> > I think you want the other Paul, Paul McKenney.
>> >
>> I think you are in fact right.
>> Sorry for the Paul-confusion :-)
>
> Did I keep my end of the confusion up? ;-)

Yes, I think we're good, but please double check I am not mistaken in
the following clarification comment:

@@ -150,6 +152,16 @@ static void destroy_dax_dev(void *_dev)

dev_dbg(dev, "%s\n", __func__);

+ /*
+ * Note, rcu is not protecting the liveness of dax_dev, rcu is
+ * ensuring that any fault handlers that might have seen
+ * dax_dev->alive == true, have completed. Any fault handlers
+ * that start after synchronize_rcu() has started will abort
+ * upon seeing dax_dev->alive == false.
+ */
+ dax_dev->alive = false;
+ synchronize_rcu();
+
get_device(dev);
device_unregister(dev);
ida_simple_remove(&dax_region->ida, dax_dev->id);
@@ -173,6 +185,7 @@ int devm_create_dax_dev(struct dax_region
*dax_region, struct resource *re