Re: [PATCH v2 2/3] /dev/dax, core: file operations and dax-mmap

From: Paul E. McKenney
Date: Wed May 18 2016 - 13:57:11 EST


On Wed, May 18, 2016 at 10:26:57AM -0700, Dan Williams wrote:
> On Wed, May 18, 2016 at 10:12 AM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > On Wed, May 18, 2016 at 11:15:11AM +0200, Hannes Reinecke wrote:
> >> On 05/18/2016 11:10 AM, Paul Mackerras wrote:
> >> > On Wed, May 18, 2016 at 10:07:19AM +0200, Hannes Reinecke wrote:
> >> >> On 05/18/2016 12:19 AM, Dan Williams wrote:
> >> >>> On Tue, May 17, 2016 at 3:57 AM, Johannes Thumshirn <jthumshirn@xxxxxxx> wrote:
> >> >>>> On Sat, May 14, 2016 at 11:26:29PM -0700, Dan Williams wrote:
> >> >>>>> The "Device DAX" core enables dax mappings of performance / feature
> >> >>>>> differentiated memory. An open mapping or file handle keeps the backing
> >> >>>>> struct device live, but new mappings are only possible while the device
> >> >>>>> is enabled. Faults are handled under rcu_read_lock to synchronize
> >> >>>>> with the enabled state of the device.
> >> >>>>>
> >> >>>>> Similar to the filesystem-dax case the backing memory may optionally
> >> >>>>> have struct page entries. However, unlike fs-dax there is no support
> >> >>>>> for private mappings, or mappings that are not backed by media (see
> >> >>>>> use of zero-page in fs-dax).
> >> >>>>>
> >> >>>>> Mappings are always guaranteed to match the alignment of the dax_region.
> >> >>>>> If the dax_region is configured to have a 2MB alignment, all mappings
> >> >>>>> are guaranteed to be backed by a pmd entry. Contrast this determinism
> >> >>>>> with the fs-dax case where pmd mappings are opportunistic. If userspace
> >> >>>>> attempts to force a misaligned mapping, the driver will fail the mmap
> >> >>>>> attempt. See dax_dev_check_vma() for other scenarios that are rejected,
> >> >>>>> like MAP_PRIVATE mappings.
> >> >>>>>
> >> >>>>> Cc: Jeff Moyer <jmoyer@xxxxxxxxxx>
> >> >>>>> Cc: Christoph Hellwig <hch@xxxxxx>
> >> >>>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> >> >>>>> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> >> >>>>> Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> >> >>>>> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> >> >>>>> ---
> >> >>>>> drivers/dax/Kconfig | 1
> >> >>>>> drivers/dax/dax.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >>>>> mm/huge_memory.c | 1
> >> >>>>> mm/hugetlb.c | 1
> >> >>>>> 4 files changed, 319 insertions(+)
> >> >>>>>
> >> >>>>> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> >> >>>>> index 86ffbaa891ad..cedab7572de3 100644
> >> >>>>> --- a/drivers/dax/Kconfig
> >> >>>>> +++ b/drivers/dax/Kconfig
> >> >>>>> @@ -1,6 +1,7 @@
> >> >>>>> menuconfig DEV_DAX
> >> >>>>> tristate "DAX: direct access to differentiated memory"
> >> >>>>> default m if NVDIMM_DAX
> >> >>>>> + depends on TRANSPARENT_HUGEPAGE
> >> >>>>> help
> >> >>>>> Support raw access to differentiated (persistence, bandwidth,
> >> >>>>> latency...) memory via an mmap(2) capable character
> >> >>>>> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
> >> >>>>> index 8207fb33a992..b2fe8a0ce866 100644
> >> >>>>> --- a/drivers/dax/dax.c
> >> >>>>> +++ b/drivers/dax/dax.c
> >> >>>>> @@ -49,6 +49,7 @@ struct dax_region {
> >> >>>>> * @region - parent region
> >> >>>>> * @dev - device backing the character device
> >> >>>>> * @kref - enable this data to be tracked in filp->private_data
> >> >>>>> + * @alive - !alive + rcu grace period == no new mappings can be established
> >> >>>>> * @id - child id in the region
> >> >>>>> * @num_resources - number of physical address extents in this device
> >> >>>>> * @res - array of physical address ranges
> >> >>>>> @@ -57,6 +58,7 @@ struct dax_dev {
> >> >>>>> struct dax_region *region;
> >> >>>>> struct device *dev;
> >> >>>>> struct kref kref;
> >> >>>>> + bool alive;
> >> >>>>> int id;
> >> >>>>> int num_resources;
> >> >>>>> struct resource res[0];
> >> >>>>> @@ -150,6 +152,10 @@ static void destroy_dax_dev(void *_dev)
> >> >>>>>
> >> >>>>> dev_dbg(dev, "%s\n", __func__);
> >> >>>>>
> >> >>>>> + /* disable and flush fault handlers, TODO unmap inodes */
> >> >>>>> + dax_dev->alive = false;
> >> >>>>> + synchronize_rcu();
> >
> > If you need to wait for fault handlers, you need synchronize_sched()
> > instead of synchronize_rcu(). Please note that synchronize_rcu() is
> > guaranteed to wait only for tasks that have done rcu_read_lock() to reach
> > the corresponding rcu_read_unlock(). In contrast, synchronize_sched()
> > is guaranteed to wait for any non-idle preempt-disable region of code
> > to complete, regardless of exactly what is disabling preemptiong.
> >
> > And the "non-idle" is not an idle qualifier. If you need to wait on fault
> > handlers that somehow occur from an idle hardware thread, you will need
> > those fault handlers to do rcu_irq_enter() on entry and rcu_irq_exit()
> > on exit. (My guess is that you cannot take faults in the idle loop,
> > but I have learned not to trust such guesses all that far.)
> >
> > And last, but definitely not least, synchronize_sched() waits only
> > for pre-existing preempt-disable regions of code. So if you do
> > synchronize_sched(), and immediately after a fault handler starts,
> > synchronize_sched() won't necessarily wait on it. However, you -are-
> > guaranteed that synchronize_shced() -will- wait for any fault handler
> > that might possibly see dax_dev->alive with a non-false value.
> >
> > Are these the guarantees you are looking for? (Yes, I did recently
> > watch "A New Hope". Why do you ask?)
>
> Spoken like a true rcu-Jedi.

;-)

> So in this case the fault handlers are indeed running under
> rcu_read_lock(), and I can't fathom how these faults would trigger
> from an idle thread...

OK, given all fault handlers use rcu_read_lock() and either:

1. As you say, faults never trigger from idle, or

2. The fault handlers call rcu_irq_enter() on entry and
rcu_irq_exit() on exit

Then, yes, you can keep on using synchronize_rcu().

> >> >>>>> +
> >> >>>>
> >> >>>> IIRC RCU is only protecting a pointer, not the content of the pointer, so this
> >> >>>> looks wrong to me.
> >
> > RCU can be, and usually is, used to protect pointers, but it can be and
> > sometimes is used for other things as well. At its core, RCU is about
> > waiting for pre-existing RCU readers to complete.
> >
> >> >>> The driver is using RCU to guarantee that all currently running fault
> >> >>> handlers have either completed or will see the new state of ->alive
> >> >>> when they start. Reference counts are protecting the actual dax_dev
> >> >>> object.
> >> >>>
> >> >> Hmm.
> >> >> This is the same 'creative' RCU usage Mike Snitzer has been trying
> >> >> when trying to improve device-mapper performance.
> >
> > To repeat, unless all your fault handlers begin with rcu_read_lock()
> > and end with rcu_read_unlock(), and as long as you don't care about not
> > waiting for fault handlers that are currently executing just before
> > the rcu_read_lock() and just after the rcu_read_unlock(), you need
> > synchronize_sched() rather than synchronize_rcu() for this job.
> >
> >> >> >From my understanding RCU is protecting the _pointer_, not the
> >> >> values of the structure pointed to.
> >> >> IOW we are guaranteed to have a valid pointer at any time.
> >> >> But at the same time _no_ guarantee is made about the _contents_ of
> >> >> the structure.
> >> >> It might well be that using 'synchronize_rcu' giving you similar
> >> >> results (as synchronize_rcu() is essentially waiting a SMP grace
> >> >> period, after which all CPUs should be seeing the update).
> >> >> However, I haven't been able to find that this is a guaranteed
> >> >> behaviour.
> >> >> So from my understanding you have to use locking primitives
> >> >> protecting the contents of the structure or exchange the _entire_
> >> >> structure if you want to rely on RCU here.
> >> >>
> >> >> Can we get some clarification here?
> >
> > Maybe... What exactly is your synchronization design needing here?
> >
> >> >> Paul?
> >> >
> >> > I think you want the other Paul, Paul McKenney.
> >> >
> >> I think you are in fact right.
> >> Sorry for the Paul-confusion :-)
> >
> > Did I keep my end of the confusion up? ;-)
>
> Yes, I think we're good, but please double check I am not mistaken in
> the following clarification comment:
>
> @@ -150,6 +152,16 @@ static void destroy_dax_dev(void *_dev)
>
> dev_dbg(dev, "%s\n", __func__);
>
> + /*
> + * Note, rcu is not protecting the liveness of dax_dev, rcu is
> + * ensuring that any fault handlers that might have seen
> + * dax_dev->alive == true, have completed. Any fault handlers
> + * that start after synchronize_rcu() has started will abort
> + * upon seeing dax_dev->alive == false.
> + */
> + dax_dev->alive = false;
> + synchronize_rcu();
> +
> get_device(dev);
> device_unregister(dev);
> ida_simple_remove(&dax_region->ida, dax_dev->id);
> @@ -173,6 +185,7 @@ int devm_create_dax_dev(struct dax_region
> *dax_region, struct resource *re

Given your comment and your statement that fault handlers never happen
on idle CPUs and are always protected by rcu_read_lock(), from an RCU
point of view:

Acked-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>

Thanx, Paul