Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory

From: Dan Williams
Date: Fri May 29 2015 - 11:48:16 EST


On Fri, May 29, 2015 at 5:07 AM, Ross Zwisler <zwisler@xxxxxxxxx> wrote:
> On Thu, 2015-05-28 at 16:20 -0700, H. Peter Anvin wrote:
>> On 05/28/2015 03:35 PM, Ross Zwisler wrote:
>> > Add a new PMEM API to x86, and allow for architectures that do not
>> > implement this API. Architectures that implement the PMEM API
>> > should
>> > define ARCH_HAS_PMEM_API in their kernel configuration and must
>> > provide
>> > implementations for persistent_copy(), persistent_flush() and
>> > persistent_sync().
>>
>> >
>> > void clflush_cache_range(void *addr, unsigned int size);
>> >
>>
>> No, no, no, no, no. Include the proper header file.
>
> I'm confused - I did inlcude <asm/cacheflush.h> in pmem.h? The line
> you're quoting above was an unmodified line from asm/cacheflush.h - I
> didn't redefine the prototype for clflush_cache_range() anywhere.
>
> Or does this comment mean that you think we shouldn't have an
> architecture agnostic PMEM API, and that you think the PMEM and ND_BLK
> drivers should just directly include asm/cacheflush.h and use the x86
> APIs directly?
>
>> > +static inline void arch_persistent_flush(void *vaddr, size_t size)
>> > +{
>> > + clflush_cache_range(vaddr, size);
>> > +}
>>
>> Shouldn't this really be using clwb() -- we really need a
>> clwb_cache_range() I guess?
>
> I think we will need a clwb_cache_range() for DAX, for when it responds
> to a msync() or fsync() call and needs to rip through a bunch of
> memory, writing it back to the DIMMs. I just didn't add it yet because
> I didn't have a consumer.
>
> It turns out that for the block aperture I/O case we really do need a
> flush instead of a writeback, though, so clflush_cache_range() is
> perfect. Here's the flow, which is a read from a block window
> aperture:
>
> 1) The nd_blk driver gets a read request, and selects a block window to
> use. It's entirely possible that this block window's aperture has
> clean cache lines associated with it in the processor cache hierarchy.
> It shouldn't be possible that it has dirty cache lines - we either
> just did a read, or we did a write and would have used NT stores.
>
> 2) Write a new address into the block window control register. The
> memory backing the aperture moves to the new address. Any clean lines
> held in the processor cache are now out of sync.
>
> 3) Flush the cache lines associated with the aperture. The lines are
> guaranteed to be clean, so the flush will just discard them and no
> writebacks will occur.
>
> 4) Read the contents of the block aperture, servicing the read.
>
> This is the read flow outlined it the "driver writer's guide":
>
> http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
>
>> Incidentally, clflush_cache_range() seems to have the same flaw as
>> the
>> proposed use case for clwb() had... if the buffer is aligned it will
>> needlessly flush the last line twice. It should really look
>> something
>> like this (which would be a good standalone patch):
>>
>> void clflush_cache_range(void *vaddr, unsigned int size)
>> {
>> void *vend = vaddr + size - 1;
>>
>> mb();
>>
>> vaddr = (void *)
>> ((unsigned long)vaddr
>> & ~(boot_cpu_data.x86_clflush_size - 1));
>>
>> for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
>> clflushopt(vaddr);
>>
>> mb();
>> }
>> EXPORT_SYMBOL_GPL(clflush_cache_range);
>
> Ah, yep, I saw the same thing and already submitted patches to fix. I
> think this change should be in the TIP tree:
>
> https://lkml.org/lkml/2015/5/11/336
>
>
>> I also note that with your implementation we have a wmb() in
>> arch_persistent_sync() and an mb() in arch_persistent_flush()...
>> surely one is redundant?
>
> Actually, the way that we need to use arch_persistent_flush() for our
> block window read case, the fencing works out so that nothing is
> redundant. We never actually use both a persistent_sync() call and a
> persistent_flush() call during the same I/O. Reads use
> persistent_flush() to invalidate obsolete lines in the cache before
> reading real data from the aperture of ete DIMM, and writes use a bunch
> of NT stores followed by a persistent_sync().
>
> The PMEM driver doesn't use persistent_flush() at all - this API is
> only needed for the block window read case.

Then that's not a "persistence flush", that's a shootdown of the
previous mmio block window setting. If it's only for BLK reads I
think we need to use clflush_cache_range() directly. Given that BLK
mode already depends on ACPI I think it's fine for now to make BLK
mode depend on x86. Otherwise, we need a new cross-arch generic cache
flush primitive like io_flush_cache_range() and have BLK mode depend
on ARCH_HAS_IO_FLUSH_CACHE_RANGE.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/