Re: [PATCH] dax, pmem: add support for msync

From: Boaz Harrosh
Date: Wed Sep 02 2015 - 06:27:43 EST


On 09/02/2015 08:17 AM, Dave Chinner wrote:
> On Tue, Sep 01, 2015 at 09:19:45PM -0600, Ross Zwisler wrote:
>> On Wed, Sep 02, 2015 at 08:21:20AM +1000, Dave Chinner wrote:
>>> Which means applications that should "just work" without
>>> modification on DAX are now subtly broken and don't actually
>>> guarantee data is safe after a crash. That's a pretty nasty
>>> landmine, and goes against *everything* we've claimed about using
>>> DAX with existing applications.
>>>
>>> That's wrong, and needs fixing.
>>
>> I agree that we need to fix fsync as well, and that the fsync solution could
>> be used to implement msync if we choose to go that route. I think we might
>> want to consider keeping the msync and fsync implementations separate, though,
>> for two reasons.
>>
>> 1) The current msync implementation is much more efficient than what will be
>> needed for fsync. Fsync will need to call into the filesystem, traverse all
>> the blocks, get kernel virtual addresses from those and then call
>> wb_cache_pmem() on those kernel addresses. I think this is a necessary evil
>> for fsync since you don't have a VMA, but for msync we do and we can just
>> flush using the user addresses without any fs lookups.
>
> Yet you're ignoring the fact that flushing the entire range of the
> relevant VMAs may not be very efficient. It may be a very
> large mapping with only a few pages that need flushing from the
> cache, but you still iterate the mappings flushing GB ranges from
> the cache at a time.
>

So actually you are wrong about this. We have a working system and as part
of our testing rig we do performance measurements, constantly. Our random
mmap 4k writes test preforms very well and is in par with the random-direct-write
implementation even though on every unmap, we do a VMA->start/end cl_flushing.

The cl_flush operation is a no-op if the cacheline is not dirty and is a
memory bus storm with all the CLs that are dirty. So the only cost
is the iteration of vma->start-to-vma->end i+=64

Let us please agree that we should do the correct thing for now, and let
the complains roll in about the slowness later. You will find that my
proposed solution is not so slow.

> We don't need struct pages to track page dirty state. We already
> have a method for doing this that does not rely on having a struct
> page and can be used for efficient lookup of exact dirty ranges. i.e
> the per-page dirty tag that is kept in the mapping radix tree. It's
> fine grained, and extremely efficient in terms of lookups to find
> dirty pages.
>

In fact you will find that this solution is actually slower. Because
you need an extra lock on every major-page-fault and you need to
maintain the radix-tree populated. Today with dax the radix-tree is
completely empty, it is only ever used if one reads in holes. But we
found that this is not common at all. Usually mmap applications read
what is really there.

So the extra work per page will be more than actually doing the fast
no-op cl_flush.

> Indeed, the mapping tree tags were specifically designed to avoid
> this "fsync doesn't know what range to flush" problem for normal
> files. That same problem still exists here for msync - these patches
> are just hitting it with a goddamn massive hammer "because it is
> easy" rather than attempting to do the flushing efficiently.
>

You come from the disk world and every extra synced block is a huge waist.
This is memory, is not what you are used too.

Again we have benchmarks and mmap works very very well. Including that
contraption of cl_flushing vma->start..vma->end

I'll try and open up some time to send a rough draft. My boss will kill me,
but he'll survive.

Thanks
Boaz

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/