Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
From: Dan Williams
Date: Sat Nov 07 2015 - 04:03:05 EST
On Sat, Nov 7, 2015 at 12:38 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> On Sat, 7 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> > On Fri, 6 Nov 2015, H. Peter Anvin wrote:
>> >> On 11/06/15 15:17, Dan Williams wrote:
>> >> >>
>> >> >> Is it really required to do that on all cpus?
>> >> >
>> >> > I believe it is, but I'll double check.
>> >> >
>> >>
>> >> It's required on all CPUs on which the DAX memory may have been dirtied.
>> >> This is similar to the way we flush TLBs.
>> >
>> > Right. And that's exactly the problem: "may have been dirtied"
>> >
>> > If DAX is used on 50% of the CPUs and the other 50% are plumming away
>> > happily in user space or run low latency RT tasks w/o ever touching
>> > it, then having an unconditional flush on ALL CPUs is just wrong
>> > because you penalize the uninvolved cores with a completely pointless
>> > SMP function call and drain their caches.
>> >
>>
>> It's not wrong and pointless, it's all we have available outside of
>> having the kernel remember every virtual address that might have been
>> touched since the last fsync and sit in a loop flushing those virtual
>> address cache line by cache line.
>>
>> There is a crossover point where wbinvd is better than a clwb loop
>> that needs to be determined.
>
> This is a totally different issue and I'm well aware that there is a
> tradeoff between wbinvd() and a clwb loop. wbinvd() might be more
> efficient performance wise above some number of cache lines, but then
> again it's draining all unrelated stuff as well, which can result in a
> even larger performance hit.
>
> Now what really concerns me more is that you just unconditionally
> flush on all CPUs whether they were involved in that DAX stuff or not.
>
> Assume that DAX using application on CPU 0-3 and some other unrelated
> workload on CPU4-7. That flush will
>
> - Interrupt CPU4-7 for no reason (whether you use clwb or wbinvd)
>
> - Drain the cache for CPU4-7 for no reason if done with wbinvd()
>
> - Render Cache Allocation useless if done with wbinvd()
>
> And we are not talking about a few micro seconds here. Assume that
> CPU4-7 have cache allocated and it's mostly dirty. We've measured the
> wbinvd() impact on RT, back then when the graphic folks used it as a
> big hammer. The maximum latency spike was way above one millisecond.
>
> We have similar issues with TLB flushing, but there we
>
> - are tracking where it was used and never flush on innocent cpus
>
> - one can design his application in a way that it uses different
> processes so cross CPU flushing does not happen
>
> I know that this is not an easy problem to solve, but you should be
> aware that various application scenarios are going to be massively
> unhappy about that.
>
Thanks for that explanation. Peter had alluded to it at KS, but I
indeed did not know that it was as horrible as milliseconds of
latency, hmm...
One other mitigation that follows on with Dave's plan of per-inode DAX
control, is to also track when an inode has a writable DAX mmap
established. With that we could have a REQ_DAX flag to augment
REQ_FLUSH to potentially reduce committing violence on the cache. In
an earlier thread I also recall an idea to have an mmap flag that an
app can use to say "yes, I'm doing a writable DAX mapping, but I'm
taking care of the cache myself". We could track innocent cpus, but
I'm thinking that would be a core change to write-protect pages when a
thread migrates? In general I feel there's a limit for how much
hardware workaround is reasonable to do in the core kernel vs waiting
for the platform to offer better options...
Sorry if I'm being a bit punchy, but I'm still feeling like I need to
defend the notion that DAX may just need to be turned off in some
situations.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/