Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing

From: Nadav Amit
Date: Fri Jun 07 2019 - 01:33:03 EST


> On May 31, 2019, at 3:07 PM, Nadav Amit <namit@xxxxxxxxxx> wrote:
>
>> On May 31, 2019, at 2:47 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>
>>
>> On May 31, 2019, at 2:33 PM, Nadav Amit <namit@xxxxxxxxxx> wrote:
>>
>>>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>>>
>>>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@xxxxxxxxxx> wrote:
>>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>>> the following conditions are met:
>>>>>
>>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>> cause machine-checks.
>>>>>
>>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>> NMI handlers and kprobes would avoid accessing userspace.
>>>>
>>>> I think I need to ask the big picture question. When someone calls
>>>> flush_tlb_mm_range() (or the other entry points), if no page tables
>>>> were freed, they want the guarantee that future accesses (initiated
>>>> observably after the flush returns) will not use paging entries that
>>>> were replaced by stores ordered before flush_tlb_mm_range(). We also
>>>> need the guarantee that any effects from any memory access using the
>>>> old paging entries will become globally visible before
>>>> flush_tlb_mm_range().
>>>>
>>>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>>>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>>>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>>>> dirty bit? An interrupt entry today is fully serializing by the time
>>>> it finishes, but interrupt entries are epicly slow, and I don't know
>>>> if the APIC waits long enough. Heck, what if IRQs are off on the
>>>> remote CPU? There are a handful of places where we touch user memory
>>>> with IRQs off, and it's (sadly) possible for user code to turn off
>>>> IRQs with iopl().
>>>>
>>>> I *think* that Intel has stated recently that SMT siblings are
>>>> guaranteed to stop speculating when you write to the APIC ICR to poke
>>>> them, but SMT is very special.
>>>>
>>>> My general conclusion is that I think the code needs to document what
>>>> is guaranteed and why.
>>>
>>> I think I might have managed to confuse you with a bug I made (last minute
>>> bug when I was doing some cleanup). This bug does not affect the performance
>>> much, but it might led you to think that I use the APIC sending as
>>> synchronization.
>>>
>>> The idea is not for us to rely on write to ICR as something serializing. The
>>> flow should be as follows:
>>>
>>>
>>> CPU0 CPU1
>>>
>>> flush_tlb_mm_range()
>>> __smp_call_function_many()
>>> [ prepare call_single_data (csd) ]
>>> [ lock csd ]
>>> [ send IPI ]
>>> (*)
>>> [ wait for csd to be unlocked ]
>>> [ interrupt ]
>>> [ copy csd info to stack ]
>>> [ csd unlock ]
>>> [ find csd is unlocked ]
>>> [ continue (**) ]
>>> [ flush TLB ]
>>>
>>>
>>> At (**) the pages might be recycled, written-back to disk, etc. Note that
>>> during (*), CPU0 might do some local TLB flushes, making it very likely that
>>> CSD will be unlocked by the time it gets there.
>>>
>>> As you can see, I donât rely on any special micro-architectural behavior.
>>> The synchronization is done purely in software.
>>>
>>> Does it make more sense now?
>>
>> Yes. Have you benchmarked this?
>
> Partially. Numbers are indeed worse. Here are preliminary results, comparing
> to v1 (concurrent):
>
> n_threads before concurrent +async
> --------- ------ ---------- ------
> 1 661 663 663
> 2 1436 1225 (-14%) 1115 (-22%)
> 4 1571 1421 (-10%) 1289 (-18%)
>
> Note that the benefit of âasync" would be greater if the initiator does not
> flush the TLB at all. This might happen in the case of kswapd, for example.
> Let me try some micro-optimizations first, run more benchmarks and get back
> to you.

So I ran some more benchmarking (my benchmark is not very suitable), and tried
more stuff that did not help (checking for more work before returning from the
IPI handler, and avoid redundant IPIs in such case).

Anyhow, with a fixed version, I ran a more standard benchmark on DAX:

$ mkfs.ext4 /dev/pmem0
$ mount -o dax /dev/pmem0 /mnt/mem
$ cd /mnt/mem
$ bash -c 'echo 0 > /sys/devices/platform/e820_pmem/ndbus0/region0/namespace0.0/block/pmem0/dax/write_cacheâ
$ sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync prepare
$ sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

( as you can see, I disabled the write-cache, since my machine does not have
clwb/clflushopt and clflush appears to become a bottleneck otherwise )


The results are:
events (avg/stddev)
-------------------
base 1263689.0000/11481.10
concurrent 1310123.5000/19205.79 (+3.6%)
concurrent + async 1326750.2500/24563.61 (+4.9%)

So which version do you want me to submit? With or without the async part?