Re: [PATCH] mm: Add new vma flag VM_LOCAL_CPU
From: Boaz Harrosh
Date: Tue May 15 2018 - 13:39:28 EST
On 15/05/18 03:44, Matthew Wilcox wrote:
> On Mon, May 14, 2018 at 02:49:01PM -0700, Andrew Morton wrote:
>> On Mon, 14 May 2018 20:28:01 +0300 Boaz Harrosh <boazh@xxxxxxxxxx> wrote:
>>> In this project we utilize a per-core server thread so everything
>>> is kept local. If we use the regular zap_ptes() API All CPU's
>>> are scheduled for the unmap, though in our case we know that we
>>> have only used a single core. The regular zap_ptes adds a very big
>>> latency on every operation and mostly kills the concurrency of the
>>> over all system. Because it imposes a serialization between all cores
>>
>> I'd have thought that in this situation, only the local CPU's bit is
>> set in the vma's mm_cpumask() and the remote invalidations are not
>> performed. Is that a misunderstanding, or is all that stuff not working
>> correctly?
>
> I think you misunderstand Boaz's architecture. He has one thread per CPU,
> so every bit will be set in the mm's (not vma's) mm_cpumask.
>
Hi Andrew, Matthew
Yes I have been trying to investigate and trace this for days.
Please see the code below:
> #define flush_tlb_range(vma, start, end) \
> flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
The mm_struct @mm below comes from here vma->vm_mm
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index e055d1a..1d398a0 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -611,39 +611,40 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
> void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> unsigned long end, unsigned long vmflag)
> {
> int cpu;
>
> struct flush_tlb_info info __aligned(SMP_CACHE_BYTES) = {
> .mm = mm,
> };
>
> cpu = get_cpu();
>
> /* This is also a barrier that synchronizes with switch_mm(). */
> info.new_tlb_gen = inc_mm_tlb_gen(mm);
>
> /* Should we flush just the requested range? */
> if ((end != TLB_FLUSH_ALL) &&
> !(vmflag & VM_HUGETLB) &&
> ((end - start) >> PAGE_SHIFT) <= tlb_single_page_flush_ceiling) {
> info.start = start;
> info.end = end;
> } else {
> info.start = 0UL;
> info.end = TLB_FLUSH_ALL;
> }
>
> if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> VM_WARN_ON(irqs_disabled());
> local_irq_disable();
> flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
> local_irq_enable();
> }
>
> - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> + if (!(vmflag & VM_LOCAL_CPU) &&
> + cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> flush_tlb_others(mm_cpumask(mm), &info);
>
I have been tracing the mm_cpumask(vma->vm_mm) at my driver at
different points. At vma creation (file_operations->mmap()),
and before the call to insert_pfn (which calls here).
At the beginning I was wishful thinking that the mm_cpumask(vma->vm_mm)
should have a single bit set just as the affinity of the thread on
creation of that thread. But then I saw that at %80 of the times some
other random bits are also set.
Yes Random. Always the thread affinity (single) bit was set but
then zero one or two more bits were set as well. Never seen more then
two though, which baffles me a lot.
If it was like Matthew said .i.e the cpumask of the all process
then I would expect all the bits to be set. Because I have a thread
on each core. And also I would even expect that all vma->vm_mm
or maybe mm_cpumask(vma->vm_mm) to point to the same global object.
But it was not so. it was pointing to some thread unique object but
still those phantom bits were set all over. (And I am almost sure
same vma had those bits change over time)
So I would love some mm guy to explain where are those bits collected?
But I do not think this is a Kernel bug because as Matthew showed.
that vma above can and is allowed to leak memory addresses to other
threads / cores in the same process. So it appears that the Kernel
has some correct logic behind its madness.
Which brings me to another question. How can I find from
within a thread Say at the file_operations->mmap() call that the thread
is indeed core-pinned. What mm_cpumask should I inspect?
> put_cpu();
> }
Thanks
Boaz