Re: [PATCH] mm: cache largest vma
From: Davidlohr Bueso
Date: Sun Nov 03 2013 - 23:20:23 EST
On Sun, 2013-11-03 at 11:12 +0100, Ingo Molnar wrote:
> * Davidlohr Bueso <davidlohr@xxxxxx> wrote:
>
> > While caching the last used vma already does a nice job of avoiding
> > having to iterate the rbtree in find_vma, we can improve on it. After
> > studying the hit rate across a range of workloads and environments,
> > it was seen to be roughly constant at around 45-50% for a standard
> > desktop system (gnome3 + evolution + firefox + a few xterms), for
> > multiple Java-related workloads (including Hadoop/terasort), and for
> > aim7, which indicates it's better than the 35% value documented
> > in the code.
> >
> > By also caching the largest vma, that is, the one that contains
> > the most addresses, there is a steady 10-15% hit rate gain, putting
> > it above the 60% region. This improvement comes at a very low
> > overhead for a miss. Furthermore, systems with !CONFIG_MMU keep
> > the current logic.
> >
> > This patch introduces a second mmap_cache pointer, which is just
> > as racy as the first, but, as we already know, that doesn't matter
> > in this context. For documentation purposes, I have also added
> > ACCESS_ONCE() around the mm->mmap_cache updates, keeping them
> > consistent with the reads.
> >
> > Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> > Cc: Michel Lespinasse <walken@xxxxxxxxxx>
> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > Cc: Mel Gorman <mgorman@xxxxxxx>
> > Cc: Rik van Riel <riel@xxxxxxxxxx>
> > Cc: Guan Xuetao <gxt@xxxxxxxxxxxxxxx>
> > Signed-off-by: Davidlohr Bueso <davidlohr@xxxxxx>
> > ---
> > Please note that the nommu and unicore32 arch paths are *untested*.
> >
> > I also have a patch on top of this one that caches the most
> > used vma, which adds another 8-10% hit rate gain. However,
> > since it adds a counter to the vma structure and we have
> > to do more logic in find_vma to keep track of it, I was hesitant
> > about the overhead. If folks are interested I can send that out as well.
>
> Would be interesting to see.
>
> Btw., roughly how many cycles/instructions do we save by increasing the
> hit rate, in the typical case (for example during a kernel build)?
Good point. The IPC from perf stat doesn't show any difference with or
without the patch -- though note that a kernel build is probably the
least interesting workload here, as we already get a really nice hit
rate with the single mmap_cache. I have yet to try it on the other
workloads.
>
> That would be important to measure, so that we can get a ballpark figure
> for the cost/benefit equation.
>
> > Documentation/vm/locking | 4 +-
> > arch/unicore32/include/asm/mmu_context.h | 2 +-
> > include/linux/mm.h | 13 ++++++
> > include/linux/mm_types.h | 15 ++++++-
> > kernel/debug/debug_core.c | 17 +++++++-
> > kernel/fork.c | 2 +-
> > mm/mmap.c | 68 ++++++++++++++++++++------------
> > 7 files changed, 87 insertions(+), 34 deletions(-)
> >
> > diff --git a/Documentation/vm/locking b/Documentation/vm/locking
> > index f61228b..b4e8154 100644
> > --- a/Documentation/vm/locking
> > +++ b/Documentation/vm/locking
> > @@ -42,8 +42,8 @@ The rules are:
> > for mm B.
> >
> > The caveats are:
> > -1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
> > -The update of mmap_cache is racy (page stealer can race with other code
> > +1. find_vma() makes use of, and updates, the mmap_cache pointer hints.
> > +The updates of mmap_cache are racy (page stealer can race with other code
> > that invokes find_vma with mmap_sem held), but that is okay, since it
> > is a hint. This can be fixed, if desired, by having find_vma grab the
> > page_table_lock.
> > diff --git a/arch/unicore32/include/asm/mmu_context.h b/arch/unicore32/include/asm/mmu_context.h
> > index fb5e4c6..38cc7fc 100644
> > --- a/arch/unicore32/include/asm/mmu_context.h
> > +++ b/arch/unicore32/include/asm/mmu_context.h
> > @@ -73,7 +73,7 @@ do { \
> > else \
> > mm->mmap = NULL; \
> > rb_erase(&high_vma->vm_rb, &mm->mm_rb); \
> > - mm->mmap_cache = NULL; \
> > + vma_clear_caches(mm); \
> > mm->map_count--; \
> > remove_vma(high_vma); \
> > } \
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 8b6e55e..2c0f8ed 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1534,8 +1534,21 @@ static inline void mm_populate(unsigned long addr, unsigned long len)
> > /* Ignore errors */
> > (void) __mm_populate(addr, len, 1);
> > }
> > +
> > +static inline void vma_clear_caches(struct mm_struct *mm)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < NR_VMA_CACHES; i++)
> > + mm->mmap_cache[i] = NULL;
>
> Just curious: does GCC manage to open-code this as two stores of NULL?
>
> > +}
> > #else
> > static inline void mm_populate(unsigned long addr, unsigned long len) {}
> > +
> > +static inline void vma_clear_caches(struct mm_struct *mm)
> > +{
> > + mm->mmap_cache = NULL;
> > +}
> > #endif
> >
> > /* These take the mm semaphore themselves */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index d9851ee..7f92835 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -322,12 +322,23 @@ struct mm_rss_stat {
> > atomic_long_t count[NR_MM_COUNTERS];
> > };
> >
> > +
> > +#ifdef CONFIG_MMU
> > +enum {
> > + VMA_LAST_USED, /* last find_vma result */
> > + VMA_LARGEST, /* vma that contains the most addresses */
> > + NR_VMA_CACHES
> > +};
> > +#endif
> > +
> > struct kioctx_table;
> > struct mm_struct {
> > struct vm_area_struct * mmap; /* list of VMAs */
> > struct rb_root mm_rb;
> > - struct vm_area_struct * mmap_cache; /* last find_vma result */
> > -#ifdef CONFIG_MMU
> > +#ifndef CONFIG_MMU
> > + struct vm_area_struct *mmap_cache; /* last find_vma result */
> > +#else
> > + struct vm_area_struct *mmap_cache[NR_VMA_CACHES];
>
> I think the CONFIG_MMU asymmetry in the data structure is rather ugly.
>
> Why not make it a single-entry enum in the !CONFIG_MMU case? To the
> compiler a single-entry array should be the same as a pointer field.
>
> That would eliminate most of the related #ifdefs AFAICS.
Yes, that's a lot better.
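Something like this, I imagine (untested sketch):

#ifdef CONFIG_MMU
enum {
	VMA_LAST_USED,	/* last find_vma result */
	VMA_LARGEST,	/* vma spanning the most addresses */
	NR_VMA_CACHES
};
#else
enum {
	VMA_LAST_USED,	/* last find_vma result */
	NR_VMA_CACHES
};
#endif

and then, unconditionally, in struct mm_struct:

	struct vm_area_struct *mmap_cache[NR_VMA_CACHES];

That way vma_clear_caches() and friends can just loop over the array
in both configurations.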
>
> > unsigned long (*get_unmapped_area) (struct file *filp,
> > unsigned long addr, unsigned long len,
> > unsigned long pgoff, unsigned long flags);
> > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
> > index 0506d44..d9d72e4 100644
> > --- a/kernel/debug/debug_core.c
> > +++ b/kernel/debug/debug_core.c
> > @@ -221,13 +221,26 @@ int __weak kgdb_skipexception(int exception, struct pt_regs *regs)
> > */
> > static void kgdb_flush_swbreak_addr(unsigned long addr)
> > {
> > + struct mm_struct *mm = current->mm;
> > if (!CACHE_FLUSH_IS_SAFE)
> > return;
> >
> > - if (current->mm && current->mm->mmap_cache) {
> > - flush_cache_range(current->mm->mmap_cache,
> > +#ifdef CONFIG_MMU
> > + if (mm) {
> > + int i;
> > +
> > + for (i = 0; i < NR_VMA_CACHES; i++)
> > + if (mm->mmap_cache[i])
> > + flush_cache_range(mm->mmap_cache[i],
> > + addr,
> > + addr + BREAK_INSTR_SIZE);
>
> (Nit: please use curly braces for such multi-line statements.)
>
> > + }
> > +#else
> > + if (mm && mm->mmap_cache) {
> > + flush_cache_range(mm->mmap_cache,
> > addr, addr + BREAK_INSTR_SIZE);
> > }
> > +#endif
>
> Btw., this #ifdef would be unified with my suggested data structure
> variant as well.
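Right, with the array in both configurations this whole function body
collapses to something like (sketch):

	if (mm) {
		int i;

		/* flush any cached vma that might span the breakpoint */
		for (i = 0; i < NR_VMA_CACHES; i++) {
			if (mm->mmap_cache[i])
				flush_cache_range(mm->mmap_cache[i], addr,
						  addr + BREAK_INSTR_SIZE);
		}
	}

with no #ifdef left at all.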
>
> > /* Force flush instruction cache if it was outside the mm */
> > flush_icache_range(addr, addr + BREAK_INSTR_SIZE);
> > }
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 086fe73..7b92666 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -363,8 +363,8 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >
> > mm->locked_vm = 0;
> > mm->mmap = NULL;
> > - mm->mmap_cache = NULL;
> > mm->map_count = 0;
> > + vma_clear_caches(mm);
> > cpumask_clear(mm_cpumask(mm));
> > mm->mm_rb = RB_ROOT;
> > rb_link = &mm->mm_rb.rb_node;
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 9d54851..29c3fc0 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -676,14 +676,17 @@ static inline void
> > __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
> > struct vm_area_struct *prev)
> > {
> > + int i;
> > struct vm_area_struct *next;
> >
> > vma_rb_erase(vma, &mm->mm_rb);
> > prev->vm_next = next = vma->vm_next;
> > if (next)
> > next->vm_prev = prev;
> > - if (mm->mmap_cache == vma)
> > - mm->mmap_cache = prev;
> > +
> > + for (i = 0; i < NR_VMA_CACHES; i++)
> > + if (mm->mmap_cache[i] == vma)
> > + mm->mmap_cache[i] = prev;
>
> (Nit: missing curly braces.)
>
> Also, I don't think setting the cache value back to 'prev' is valid in the
> VMA_LARGEST case. The likelihood that it's the second largest VMA is
> remote.
>
> The right action here would be to set it to NULL.
>
> For VMA_LAST_USED setting it to 'prev' seems justified.
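Makes sense. The invalidation would then become something like (sketch):

	for (i = 0; i < NR_VMA_CACHES; i++) {
		if (mm->mmap_cache[i] == vma)
			mm->mmap_cache[i] =
				(i == VMA_LAST_USED) ? prev : NULL;
	}

so the last-used slot falls back to 'prev' while the largest slot is
simply dropped.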
>
> > }
> >
> > /*
> > @@ -1972,34 +1975,47 @@ EXPORT_SYMBOL(get_unmapped_area);
> > /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
> > struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > {
> > + unsigned long currlen = 0;
>
> (Nit: I don't think 'currlen' really explains the role of the variable.
> 'max_len' would be better?)
>
> > + struct rb_node *rb_node;
> > struct vm_area_struct *vma = NULL;
> >
> > - /* Check the cache first. */
> > - /* (Cache hit rate is typically around 35%.) */
> > - vma = ACCESS_ONCE(mm->mmap_cache);
> > - if (!(vma && vma->vm_end > addr && vma->vm_start <= addr)) {
> > - struct rb_node *rb_node;
> > + /* Check the cache first */
> > + vma = ACCESS_ONCE(mm->mmap_cache[VMA_LAST_USED]);
> > + if (vma && vma->vm_end > addr && vma->vm_start <= addr)
> > + goto ret;
> >
> > - rb_node = mm->mm_rb.rb_node;
> > - vma = NULL;
> > + vma = ACCESS_ONCE(mm->mmap_cache[VMA_LARGEST]);
> > + if (vma) {
> > + if (vma->vm_end > addr && vma->vm_start <= addr)
> > + goto ret;
> > + currlen = vma->vm_end - vma->vm_start;
> > + }
> >
> > - while (rb_node) {
> > - struct vm_area_struct *vma_tmp;
> > -
> > - vma_tmp = rb_entry(rb_node,
> > - struct vm_area_struct, vm_rb);
> > -
> > - if (vma_tmp->vm_end > addr) {
> > - vma = vma_tmp;
> > - if (vma_tmp->vm_start <= addr)
> > - break;
> > - rb_node = rb_node->rb_left;
> > - } else
> > - rb_node = rb_node->rb_right;
> > - }
> > - if (vma)
> > - mm->mmap_cache = vma;
> > + /* Bad cache! iterate rbtree */
>
> (Nit: the cache is not 'bad', we just didn't hit it.)
>
> > + rb_node = mm->mm_rb.rb_node;
> > + vma = NULL;
> > +
> > + while (rb_node) {
> > + struct vm_area_struct *vma_tmp;
> > +
> > + vma_tmp = rb_entry(rb_node,
> > + struct vm_area_struct, vm_rb);
>
> (Nit: in such cases a single, slightly-longer-than-80col line is IMHO a
> better solution than such an artificial line-break.)
>
> > +
> > + if (vma_tmp->vm_end > addr) {
> > + vma = vma_tmp;
> > + if (vma_tmp->vm_start <= addr)
> > + break;
> > + rb_node = rb_node->rb_left;
> > + } else
> > + rb_node = rb_node->rb_right;
>
> (Nit: unbalanced curly braces.)
>
> > + }
> > +
> > + if (vma) {
> > + ACCESS_ONCE(mm->mmap_cache[VMA_LAST_USED]) = vma;
> > + if (vma->vm_end - vma->vm_start > currlen)
> > + ACCESS_ONCE(mm->mmap_cache[VMA_LARGEST]) = vma;
>
> Would it make sense to not update VMA_LAST_USED if VMA_LARGEST is set?
>
> This would have the advantage of increasing the cache size to two, for the
> common case where there's two vmas used most of the time.
>
> To maximize the hit rate in the general case what we basically want to
> have is an LRU cache, weighted by vma size.
>
> Maybe expressing it all in that fashion and looking at the hit rate with
> 1, 2, 3 and 4 entries would give us equivalent (or better!) behavior than
> your open-coded variant, along with a better idea of how to size it
> precisely.
>
> Note that that approach would get rid of the VMA_LAST_USED/VMA_LARGEST
> distinction in a natural fashion.
>
> Obviously, if the LRU logic gets too complex then it probably won't bring
> us any benefits compared to a primitive front-entry cache, so all this is
> a delicate balance ... hence my previous question about
> cycles/instructions saved by hitting the cache.
Will try this, thanks for the suggestions.
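If I'm reading your VMA_LAST_USED suggestion correctly, the update in
find_vma() would become something like this (sketch, with currlen
renamed to max_len per your earlier nit):

	if (vma) {
		if (vma->vm_end - vma->vm_start > max_len)
			ACCESS_ONCE(mm->mmap_cache[VMA_LARGEST]) = vma;
		else
			ACCESS_ONCE(mm->mmap_cache[VMA_LAST_USED]) = vma;
	}

so a vma that qualifies as largest doesn't also evict the last-used
entry, effectively giving us two useful slots.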
Btw, do you suggest using a high-level tool such as perf for getting
this data, or sprinkling get_cycles() in find_vma()? I'd think that the
former isn't fine-grained enough, while the latter will probably vary a
lot from run to run, but the ratio should be rather constant.
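For the get_cycles() variant I was thinking of something as simple as
this (sketch; __find_vma and the per-cpu counters are hypothetical,
with the counters dumped via debugfs or a printk later):

	DEFINE_PER_CPU(u64, find_vma_cycles);
	DEFINE_PER_CPU(u64, find_vma_calls);

	struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
	{
		struct vm_area_struct *vma;
		cycles_t start = get_cycles();

		vma = __find_vma(mm, addr);	/* the actual lookup */

		/* accumulate cost and call count for this cpu */
		this_cpu_add(find_vma_cycles, (u64)(get_cycles() - start));
		this_cpu_inc(find_vma_calls);

		return vma;
	}

and then comparing the average cycles per call with and without the
patch.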
Thanks,
Davidlohr