Re: [RFC mm][PATCH 2/5] percpu cached mm counter

From: Minchan Kim
Date: Thu Dec 10 2009 - 19:40:19 EST


Hi, Kame.

It looks better than the older one. :)

On Thu, Dec 10, 2009 at 4:34 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>
> Now, mm's counter information is updated by atomic_long_xxx() functions if
> USE_SPLIT_PTLOCKS is defined. This causes cache misses when page faults happen
> simultaneously on plural cpus. (Almost all process-shared objects are...)
>
> Considering more detailed accounting of per-mm page usage, one of the problems
> is the cost of this counter.
>
> This patch implements a per-cpu mm counter cache. This per-cpu cache is loosely
> synchronized with mm's counters. The current design is:
>
> - prepare per-cpu object curr_mmc. curr_mmc contains a pointer to mm and
>   an array of counters.
> - At page fault,
>    * if curr_mmc.mm != NULL, update the curr_mmc counter.
>    * if curr_mmc.mm == NULL, fill curr_mmc.mm = current->mm and account 1.
> - At schedule()
>    * if curr_mmc.mm != NULL, synchronize and invalidate cached information.
>    * if curr_mmc.mm == NULL, nothing to do.
>
> By this:
> - no atomic ops, which tend to cache-miss, under the page table lock.
> - mm->counters are synchronized when schedule() is called.
> - no bad effect on the read side.
>
> Concern:
> - added cost to schedule().
>
> Micro Benchmark:
> measured the number of page faults with 2 threads on 2 sockets.
>
>  Before:
>   Performance counter stats for './multi-fault 2' (5 runs):
>
>        45122351  page-faults             ( +-  1.125% )
>       989608571  cache-references        ( +-  1.198% )
>       205308558  cache-misses            ( +-  0.159% )
>   29263096648639268  bus-cycles           ( +-  0.004% )
>
>    60.003427500  seconds time elapsed    ( +-  0.003% )
>
>  After:
>   Performance counter stats for './multi-fault 2' (5 runs):
>
>        46997471  page-faults             ( +-  0.720% )
>      1004100076  cache-references        ( +-  0.734% )
>       180959964  cache-misses            ( +-  0.374% )
>   29263437363580464  bus-cycles           ( +-  0.002% )
>
>    60.003315683  seconds time elapsed    ( +-  0.004% )
>
>  cache-misses per page fault are reduced from 4.55 to 3.85.
>
>  This microbenchmark doesn't exercise the usual behavior (page fault -> madvise(DONTNEED)),
>  but reducing the cache-miss cost sounds good to me even if the gain is very small.
>
> Changelog 2009/12/09:
> - loosely update curr_mmc.mm at the 1st page fault.
> - removed hooks in tick. (update_process_times)
> - exported curr_mmc and check curr_mmc.mm directly.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> ---
>  include/linux/mm.h       |   37 ++++++++++++++++++++++++++++
>  include/linux/mm_types.h |   12 +++++++++
>  kernel/exit.c            |    3 +-
>  kernel/sched.c           |    6 ++++
>  mm/memory.c              |   60 ++++++++++++++++++++++++++++++++++++++++-------
>  5 files changed, 108 insertions(+), 10 deletions(-)
>
> Index: mmotm-2.6.32-Dec8/include/linux/mm_types.h
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/include/linux/mm_types.h
> +++ mmotm-2.6.32-Dec8/include/linux/mm_types.h
> @@ -297,4 +297,16 @@ struct mm_struct {
>  /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
>  #define mm_cpumask(mm) (&(mm)->cpu_vm_mask)
>
> +#if USE_SPLIT_PTLOCKS
> +/*
> + * percpu object used for caching thread->mm information.
> + */
> +struct pcp_mm_cache {
> +	struct mm_struct *mm;
> +	unsigned long counters[NR_MM_COUNTERS];
> +};
> +
> +DECLARE_PER_CPU(struct pcp_mm_cache, curr_mmc);
> +#endif
> +
>  #endif /* _LINUX_MM_TYPES_H */
> Index: mmotm-2.6.32-Dec8/include/linux/mm.h
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/include/linux/mm.h
> +++ mmotm-2.6.32-Dec8/include/linux/mm.h
> @@ -883,7 +883,16 @@ static inline void set_mm_counter(struct
>
>  static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>  {
> -	return (unsigned long)atomic_long_read(&(mm)->counters[member]);
> +	long ret;
> +	/*
> +	 * Because this counter is loosely synchronized with percpu cached
> +	 * information, it's possible that value gets to be minus. For user's
> +	 * convenience/sanity, avoid returning minus.
> +	 */
> +	ret = atomic_long_read(&(mm)->counters[member]);
> +	if (unlikely(ret < 0))
> +		return 0;
> +	return (unsigned long)ret;
>  }

Now, your only sync point is at task-switch time.
So we can't get an exact number if a lot of mm counting happens
in a short time (i.e., before a context switch).
Doesn't that matter?
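
For example, a reader that wanted an exact value would have to fold in every
cpu's cached delta, roughly like the sketch below (get_mm_counter_exact() is
only my illustration, not part of your patch, and it still races with updates
on other cpus):

static unsigned long get_mm_counter_exact(struct mm_struct *mm, int member)
{
	long val = atomic_long_read(&mm->counters[member]);
	int cpu;

	/* fold in deltas still cached per cpu since the last schedule() */
	for_each_online_cpu(cpu) {
		struct pcp_mm_cache *mmc = &per_cpu(curr_mmc, cpu);

		if (mmc->mm == mm)
			val += (long)mmc->counters[member];
	}
	return val < 0 ? 0 : (unsigned long)val;
}

Of course walking all cpus on every read defeats the point of the cache, so
maybe the inexactness is acceptable; I just want to confirm your intention.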

>
>  static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
> @@ -900,6 +909,25 @@ static inline void dec_mm_counter(struct
>  {
>  	atomic_long_dec(&(mm)->counters[member]);
>  }
> +extern void __sync_mm_counters(struct mm_struct *mm);
> +/* Called under non-preemptable context, for syncing cached information */
> +static inline void sync_mm_counters_atomic(void)
> +{
> +	struct mm_struct *mm;
> +
> +	mm = percpu_read(curr_mmc.mm);
> +	if (mm) {
> +		__sync_mm_counters(mm);
> +		percpu_write(curr_mmc.mm, NULL);
> +	}
> +}
> +/* called at thread exit */
> +static inline void exit_mm_counters(void)
> +{
> +	preempt_disable();
> +	sync_mm_counters_atomic();
> +	preempt_enable();
> +}
>
>  #else	/* !USE_SPLIT_PTLOCKS */
>  /*
> @@ -931,6 +959,13 @@ static inline void dec_mm_counter(struct
>  	mm->counters[member]--;
>  }
>
> +static inline void sync_mm_counters_atomic(void)
> +{
> +}
> +
> +static inline void exit_mm_counters(void)
> +{
> +}
>  #endif /* !USE_SPLIT_PTLOCKS */
>
>  #define get_mm_rss(mm)					\
> Index: mmotm-2.6.32-Dec8/mm/memory.c
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/mm/memory.c
> +++ mmotm-2.6.32-Dec8/mm/memory.c
> @@ -121,6 +121,50 @@ static int __init init_zero_pfn(void)
>  }
>  core_initcall(init_zero_pfn);
>
> +#if USE_SPLIT_PTLOCKS
> +
> +DEFINE_PER_CPU(struct pcp_mm_cache, curr_mmc);
> +
> +void __sync_mm_counters(struct mm_struct *mm)
> +{
> +	struct pcp_mm_cache *mmc = &per_cpu(curr_mmc, smp_processor_id());
> +	int i;
> +
> +	for (i = 0; i < NR_MM_COUNTERS; i++) {

The cost here depends on NR_MM_COUNTERS.
It's low now, but we might add more counters to pcp_mm_cache later.
Then, if only a few of the many counters have changed, we wouldn't need to
loop over all of them; we could avoid that with a change flag in pcp_mm_cache.
But the compare/update overhead of such a flag is also ugly, so it would be
rather overkill for now. How about leaving a NOTE, e.g.:

/* NOTE:
 * We have to rethink how to reduce this overhead if we start to
 * add many counters to pcp_mm_cache.
 */
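
Just to make the change-flag idea concrete, I mean roughly the sketch below
(the 'dirty' bitmap field is only my illustration; the fast path would also
have to mark the touched counter, e.g. mmc->dirty |= 1UL << member):

struct pcp_mm_cache {
	struct mm_struct *mm;
	unsigned long dirty;	/* bit i set when counters[i] has a pending delta */
	unsigned long counters[NR_MM_COUNTERS];
};

void __sync_mm_counters(struct mm_struct *mm)
{
	struct pcp_mm_cache *mmc = &per_cpu(curr_mmc, smp_processor_id());
	int i;

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		/* skip counters that were never touched in this period */
		if (!(mmc->dirty & (1UL << i)))
			continue;
		atomic_long_add(mmc->counters[i], &mm->counters[i]);
		mmc->counters[i] = 0;
	}
	mmc->dirty = 0;
}

As I said, the extra bookkeeping on every fast-path update is probably not
worth it while NR_MM_COUNTERS is this small.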

> +		if (mmc->counters[i] != 0) {
> +			atomic_long_add(mmc->counters[i], &mm->counters[i]);
> +			mmc->counters[i] = 0;
> +		}
> +	}
> +	return;
> +}
> +/*
> + * This add_mm_counter_fast() works well only when it's expexted that

expexted => expected :)

> + * mm == current->mm. So, use of this function is limited under memory.c
> + * This add_mm_counter_fast() is called under page table lock.
> + */
> +static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
> +{
> +	struct mm_struct *cached = percpu_read(curr_mmc.mm);
> +
> +	if (likely(cached == mm)) { /* fast path */
> +		percpu_add(curr_mmc.counters[member], val);
> +	} else if (mm == current->mm) { /* 1st page fault in this period */
> +		percpu_write(curr_mmc.mm, mm);
> +		percpu_write(curr_mmc.counters[member], val);
> +	} else /* page fault via side-path context (get_user_pages()) */
> +		add_mm_counter(mm, member, val);
> +}
> +
> +#define inc_mm_counter_fast(mm, member)	add_mm_counter_fast(mm, member, 1)
> +#define dec_mm_counter_fast(mm, member)	add_mm_counter_fast(mm, member, -1)
> +#else
> +
> +#define inc_mm_counter_fast(mm, member)	inc_mm_counter(mm, member)
> +#define dec_mm_counter_fast(mm, member)	dec_mm_counter(mm, member)
> +
> +#endif
> +
>  /*
>  * If a p?d_bad entry is found while walking page tables, report
>  * the error, before resetting entry to p?d_none.  Usually (but
> @@ -1541,7 +1585,7 @@ static int insert_page(struct vm_area_st
>
>  	/* Ok, finally just insert the thing.. */
>  	get_page(page);
> -	inc_mm_counter(mm, MM_FILEPAGES);
> +	inc_mm_counter_fast(mm, MM_FILEPAGES);
>  	page_add_file_rmap(page);
>  	set_pte_at(mm, addr, pte, mk_pte(page, prot));
>
> @@ -2177,11 +2221,11 @@ gotten:
>  	if (likely(pte_same(*page_table, orig_pte))) {
>  		if (old_page) {
>  			if (!PageAnon(old_page)) {
> -				dec_mm_counter(mm, MM_FILEPAGES);
> -				inc_mm_counter(mm, MM_ANONPAGES);
> +				dec_mm_counter_fast(mm, MM_FILEPAGES);
> +				inc_mm_counter_fast(mm, MM_ANONPAGES);
>  			}
>  		} else
> -			inc_mm_counter(mm, MM_ANONPAGES);
> +			inc_mm_counter_fast(mm, MM_ANONPAGES);
>  		flush_cache_page(vma, address, pte_pfn(orig_pte));
>  		entry = mk_pte(new_page, vma->vm_page_prot);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> @@ -2614,7 +2658,7 @@ static int do_swap_page(struct mm_struct
>  	 * discarded at swap_free().
>  	 */
>
> -	inc_mm_counter(mm, MM_ANONPAGES);
> +	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	pte = mk_pte(page, vma->vm_page_prot);
>  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> @@ -2698,7 +2742,7 @@ static int do_anonymous_page(struct mm_s
>  	if (!pte_none(*page_table))
>  		goto release;
>
> -	inc_mm_counter(mm, MM_ANONPAGES);
> +	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	page_add_new_anon_rmap(page, vma, address);
>  setpte:
>  	set_pte_at(mm, address, page_table, entry);
> @@ -2852,10 +2896,10 @@ static int __do_fault(struct mm_struct *
>  		if (flags & FAULT_FLAG_WRITE)
>  			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  		if (anon) {
> -			inc_mm_counter(mm, MM_ANONPAGES);
> +			inc_mm_counter_fast(mm, MM_ANONPAGES);
>  			page_add_new_anon_rmap(page, vma, address);
>  		} else {
> -			inc_mm_counter(mm, MM_FILEPAGES);
> +			inc_mm_counter_fast(mm, MM_FILEPAGES);
>  			page_add_file_rmap(page);
>  			if (flags & FAULT_FLAG_WRITE) {
>  				dirty_page = page;
> Index: mmotm-2.6.32-Dec8/kernel/sched.c
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/kernel/sched.c
> +++ mmotm-2.6.32-Dec8/kernel/sched.c
> @@ -2858,6 +2858,7 @@ context_switch(struct rq *rq, struct tas
>  	trace_sched_switch(rq, prev, next);
>  	mm = next->mm;
>  	oldmm = prev->active_mm;
> +
>  	/*
>  	 * For paravirt, this is coupled with an exit in switch_to to
>  	 * combine the page table reload and the switch backend into
> @@ -5477,6 +5478,11 @@ need_resched_nonpreemptible:
>
>  	if (sched_feat(HRTICK))
>  		hrtick_clear(rq);
> +	/*
> +	 * sync/invaldidate per-cpu cached mm related information
> +	 * before taling rq->lock. (see include/linux/mm.h)

taling => taking

> +	 */
> +	sync_mm_counters_atomic();

This is my concern from above:
before the process schedules out, we could read stale info.
Is that not a realistic problem?


>
>  	spin_lock_irq(&rq->lock);
>  	update_rq_clock(rq);
> Index: mmotm-2.6.32-Dec8/kernel/exit.c
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/kernel/exit.c
> +++ mmotm-2.6.32-Dec8/kernel/exit.c
> @@ -942,7 +942,8 @@ NORET_TYPE void do_exit(long code)
>  		printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
>  				current->comm, task_pid_nr(current),
>  				preempt_count());
> -
> +	/* synchronize per-cpu cached mm related information before account */
> +	exit_mm_counters();
>  	acct_update_integrals(tsk);
>
>  	group_dead = atomic_dec_and_test(&tsk->signal->live);
>
>



--
Kind regards,
Minchan Kim