Re: [lttng-dev] [-stable 3.8.1 performance regression] madvisePOSIX_FADV_DONTNEED

From: Mathieu Desnoyers
Date: Mon Jun 17 2013 - 22:15:37 EST


* Andrew Morton (akpm@xxxxxxxxxxxxxxxxxxxx) wrote:
> On Mon, 17 Jun 2013 17:39:36 -0400 Rapha__l Beamonte <raphael.beamonte@xxxxxxxxx> wrote:
>
> > 2013/6/17 Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> >
> > > That change wasn't terribly efficient - if there are any unpopulated
> > > pages in the range (which is quite likely), fadvise() will now always
> > > call invalidate_mapping_pages() a second time.
> > >
> > > Perhaps this is fixable. Say, make lru_add_drain_all() return a
> > > success code, or even teach lru_add_drain_all() to return a code
> > > indicating that one of the spilled pages was (or might have been) on a
> > > particular mapping.
> > >
> >
> > Following our tests results, that was the call to lru_add_drain_all() that
> > causes the problem. The second call to invalidate_mapping_pages() isn't
> > really important. We tried to compile a kernel with the commit introducing
> > this change but with the "lru_add_drain_all()" line removed, and the
> > problem disappeared, even if we called two times invalidate_mapping_pages()
> > (as the rest of the commit was still here).
>
> Ah, OK, schedule_on_each_cpu() could certainly do that - it has to wait
> for every CPU to context switch and schedule the worker function.
>
> There's a lot we could do here. Such as not doing the schedule_work()
> at all for a cpu which has an empty lru_add_pvec.

First approach proposed, submitted as RFC. Compile-tested only.

---
mm/swap.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)

Index: linux/mm/swap.c
===================================================================
--- linux.orig/mm/swap.c
+++ linux/mm/swap.c
@@ -652,7 +652,40 @@ static void lru_add_drain_per_cpu(struct
*/
int lru_add_drain_all(void)
{
- return schedule_on_each_cpu(lru_add_drain_per_cpu);
+ int cpu;
+ struct work_struct __percpu *works;
+
+ works = alloc_percpu(struct work_struct);
+ if (!works)
+ return -ENOMEM;
+
+ get_online_cpus();
+
+ for_each_online_cpu(cpu) {
+ struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+ struct work_struct *work = per_cpu_ptr(works, cpu);
+ struct pagevec *pvec;
+ int lru;
+
+ INIT_WORK(work, lru_add_drain_per_cpu);
+ for_each_lru(lru) {
+ pvec = &pvecs[lru - LRU_BASE];
+ if (pagevec_count(pvec)) {
+ schedule_work_on(cpu, work);
+ break;
+ }
+ }
+ }
+
+ for_each_online_cpu(cpu) {
+ struct work_struct *work = per_cpu_ptr(works, cpu);
+ if (work_pending(work))
+ flush_work(work);
+ }
+
+ put_online_cpus();
+ free_percpu(works);
+ return 0;
}

/*

> Or even pass down
> the address_space and only schedule the work for CPUs which have a page
> from *this mapping* in their lru_add_pvec. That will all be highly
> racy, but as long as the failure mode is "flushed unnecessarily" then
> that's OK.

Second approach, submitted as RFC with some questions left unanswered
in the code. Compile-tested only.

---
include/linux/swap.h | 1
mm/fadvise.c | 2 -
mm/swap.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 64 insertions(+), 1 deletion(-)

Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h
+++ linux/include/linux/swap.h
@@ -242,6 +242,7 @@ extern void mark_page_accessed(struct pa
extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern int lru_add_drain_all(void);
+extern int lru_add_drain_mapping(struct address_space *mapping);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_page(struct page *page);
extern void swap_setup(void);
Index: linux/mm/swap.c
===================================================================
--- linux.orig/mm/swap.c
+++ linux/mm/swap.c
@@ -689,6 +689,68 @@ int lru_add_drain_all(void)
}

/*
+ * Returns 0 for success
+ */
+int lru_add_drain_mapping(struct address_space *mapping)
+{
+ int cpu;
+ struct work_struct __percpu *works;
+
+ works = alloc_percpu(struct work_struct);
+ if (!works)
+ return -ENOMEM;
+
+ get_online_cpus();
+
+ for_each_online_cpu(cpu) {
+ struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+ struct work_struct *work = per_cpu_ptr(works, cpu);
+ struct pagevec *pvec;
+ int lru;
+
+ INIT_WORK(work, lru_add_drain_per_cpu);
+ for_each_lru(lru) {
+ int i;
+
+ pvec = &pvecs[lru - LRU_BASE];
+ /*
+ * Race failure mode is to flush unnecessarily.
+ * We use PAGEVEC_SIZE rather than pvec->nr to
+ * stay on the safe-side wrt pvec resize.
+ */
+ for (i = 0; i < PAGEVEC_SIZE; i++) {
+ struct page *page;
+
+ /*
+ * Racy access to page. TODO: is it OK
+ * to access it from the remote CPU's
+ * lru without any kind of ownership or
+ * synchronization ?
+ */
+ page = ACCESS_ONCE(pvec->pages[i]);
+ if (!pvec->pages[i])
+ continue;
+ if (page->mapping == mapping) {
+ schedule_work_on(cpu, work);
+ goto next_cpu;
+ }
+ }
+ }
+ next_cpu: ; /* TODO: coding ugliness */
+ }
+
+ for_each_online_cpu(cpu) {
+ struct work_struct *work = per_cpu_ptr(works, cpu);
+ if (work_pending(work))
+ flush_work(work);
+ }
+
+ put_online_cpus();
+ free_percpu(works);
+ return 0;
+}
+
+/*
* Batched page_cache_release(). Decrement the reference count on all the
* passed pages. If it fell to zero then remove the page from the LRU and
* free it.
Index: linux/mm/fadvise.c
===================================================================
--- linux.orig/mm/fadvise.c
+++ linux/mm/fadvise.c
@@ -132,7 +132,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, l
* pagevecs and try again.
*/
if (count < (end_index - start_index + 1)) {
- lru_add_drain_all();
+ lru_add_drain_mapping(mapping);
invalidate_mapping_pages(mapping, start_index,
end_index);
}

Feedback is welcome,

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/