Re: [PATCH] list_lru: Prefetch neighboring list entries before acquiring lock

From: Dave Chinner
Date: Fri Dec 01 2017 - 17:03:19 EST


On Fri, Dec 01, 2017 at 09:14:52AM -0500, Waiman Long wrote:
> On 11/30/2017 07:09 PM, Minchan Kim wrote:
> > On Thu, Nov 30, 2017 at 12:47:36PM -0800, Andrew Morton wrote:
> >> On Thu, 30 Nov 2017 08:54:04 -0500 Waiman Long <longman@xxxxxxxxxx> wrote:
> >>
> >>>> And, from that perspective, the racy shortcut in the proposed patch
> >>>> is wrong, too. Prefetch is fine, but in general shortcutting list
> >>>> empty checks outside the internal lock isn't.
> >>> For the record, I add one more list_empty() check at the beginning of
> >>> list_lru_del() in the patch for 2 purpose:
> >>> 1. it allows the code to bail out early.
> >>> 2. It make sure the cacheline of the list_head entry itself is loaded.
> >>>
> >>> Other than that, I only add a likely() qualifier to the existing
> >>> list_empty() check within the lock critical region.
> >> But it sounds like Dave thinks that unlocked check should be removed?
> >>
> >> How does this adendum look?
> >>
> >> From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> >> Subject: list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
> >>
> >> include prefetch.h, remove unlocked list_empty() test, per Dave
> >>
> >> Cc: Dave Chinner <david@xxxxxxxxxxxxx>
> >> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> >> Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
> >> Cc: Waiman Long <longman@xxxxxxxxxx>
> >> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> >> ---
> >>
> >> mm/list_lru.c | 5 ++---
> >> 1 file changed, 2 insertions(+), 3 deletions(-)
> >>
> >> diff -puN mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix mm/list_lru.c
> >> --- a/mm/list_lru.c~list_lru-prefetch-neighboring-list-entries-before-acquiring-lock-fix
> >> +++ a/mm/list_lru.c
> >> @@ -8,6 +8,7 @@
> >> #include <linux/module.h>
> >> #include <linux/mm.h>
> >> #include <linux/list_lru.h>
> >> +#include <linux/prefetch.h>
> >> #include <linux/slab.h>
> >> #include <linux/mutex.h>
> >> #include <linux/memcontrol.h>
> >> @@ -135,13 +136,11 @@ bool list_lru_del(struct list_lru *lru,
> >> /*
> >> * Prefetch the neighboring list entries to reduce lock hold time.
> >> */
> >> - if (unlikely(list_empty(item)))
> >> - return false;
> >> prefetchw(item->prev);
> >> prefetchw(item->next);
> >>
> >> spin_lock(&nlru->lock);
> >> - if (likely(!list_empty(item))) {
> >> + if (!list_empty(item)) {
> >> l = list_lru_from_kmem(nlru, item);
> >> list_del_init(item);
> >> l->nr_items--;
> > If we cannot guarantee it's likely !list_empty, prefetch with NULL pointer
> > would be harmful by the lesson we have learned.
> >
> > https://lwn.net/Articles/444336/
>
> FYI, when list_empty() is true, it just mean the links are pointing to
> list entry itself. The pointers will never be NULL. So that won't cause
> the NULL prefetch problem mentioned in the article.

Sure, but that misses the larger point of the article in that there
are many unpredictable side effects of adding prefetches, the least
of which is that the result is CPU specific. Some CPUs improve,
others regress, and there's no predicting which side of the ledger
any given CPU falls on.

So from that perspective, we should consider manual prefetching
harmful, similar to the way that likely/unlikely is generally
considered harmful. i.e. it's pretty much impossible for a
programmer to get right in all situations.

>
> > So, with considering list_lru_del is generic library, it cannot see
> > whether a workload makes heavy lock contentions or not.
> > Maybe, right place for prefetching would be in caller, not in library
> > itself.
>
> Yes, the prefetch operations will add some overhead to the whole
> deletion operation when the lock isn't contended, but that is usually
> rather small compared with the atomic ops involved in the locking
> operation itself. On the other hand, the performance gain will be
> noticeable when the lock is contended. I will ran some performance
> measurement and report the results later.

Given the extreme fuzziness of the benefit of manual prefetching,
you'll need to:
a) document the test
b) run the test on mulitple architectures
c) run the test on mulitple different CPU models within
the x86-64 architecture
d) show that it works in general for a majority of the
platforms and CPUs you benched, and not just for your
microbenchmark.

Basically, if there's historic context with people like Ingo saying
"prefetching is toxic" then there's a bloody high bar you need to
get over to add manual prefetching anywhere...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx