Re: [RFC PATCH v1 00/13] lru_lock scalability
From: Daniel Jordan
Date: Tue Feb 13 2018 - 16:07:42 EST
On 02/08/2018 06:36 PM, Andrew Morton wrote:
> On Wed, 31 Jan 2018 18:04:00 -0500 daniel.m.jordan@xxxxxxxxxx wrote:
> > lru_lock, a per-node* spinlock that protects an LRU list, is one of the
> > hottest locks in the kernel. On some workloads on large machines, it
> > shows up at the top of lock_stat.
>
> Do you have details on which callsites are causing the problem? That
> would permit us to consider other approaches, perhaps.
Sure, there are two paths where we're seeing contention.
In the first one, a pagevec's worth of anonymous pages are added to
various LRUs when the per-cpu pagevec fills up:
/* take an anonymous page fault, eventually end up at... */
handle_pte_fault
  do_anonymous_page
    lru_cache_add_active_or_unevictable
      lru_cache_add
        __lru_cache_add
          __pagevec_lru_add
            pagevec_lru_move_fn
              /* contend on lru_lock */
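To show where the lock actually gets taken on that path, here's a simplified
sketch of pagevec_lru_move_fn() as it looks in mm/swap.c around v4.15 (trimmed
and paraphrased, not a verbatim quote).  One lru_lock hold covers the whole
pagevec; the lock is only retaken when the next page belongs to a different
node:

    /*
     * Simplified sketch of pagevec_lru_move_fn() (mm/swap.c, ~v4.15);
     * trimmed and paraphrased, not verbatim.
     */
    static void pagevec_lru_move_fn(struct pagevec *pvec,
            void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
            void *arg)
    {
            struct pglist_data *pgdat = NULL;
            unsigned long flags = 0;
            int i;

            for (i = 0; i < pagevec_count(pvec); i++) {
                    struct page *page = pvec->pages[i];
                    struct pglist_data *pagepgdat = page_pgdat(page);

                    /* take lru_lock once per run of same-node pages */
                    if (pagepgdat != pgdat) {
                            if (pgdat)
                                    spin_unlock_irqrestore(&pgdat->lru_lock, flags);
                            pgdat = pagepgdat;
                            spin_lock_irqsave(&pgdat->lru_lock, flags);
                    }

                    (*move_fn)(page, mem_cgroup_page_lruvec(page, pgdat), arg);
            }
            if (pgdat)
                    spin_unlock_irqrestore(&pgdat->lru_lock, flags);
            release_pages(pvec->pages, pvec->nr);
            pagevec_reinit(pvec);
    }

Every CPU taking anonymous faults eventually drains its per-cpu pagevec through
this, so on a single node they all serialize on the same lru_lock.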
In the second, one or more pages are removed from an LRU under one hold
of lru_lock:
// userland calls munmap or exit, eventually end up at...
zap_pte_range
  __tlb_remove_page   // returns true because we eventually hit
                      // MAX_GATHER_BATCH_COUNT in tlb_next_batch
  tlb_flush_mmu_free
    free_pages_and_swap_cache
      release_pages
        /* contend on lru_lock */
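And here is roughly the lru_lock part of release_pages(), again a simplified
sketch of mm/swap.c around v4.15 with the refcounting corner cases and the
actual freeing trimmed out.  One lock hold covers a run of same-node pages, and
the lock is dropped every SWAP_CLUSTER_MAX pages to bound the IRQ-disabled hold
time:

    /*
     * Simplified sketch of the LRU removal in release_pages() (mm/swap.c,
     * ~v4.15); refcount corner cases and freeing trimmed, not verbatim.
     */
    void release_pages(struct page **pages, int nr)
    {
            LIST_HEAD(pages_to_free);
            struct pglist_data *locked_pgdat = NULL;
            unsigned long flags = 0;
            unsigned int lock_batch = 0;
            int i;

            for (i = 0; i < nr; i++) {
                    struct page *page = pages[i];

                    /* bound the IRQ-disabled lock hold time */
                    if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
                            spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
                            locked_pgdat = NULL;
                    }

                    if (!put_page_testzero(page))
                            continue;

                    if (PageLRU(page)) {
                            struct pglist_data *pgdat = page_pgdat(page);

                            /* one lock hold covers a run of same-node pages */
                            if (pgdat != locked_pgdat) {
                                    if (locked_pgdat)
                                            spin_unlock_irqrestore(&locked_pgdat->lru_lock,
                                                                   flags);
                                    lock_batch = 0;
                                    locked_pgdat = pgdat;
                                    spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
                            }
                            __ClearPageLRU(page);
                            del_page_from_lru_list(page,
                                            mem_cgroup_page_lruvec(page, locked_pgdat),
                                            page_off_lru(page));
                    }

                    list_add(&page->lru, &pages_to_free);
            }
            if (locked_pgdat)
                    spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);

            free_unref_page_list(&pages_to_free);
    }

So a large munmap or exit removes its pages from the LRU in batches, but all of
those batches still funnel through the same per-node lock.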
For broader context, we've run decision support benchmarks in which lru_lock
(and zone->lock) show long wait times.  But we're not the only ones seeing
this, judging by comments in the kernel source:
mm/vmscan.c:
* zone_lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
* and working on them outside the LRU lock.
*
* For pagecache intensive workloads, this function is the hottest
* spot in the kernel (apart from copy_*_user functions).
...
static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
include/linux/mmzone.h:
* zone->lock and the [pgdat->lru_lock] are two of the hottest locks in the
* kernel. So add a wild amount of padding here to ensure that they fall
* into separate cachelines. ...
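For what it's worth, the batching that the vmscan.c comment above describes
looks schematically like this.  This is a heavily trimmed sketch of
shrink_inactive_list() from mm/vmscan.c around v4.15, not a verbatim quote: a
batch of pages is isolated onto a private list under one lru_lock hold, the
expensive reclaim work runs with the lock dropped, and the survivors go back
under another short hold:

    /*
     * Schematic of the batch-under-lock pattern in shrink_inactive_list()
     * (mm/vmscan.c, ~v4.15); heavily trimmed, not verbatim.
     */
    static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
                    struct lruvec *lruvec, struct scan_control *sc,
                    enum lru_list lru)
    {
            LIST_HEAD(page_list);
            struct pglist_data *pgdat = lruvec_pgdat(lruvec);
            struct reclaim_stat stat = {};
            unsigned long nr_scanned, nr_reclaimed;

            spin_lock_irq(&pgdat->lru_lock);
            /* pull a batch off the LRU onto a private list in one lock hold */
            isolate_lru_pages(nr_to_scan, lruvec, &page_list, &nr_scanned,
                              sc, 0, lru);
            spin_unlock_irq(&pgdat->lru_lock);

            /* the expensive part runs with lru_lock dropped */
            nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0, &stat, false);

            spin_lock_irq(&pgdat->lru_lock);
            /* pages that couldn't be reclaimed go back under another short hold */
            putback_inactive_pages(lruvec, &page_list);
            spin_unlock_irq(&pgdat->lru_lock);

            return nr_reclaimed;
    }

That helps the reclaim path, but the fault and munmap/exit paths shown above
still take lru_lock directly for their adds and removes.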
Anyway, if you're seeing this lock in your workloads, I'm interested in
hearing what you're running so we can get more real-world data on this.