Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

From: John Hubbard
Date: Fri Jun 04 2021 - 13:58:20 EST

Next message: Kees Cook: "Re: [PATCH v3 1/1] pgo: Fix allocate_node() v2"
Previous message: Linus Torvalds: "Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression"
In reply to: Linus Torvalds: "Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression"
Next in thread: Feng Tang: "Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 6/4/21 12:52 AM, Feng Tang wrote:
...

The perf data doesn't even mention any of the GUP paths, and on the
pure fork path the biggest impact would be:

(a) maybe "struct mm_struct" changed in size or had a different cache layout

Yes, this seems to be the cause of the regression.

The test case is many thread are doing map/unmap at the same time,
so the process's rw_semaphore 'mmap_lock' is highly contended.

Before the patch (with 0day's kconfig), the mmap_lock is separated
into 2 cachelines, the 'count' is in one line, and the other members
sit in the next line, so it luckily avoid some cache bouncing. After

Wow! That's quite a fortunate layout to land on by accident. Almost
makes me wonder if mmap_lock should be designed to do that, but it's
probably even better to just keep working on having a less contended
mmap_lock.

I *suppose* it's worth trying to keep this fragile layout in place,
but it is a landmine for anyone who touches mm_struct. And the struct
is so large already that I'm not sure a comment warning would even
be noticed. Anyway...

the patch, the 'mmap_lock' is pushed into one cacheline, which may
cause the regression.

Below is the pahole info:

- before the patch

spinlock_t page_table_lock; /* 116 4 */
struct rw_semaphore mmap_lock; /* 120 40 */
/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
struct list_head mmlist; /* 160 16 */
long unsigned int hiwater_rss; /* 176 8 */

- after the patch

spinlock_t page_table_lock; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
struct rw_semaphore mmap_lock; /* 128 40 */
struct list_head mmlist; /* 168 16 */
long unsigned int hiwater_rss; /* 184 8 */

perf c2c log can also confirm this.

We've tried some patch, which can restore the regerssion. As the
newly added member 'write_protect_seq' is 4 bytes long, and putting
it into an existing 4 bytes long hole can restore the regeression,
while not affecting most of other member's alignment. Please review
the following patch, thanks!

So, this is a neat little solution, if we agree that it's worth "fixing".

I'm definitely on the fence, but leaning toward, "go for it", because
I like the "no cache effect" result of using up the hole.

Reviewed-by: John Hubbard <jhubbard@xxxxxxxxxx>

thanks,
--
John Hubbard
NVIDIA

- Feng

From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@xxxxxxxxx>
Date: Fri, 4 Jun 2021 15:20:57 +0800
Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct

Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
racing with COW during fork), on 64bits system, the hot member
rw_semaphore 'mmap_lock' of 'mm_struct' could be separated into
2 cachelines, that its member 'count' sits in one cacheline while
all other members in next cacheline, this naturally reduces some
cache bouncing, and with the commit, the 'mmap_lock' is pushed
into one cacheline, as shown in the pahole info:

- before the commit

spinlock_t page_table_lock; /* 116 4 */
struct rw_semaphore mmap_lock; /* 120 40 */
/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
struct list_head mmlist; /* 160 16 */
long unsigned int hiwater_rss; /* 176 8 */

- after the commit

spinlock_t page_table_lock; /* 124 4 */
/* --- cacheline 2 boundary (128 bytes) --- */
struct rw_semaphore mmap_lock; /* 128 40 */
struct list_head mmlist; /* 168 16 */
long unsigned int hiwater_rss; /* 184 8 */

and it causes one 9.2% regression for 'mmap1' case of will-it-scale
benchmark[1], as in the case 'mmap_lock' is highly contented (occupies
90%+ cpu cycles).

Though relayouting a structure could be a double-edged sword, as it
helps some case, but may hurt other cases. So one solution is the
newly added 'seqcount_t' is 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n),
placing it into an existing 4 bytes hole in 'mm_struct' will not
affect most of other members's alignment, while restoring the
regression.

[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
Signed-off-by: Feng Tang <feng.tang@xxxxxxxxx>
---
include/linux/mm_types.h | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..5b55f88 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,13 +445,6 @@ struct mm_struct {
*/
atomic_t has_pinned;
- /**
- * @write_protect_seq: Locked when any thread is write
- * protecting pages mapped by this mm to enforce a later COW,
- * for instance during page table copying for fork().
- */
- seqcount_t write_protect_seq;
-
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* PTE page table pages */
#endif
@@ -480,7 +473,15 @@ struct mm_struct {
unsigned long stack_vm; /* VM_STACK */
unsigned long def_flags;
+ /**
+ * @write_protect_seq: Locked when any thread is write
+ * protecting pages mapped by this mm to enforce a later COW,
+ * for instance during page table copying for fork().
+ */
+ seqcount_t write_protect_seq;
+
spinlock_t arg_lock; /* protect the below fields */
+
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;

Next message: Kees Cook: "Re: [PATCH v3 1/1] pgo: Fix allocate_node() v2"
Previous message: Linus Torvalds: "Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression"
In reply to: Linus Torvalds: "Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression"
Next in thread: Feng Tang: "Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]