Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

From: Feng Tang
Date: Mon Jun 07 2021 - 02:06:17 EST

Next message: lijian_8010a29: "[PATCH] fs: read_write: Fix a typo"
Previous message: Krzysztof Kozlowski: "Re: [PATCH v6 08/14] memory: tegra: Enable compile testing for all drivers"
In reply to: Waiman Long: "Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression"
Next in thread: Linus Torvalds: "Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sun, Jun 06, 2021 at 12:20:46PM -0700, Linus Torvalds wrote:
> [ Adding Waiman Long to the participants, because this seems to be a
> very specific cacheline alignment behavior of rwsems, maybe Waiman has
> some comments ]
>
> On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> >
> > * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
> > data structure change
> >
> > - old kernel
> >
> > - first cacheline
> > mmap_lock->count (75%)
> > mm->mapcount (14%)
> >
> > - second cacheline
> > mmap_lock->owner (97%)
> >
> > - new kernel
> >
> > mainly in the cacheline of 'mmap_lock'
> >
> > mmap_lock->count (~2%)
> > mmap_lock->owner (95%)
>
> Oooh.
>
> It looks like pretty much all the contention is on mmap_lock, and the
> difference is that the old kernel just _happened_ to split the
> mmap_lock rwsem at *exactly* the right place.
>
> The rw_semaphore structure looks like this:
>
> struct rw_semaphore {
> atomic_long_t count;
> atomic_long_t owner;
> struct optimistic_spin_queue osq; /* spinner MCS lock */
> ...
>
> and before the addition of the 'write_protect_seq' field, the mmap_sem
> was at offset 120 in 'struct mm_struct'.
>
> Which meant that count and owner were in two different cachelines, and
> then when you have contention and spend time in
> rwsem_down_write_slowpath(), this is probably *exactly* the kind of
> layout you want.
>
> Because first the rwsem_write_trylock() will do a cmpxchg on the first
> cacheline (for the optimistic fast-path), and then in the case of
> contention, rwsem_down_write_slowpath() will just access the second
> cacheline.
>
> Which is probably just optimal for a load that spends a lot of time
> contended - new waiters touch that first cacheline, and then they
> queue themselves up on the second cacheline. Waiman, does that sound
> believable?
>
> Anyway, I'm certainly ok with the patch that just moves
> 'write_protect_seq' down, it might be worth commenting about how this
> is about some very special cache layout of the mmap_sem part of the
> structure.
>
> That said, this means that it all is very subtle dependent on a lot of
> kernel config options, and I'm not sure how relevant the exact kernel
> options are that the test robot has been using. But even if this is
> just a "kernel test robot reports", I think it's an interesting case
> and worth a comment for when this happens next time...

There are 3 kernel config options before 'mmap_lock' (inside 'mm_struct'):

CONFIG_MMU
CONFIG_MEMBARRIER
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

0day's default kernel config is similar to RHEL-8.3, which has all
these three enabled. IIUC, the first 2 options are 'y' for many common
configs, while 'CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES' is only available
on x86 system.

Please review the updated patch, thanks

- Feng

Next message: lijian_8010a29: "[PATCH] fs: read_write: Fix a typo"
Previous message: Krzysztof Kozlowski: "Re: [PATCH v6 08/14] memory: tegra: Enable compile testing for all drivers"
In reply to: Waiman Long: "Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression"
Next in thread: Linus Torvalds: "Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]