Re: [QUESTION] about the maple tree and current status of mmap_lock scalability

From: Hyeonggon Yoo
Date: Mon Jan 02 2023 - 07:04:51 EST


From: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx>
To: Matthew Wilcox <willy@xxxxxxxxxxxxx>
Cc: linux-mm@xxxxxxxxx, liam.howlett@xxxxxxxxxx, surenb@xxxxxxxxxx,
ldufour@xxxxxxxxxxxxx, michel@xxxxxxxxxxxxxx, vbabka@xxxxxxx,
linux-kernel@xxxxxxxxxxxxxxx
In-Reply-To: <Y63FmaNoLAcdsLaU@xxxxxxxxxxxxxxxxxxxx>

On Thu, Dec 29, 2022 at 04:51:37PM +0000, Matthew Wilcox wrote:
> On Thu, Dec 29, 2022 at 11:22:28PM +0900, Hyeonggon Yoo wrote:
> > On Wed, Dec 28, 2022 at 08:50:36PM +0000, Matthew Wilcox wrote:
> > > The long term goal is even larger than this. Ideally, the VMA tree
> > > would be protected by a spinlock rather than a mutex.
> >
> > You mean replacing mmap_lock rwsem with a spinlock?
> > How is that possible if readers can take it for page fault?
>
> The mmap_lock is taken for many, many things. So the plan was to
> have a spinlock in the maple tree (indeed, there's still one there;
> it's just in a union with the lockdep_map_p). VMA readers would walk
> the tree protected only by RCU; VMA writers would take the spinlock
> while modifying the tree. The work Suren, Liam & I are engaged in
> still uses the mmap semaphore for writers, but we do walk the tree
> under RCU protection.
>

Thanks, I get it. So the goal is to reduce the overhead of maple tree modification.

> > > While I've read the RCUVM paper, I wouldn't say it was particularly an
> > > inspiration. The Maple Tree is independent of the VM; it's a general
> > > purpose B-tree.
> >
> > My intention was to ask how to synchronize with other VMA operations
> > after the tree traversal with RCU. (Because it's unreasonable to handle
> > page fault in RCU read-side critical section)
> >
> > Per-VMA lock seem to solve it by taking the VMA lock in read mode within
> > RCU read-side critical section.
>
> Right, but it's a little more complex than that. The real "lock" on
> the VMA is actually a sequence count. https://lwn.net/Articles/906852/
> does a good job of explaining it, but the VMA lock is really there as
> a convenient way for the writer to wait for readers to be sufficiently
> "finished" with handling the page fault that any conflicting changes
> will be correctly retired.

Oh, thanks, nice article!
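
If I read the article correctly, the sequence-count idea can be modelled
roughly as below. This is only a toy, single-threaded sketch to check my
understanding: the toy_* names are made up, and it leaves out the per-VMA
rwsem that readers also take as well as all memory ordering.

```c
#include <stdbool.h>

/* Toy model only; none of these names exist in the kernel. */
struct toy_mm  { int lock_seq; };
struct toy_vma { struct toy_mm *mm; int lock_seq; };

/* Writer (already holding mmap_lock for write) marks one VMA as locked. */
static void toy_vma_write_lock(struct toy_vma *vma)
{
	vma->lock_seq = vma->mm->lock_seq;
}

/* Dropping mmap_lock bumps the mm count, which unlocks every VMA at once. */
static void toy_mmap_write_unlock(struct toy_mm *mm)
{
	mm->lock_seq++;
}

/* Reader: the fault path may use the VMA only if the counts differ. */
static bool toy_vma_read_trylock(const struct toy_vma *vma)
{
	return vma->lock_seq != vma->mm->lock_seq;
}
```

So a writer never has to walk back over the VMAs it locked; a single
increment of the mm count releases them all.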

> https://www.infradead.org/~willy/linux/store-free-page-faults.html
> outlines how I intend to proceed from Suren's current scheme (where
> RCU is only used to protect the tree walk) to using RCU for the
> entire page fault.

Thank you for sharing your outline.
Okay, so the planned scheme is:

1. Try to process entire page fault under RCU protection
- if failed, goto 2. if succeeded, goto 4.

2. Fall back to Suren's scheme (try to take VMA rwsem)
- if failed, goto 3. if succeeded, goto 4.

3. Fall back to mmap_lock
- goto 4.

4. Finish page fault.
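
In code, the control flow of the steps above would look roughly like this.
It is only a sketch of the fallback order: fault_sim and handle_fault() are
hypothetical names, and the outcome of each attempt is passed in by the
caller instead of coming from a real page fault.

```c
#include <stdbool.h>

/* Which path ended up finishing the fault (step 4). */
enum fault_path { FAULT_RCU, FAULT_VMA_LOCK, FAULT_MMAP_LOCK };

/* Simulated outcomes of the two optimistic attempts (hypothetical). */
struct fault_sim {
	bool rcu_ok;		/* step 1 would succeed */
	bool vma_lock_ok;	/* step 2 would succeed */
};

static enum fault_path handle_fault(const struct fault_sim *sim)
{
	if (sim->rcu_ok)		/* 1. whole fault under RCU */
		return FAULT_RCU;
	if (sim->vma_lock_ok)		/* 2. Suren's scheme: per-VMA lock */
		return FAULT_VMA_LOCK;
	return FAULT_MMAP_LOCK;		/* 3. last resort: mmap_lock */
}
```

Each later step is tried only when the cheaper one before it has failed,
and the mmap_lock path is the only one that cannot fail over to anything
else.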

To implement 1, __p*d_alloc() needs to take gfp flags so that it does
not sleep in an RCU read-side critical section.

What about introducing a PF_MEMALLOC_NOWAIT process flag that forces
GFP_NOWAIT | __GFP_NOWARN, similar to PF_MEMALLOC_NO{FS,IO}?

Something like the patch below would cause less churn.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..77b88f30523b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1725,7 +1725,7 @@ extern struct pid *cad_pid;
 #define PF_USED_MATH		0x00002000	/* If unset the fpu must be initialized before use */
 #define PF__HOLE__00004000	0x00004000
 #define PF_NOFREEZE		0x00008000	/* This thread should not be frozen */
-#define PF__HOLE__00010000	0x00010000
+#define PF_MEMALLOC_NOWAIT	0x00010000	/* All allocation requests will force GFP_NOWAIT | __GFP_NOWARN */
 #define PF_KSWAPD		0x00020000	/* I am kswapd */
 #define PF_MEMALLOC_NOFS	0x00040000	/* All allocation requests will inherit GFP_NOFS */
 #define PF_MEMALLOC_NOIO	0x00080000	/* All allocation requests will inherit GFP_NOIO */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..4a1196646951 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -204,7 +204,8 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 {
 	unsigned int pflags = READ_ONCE(current->flags);
 
-	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_PIN))) {
+	if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS
+			       | PF_MEMALLOC_PIN | PF_MEMALLOC_NOWAIT))) {
 		/*
 		 * NOIO implies both NOIO and NOFS and it is a weaker context
 		 * so always make sure it makes precedence
@@ -216,6 +217,8 @@ static inline gfp_t current_gfp_context(gfp_t flags)

 		if (pflags & PF_MEMALLOC_PIN)
 			flags &= ~__GFP_MOVABLE;
+		if (pflags & PF_MEMALLOC_NOWAIT)
+			flags = GFP_NOWAIT | __GFP_NOWARN;
 	}
 	return flags;
 }
@@ -305,6 +308,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nowait_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOWAIT;
+	current->flags |= PF_MEMALLOC_NOWAIT;
+	return flags;
+}
+
+static inline void memalloc_nowait_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOWAIT) | flags;
+}
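
The fault path would then wrap the RCU attempt in a
memalloc_nowait_save()/memalloc_nowait_restore() pair, the same way the
NOFS/NOIO scopes are used. To show that such scopes nest correctly, here
is a small userspace model of the save/restore logic. It is just a sketch:
toy_task_flags stands in for current->flags, and all toy_* names are made
up for this example.

```c
#define TOY_PF_MEMALLOC_NOWAIT 0x00010000u

static unsigned int toy_task_flags;	/* stands in for current->flags */

/* Enter a no-wait scope; returns the previous state of the flag. */
static unsigned int toy_nowait_save(void)
{
	unsigned int old = toy_task_flags & TOY_PF_MEMALLOC_NOWAIT;

	toy_task_flags |= TOY_PF_MEMALLOC_NOWAIT;
	return old;
}

/* Leave the scope by putting back whatever state save() returned. */
static void toy_nowait_restore(unsigned int old)
{
	toy_task_flags = (toy_task_flags & ~TOY_PF_MEMALLOC_NOWAIT) | old;
}

/* An allocation may block only while no no-wait scope is active. */
static int toy_may_sleep(void)
{
	return !(toy_task_flags & TOY_PF_MEMALLOC_NOWAIT);
}
```

Because restore() puts back the saved bit instead of clearing it, leaving
an inner scope does not end an outer one.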


--
Thanks,
Hyeonggon