Re: [BUG] trigger BUG_ON in mas_store_prealloc when low memory

From: John Hsu (許永翰)
Date: Mon Aug 07 2023 - 05:55:12 EST


On Wed, 2023-07-19 at 14:51 -0400, Liam R. Howlett wrote:
>
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
> * John Hsu (許永翰) <John.Hsu@xxxxxxxxxxxx> [230712 23:26]:
> > On Mon, 2023-07-10 at 10:24 -0400, Liam R. Howlett wrote:
>
> ...
>
> > > > > > > > As you mentioned, mas_preallocate() should allocate
> enough
> > > > > node,
> > > > > > > but there is such functions mas_node_count() in
> > > > > mas_store_prealloc().
> > > > > > > > In mas_node_count() checks whether the *mas* has enough
> > > nodes,
> > > > > and
> > > > > > > allocate memory for node if there was no enough nodes in
> mas.
> > > > > > >
> > > > > > > Right, we call mas_node_count() so that both code paths
> are
> > > used
> > > > > for
> > > > > > > preallocations and regular
> mas_store()/mas_store_gfp(). It
> > > > > shouldn't
> > > > > > > take a significant amount of time to verify there is
> enough
> > > > > nodes.
> > > > > >
> > > > > > Yap..., it didn't take a significant amount of time to
> verify
> > > > > whether
> > > > > > there is enough nodes. The problem is why the flow in
> > > > > mas_node_count
> > > > > > will alloc nodes if there was no enough nodes in mas?
> > > > >
> > > > > What I meant is that both methods use the same call path
> because
> > > > > there
> > > > > is not a reason to duplicate the path. After
> mas_preallocate()
> > > has
> > > > > allocated the nodes needed, the call to check if there is
> enough
> > > > > nodes
> > > > > will be quick.
> > > >
> > > > So whether the purpose of mas_preallocate() is decreasing the
> lock
> > > > retention time?
> > >
> > > It could be, but in this case it's the locking order. We have to
> > > pre-allocate and fail early if we are out of memory, because we
> > > _cannot_
> > > use GFP_KERNEL where we call
> mas_store_prealloc(). mas_preallocate()
> > > will use GFP_KERENL though. We cannot use GFP_KERNEL during the
> > > store
> > > because reclaim may sleep and we cannot sleep holding the locks
> we
> > > need
> > > to hold at the time of the store operation in __vma_adjust().
> > >
> > Yap, if the type of lock is spinlock, the flow shouldn't sleep in
> the
> > critical sections. mmap_lock is implmented by rw_semaphore(mutex).
> Is
> > there any other lock in this section?
>
> Yes, the i_mmap_lock_write(), the anon_vma_lock_write(),
> flush_dcache_mmap_lock().
>
> > > >
> > > > > >
> > > > > > > > I think that if mas_preallocate() allocate enough node,
> why
> > > we
> > > > > > > check the node count and allocate nodes if there was no
> > > enough
> > > > > nodes
> > > > > > > in mas in mas_node_count()?
> > > > > > >
> > > > > > > We check for the above reason.
> > > > > > >
> > > > > >
> > > > > > OK..., this is one of the root cause of this BUG.
> > > > >
> > > > > The root cause is that there was not enough memory for a
> store
> > > > > operation. Regardless of if we check the allocations in the
> > > > > mas_store_prealloc() path or not, this would fail. If we
> remove
> > > the
> > > > > check for nodes within this path, then we would have to
> BUG_ON()
> > > when
> > > > > we
> > > > > run out of nodes to use or have a null pointer dereference
> BUG
> > > > > anyways.
> > > > >
> > > > Yap, the root cause is oom. The BUG_ON() for the situations
> that
> > > the
> > > > maple tree struct cannot be maintained because of the lack of
> > > memory is
> > > > necessary. But the the buddy system in linux kernel can reclaim
> > > memory
> > > > when the system is under the low memory status. If we use
> > > GFP_KERNEL
> > > > after trying GFP_NOWAIT to allocate node, maybe we can get
> enough
> > > > memory when the second try with GFP_KERNEL.
> > >
> > > Right, but the GFP_KERNEL cannot be used when holding certain
> locks
> > > because it may sleep.
> > >
> > > > > >
> > > > > > > >
> > > > > > > > We have seen that there may be some maple_tree
> operations
> > > in
> > > > > > > merge_vma...
> > > > > > >
> > > > > > > If merge_vma() does anything, then there was an operation
> to
> > > the
> > > > > > > maple
> > > > > > > tree.
> > > > > > >
> > > > > > > >
> > > > > > > > Moreover, would maple_tree provides an API for
> assigning
> > > user's
> > > > > gfp
> > > > > > > flag for allocating node?
> > > > > > >
> > > > > > > mas_preallocate() and mas_store_gfp() has gfp flags as an
> > > > > > > argument. In
> > > > > > > your call stack, it will be called in __vma_adjust() as
> such:
> > > > > > >
> > > > > > > if (mas_preallocate(&mas, vma, GFP_KERNEL))
> > > > > > > return -ENOMEM;
> > > > > > >
> > > > > > > line 715 in v6.1.25
> > > > > > >
> > > > > > > > In rb_tree, we allocate vma_area_struct (rb_node is in
> this
> > > > > > > struct.) with GFP_KERNEL, and maple_tree allocate node
> with
> > > > > > > GFP_NOWAIT and __GFP_NOWARN.
> > > > > > >
> > > > > > > We use GFP_KERNEL as I explained above for the VMA tree.
> > > > > >
> > > > > > Got it! But the mas_node_count() always use GFP_NOWAIT and
> > > > > __GFP_NOWARN
> > > > > > in inserting tree flow. Do you consider the performance of
> > > > > maintaining
> > > > > > the structure of maple_tree?
> > > > >
> > > > > Sorry, I don't understand what you mean by 'consider the
> > > performance
> > > > > of
> > > > > maintaining the structure of maple_tree'.
> > > > >
> > > > As I mentioned above, GFP_NOWAIT will not allow buddy system
> for
> > > > reclaiming memory, so "Do you consider the performance of
> > > maintaining
> > > > the structure of maple_tree" means that: whether the
> > > mas_node_count()
> > > > path is not allowed to reclaim or compact memory for the
> > > performance.
> > >
> > > Ah, no. This is not for performance. It was initially on the
> road
> > > map
> > > for performance, but it was needed for the complicated locking in
> the
> > > MM
> > > code.
> > >
> > > rb tree embedded the nodes in the VMA which is allocated outside
> this
> > > critical section and so it could use GFP_KERNEL.
> > >
> > As I know, following is rb_tree flow in 5.15.186:
> >
> > ...
> > mmap_write_lock_killable(mm)
> > ...
> > do_mmap()
> > ...
> > mmap_region()
> > ...
> > vm_area_alloc(mm)
> > ...
> > mmap_write_unlock(mm)
> >
> > vm_area_alloc is in the mmap_lock hoding period.
> > It seems that the flow would sleep here in rb_tree flow.
> > If I miss anything, please tell me, thanks!
>
> Before the mmap_write_unlock(mm) in the above sequence, the
> i_mmap_lock_write(), anon_vma_lock_write(), and/or the
> flush_dcache_mmap_lock() may be taken. Check __vma_adjust().
>
> The insertion into the tree needs to hold some subset of these locks.
> The rb-tree insert did not allocate within these locks, but the maple
> tree would need to allocate within these locks to insert into the
> tree.
> This is why the preallocation exists and why it is necessary.
>

Yap, preallocation is necessary. anon_vma_lock_write() and
flush_dcache_mmap_lock() hold the lock and manipulate rb_tree. I think
that there is no maple tree manipulations during the lock holding
period. Is there any future work in this section?

> >
> >
> > > > > >
> > > > > > > It also will drop the lock and retry with GFP_KERNEL on
> > > failure
> > > > > > > when not using the external lock. The mmap_lock is
> > > configured as
> > > > > an
> > > > > > > external lock.
> > > > > > >
> > > > > > > > Allocation will not wait for reclaiming and compacting
> when
> > > > > there
> > > > > > > is no enough available memory.
> > > > > > > > Is there any concern for this design?
> > > > > > >
> > > > > > > This has been addressed above, but let me know if I
> missed
> > > > > anything
> > > > > > > here.
> > > > > > >
> > > > > >
> > > > > > I think that the mas_node_count() has higher rate of
> triggering
> > > > > > BUG_ON() when allocating nodes with GFP_NOWAIT and
> > > __GFP_NOWARN. If
> > > > > > mas_node_count() use GFP_KERNEL as mas_preallocate() in the
> > > mmap.c,
> > > > > the
> > > > > > allocation fail rate may be lower than use GFP_NOWAIT.
> > > > >
> > > > > Which BUG_ON() are you referring to?
> > > > >
> > > > > If I was to separate the code path for mas_store_prealloc()
> and
> > > > > mas_store_gfp(), then a BUG_ON() would still need to exist
> and
> > > still
> > > > > would have been triggered.. We are in a place in the code
> where
> > > we
> > > > > should never sleep and we don't have enough memory allocated
> to
> > > do
> > > > > what
> > > > > was necessary.
> > > > >
> > > > Yap. There is no reason to seprate mas_store_prealloc() and
> > > > mas_store_gfp. Is it possible to retry to allocate mas_node
> with
> > > > GFP_KERNEL (wait for system reclaim and compact) instead of
> > > triggering
> > > > BUG_ON once the GFP_NOWAIT allocation failed?
> > >
> > > Unfortunately not, no. In some cases it may actually work out
> that
> > > the
> > > VMA may not need the locks in question, but it cannot be
> generalized
> > > for
> > > __vma_adjust(). Where I am able, I use the mas_store_gfp() calls
> so
> > > we
> > > can let reclaim and compact run, but it's not possible here.
> > >
> > We have used GFP_KERNEL as alloc flag in mas_node_count flow about
> 2
> > months. The mentioned problem we mentioned in the first mail didn't
> > reproduce under high stress stability test.
> >
> > I have seen the patch provided by you. I will verify this patch in
> our
> > stability test. But there is a problem, if maple_tree behavior is
> > unexpected (e.g. redundant write bug this time). We can only review
> the
> > whole mm flow to find out the wrong behavior. Do we have an
> efficient
> > debug method for maple tree?
>
> There is a test suite for the maple tree found in
> tools/testing/radix-tree, but it does not test the mm code and
> integration.
>
> There are other mm tests, but they will not test the OOM scenario you
> hit - to the best of my knowledge.
>
> There are also config options to debug the tree operations, but they
> do
> not detect the redundant write issues. Perhaps I can look at adding
> support for detecting redundant writes, but that will not be
> backported
> to a stable kernel.
>

The sufficient test cases of maple tree ensure the function work well.
But the redundant operations (alloc node, free node, tree
manipulations) of maple_tree are not easy to detect (e.g. the case
reported this time and mas_preallocate() allocates redundant nodes with
the worst case).

The detecting redundant writes mechanism may help the developers to
find out the problems easier. Hope it can be establised successfully!!
> Thanks,
> Liam

Best Regards,
John Hsu