Re: linux-next: slab shrinkers: BUG at mm/list_lru.c:92
From: Michal Hocko
Date: Mon Jul 15 2013 - 05:14:35 EST

On Thu 04-07-13 18:36:43, Michal Hocko wrote:
> On Wed 03-07-13 21:24:03, Dave Chinner wrote:
> > On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> > > On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> > > [...]
> > > > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > > > info, Michal, it's time to go look at the code....
> > >
> > > OK, just in case we will need it, I am keeping the machine in this state
> > > for now. So we still can play with crash and check all the juicy
> > > internals.
> >
> > My current suspect is the LRU_RETRY code. I don't think what it is
> > doing is at all valid - list_for_each_safe() is not safe if you drop
> > the lock that protects the list. i.e. there is nothing that protects
> > the stored next pointer from being removed from the list by someone
> > else. Hence what I think is occurring is this:
> >
> >
> > thread 1                        thread 2
> > lock(lru)
> > list_for_each_safe(lru)         lock(lru)
> >   isolate                       ......
> >     lock(i_lock)
> >     has buffers
> >       __iget
> >       unlock(i_lock)
> >       unlock(lru)
> >       .....                     (gets lru lock)
> >                                 list_for_each_safe(lru)
> >                                   walks all the inodes
> >                                   finds inode being isolated by other thread
> >                                   isolate
> >                                     i_count > 0
> >                                       list_del_init(i_lru)
> >                                       return LRU_REMOVED;
> >                                   moves to next inode, inode that
> >                                   other thread has stored as next
> >                                   isolate
> >                                     i_state |= I_FREEING
> >                                     list_move(dispose_list)
> >                                     return LRU_REMOVED
> >                                   ....
> >                                   unlock(lru)
> >       lock(lru)
> >       return LRU_RETRY;
> >   if (!first_pass)
> >     ....
> >   --nr_to_scan
> >   (loop again using next, which has already been removed from the
> >   LRU by the other thread!)
> >   isolate
> >     lock(i_lock)
> >     if (i_state & ~I_REFERENCED)
> >       list_del_init(i_lru)        <<<<< inode is on dispose list!
> >                                   <<<<< inode is now isolated, with I_FREEING set
> >       return LRU_REMOVED;
> >
> > That fits the corpse left on your machine, Michal. One thread has
> > moved the inode to a dispose list, the other thread thinks it is
> > still on the LRU and should be removed, and removes it.
> >
> > This also explains the lru item count going negative - the same item
> > is being removed from the lru twice. So it seems like all the
> > problems you've been seeing are caused by this one problem....
> >
> > Patch below that should fix this.
>
> Good news! The test has been running since morning and it has neither
> hung nor crashed. So this really looks like the right fix. It will also
> run over the weekend to be 100% sure. But I guess it is safe to say
>
> Tested-by: Michal Hocko <mhocko@xxxxxxx>

And I can finally confirm this after testing over the weekend on ext3.
Thanks a lot for your help Dave!
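
For anyone reading the archive later, the interleaving in Dave's diagram
can be replayed without threads. What follows is a minimal userspace
sketch, not kernel code and not the fix itself. The list helpers are
simplified re-implementations named after their kernel counterparts, and
simulate_stale_next(), struct race_result and nr_items are hypothetical
names invented for this illustration. The sketch assumes an LRU holding
just two entries, a (the inode with buffers) and b (the inode that ends
up with I_FREEING set).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified circular doubly linked list, modelled on struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink e from whatever list it is currently on, then self-link it. */
static void list_del_init(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    INIT_LIST_HEAD(e);
}

static void list_move(struct list_head *e, struct list_head *head)
{
    list_del_init(e);
    e->next = head->next;
    e->prev = head;
    head->next->prev = e;
    head->next = e;
}

static bool list_empty(const struct list_head *h) { return h->next == h; }

/* Hypothetical result type for this sketch. */
struct race_result {
    long lru_count;      /* LRU item count after both walkers ran */
    bool dispose_empty;  /* true if b vanished from the dispose list */
};

struct race_result simulate_stale_next(void)
{
    struct list_head lru, dispose, a, b;
    long nr_items = 0;

    INIT_LIST_HEAD(&lru);
    INIT_LIST_HEAD(&dispose);
    list_add_tail(&a, &lru); nr_items++;
    list_add_tail(&b, &lru); nr_items++;

    /* Thread 1: list_for_each_safe() caches the next entry, then the
     * isolate callback drops the lru lock while working on a. */
    struct list_head *pos = &a;
    struct list_head *next = pos->next;
    assert(next == &b);

    /* Thread 2, while the lock is dropped: removes a (i_count > 0),
     * then moves b to its private dispose list (I_FREEING). */
    list_del_init(&a);       nr_items--;
    list_move(&b, &dispose); nr_items--;

    /* Thread 1 retakes the lock, gets LRU_RETRY, and carries on with the
     * cached next pointer. b is no longer on the LRU, but list_del_init()
     * cannot tell: it unlinks b from the dispose list instead. */
    pos = next;
    list_del_init(pos);      nr_items--;

    return (struct race_result){ nr_items, list_empty(&dispose) };
}
```

Calling simulate_stale_next() leaves the count at -1 and the dispose list
empty, which corresponds to the two symptoms discussed in this thread: an
item count that goes negative, and an inode with I_FREEING set that is on
no list at all.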
--
Michal Hocko
SUSE Labs