Re: ext3 lockdep warning in 2.6.25-rc6

From: Jan Kara
Date: Wed Apr 02 2008 - 04:12:40 EST


On Tue 01-04-08 17:23:34, Erez Zadok wrote:
<snip>
> > From f5e41087e345fa5c3b46ac36e6e4a654d2f7f624 Mon Sep 17 00:00:00 2001
> > From: Jan Kara <jack@xxxxxxx>
> > Date: Tue, 18 Mar 2008 14:38:06 +0100
> > Subject: [PATCH] Fix drop_pagecache_sb() to not call __invalidate_mapping_pages() under
> > inode_lock.
> >
> > Signed-off-by: Jan Kara <jack@xxxxxxx>
> > ---
> > fs/drop_caches.c | 8 +++++++-
> > 1 files changed, 7 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> > index 59375ef..f5aae26 100644
> > --- a/fs/drop_caches.c
> > +++ b/fs/drop_caches.c
> > @@ -14,15 +14,21 @@ int sysctl_drop_caches;
> >
> > static void drop_pagecache_sb(struct super_block *sb)
> > {
> > - struct inode *inode;
> > + struct inode *inode, *toput_inode = NULL;
> >
> > spin_lock(&inode_lock);
> > list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > if (inode->i_state & (I_FREEING|I_WILL_FREE))
> > continue;
> > + __iget(inode);
> > + spin_unlock(&inode_lock);
> > __invalidate_mapping_pages(inode->i_mapping, 0, -1, true);
> > + iput(toput_inode);
> > + toput_inode = inode;
> > + spin_lock(&inode_lock);
> > }
> > spin_unlock(&inode_lock);
> > + iput(toput_inode);
> > }
>
> Jan, I'll be happy to test this, but I don't understand two things about
> this patch:
>
> 1. Is it safe to unlock and re-lock inode_lock temporarily within the loop?
>
> 2. What's the motivation behind having the second toput_inode pointer? It
> appears that the first iteration through the loop, toput_inode will be
> NULL, so we'll be iput'ing a NULL pointer (which is ok). So you're
> trying to iput the previous inode pointer that the list iterated over,
> right? Is that intended?
I'll try to explain the locking here:
1) We are not allowed to call into __invalidate_mapping_pages() with
inode_lock held - that it the bug lockdep is complaining about. Moreover it
leads to rather long waiting times for inode_lock (quite possibly several
seconds).
2) When we release inode_lock, we need to protect from inode going away,
thus we hold reference to it - that guarantees us inode stays in the list.
3) When we continue scanning of the list we must get inode_lock before we
put the inode reference to avoid races. But we cannot do iput() when we
hold inode_lock. Thus we save pointer to inode and do iput() next time we
have released the inode_lock...

> Peter's post:
>
> http://lkml.org/lkml/2008/3/23/202
>
> includes a reference to a mail by Andrew which implies that the fix may be
> much more involved than what you outlined above, no?
Definitely we can do a more involved fix ;) What Andrew proposes would have
some other benefits as well. But until somebody gets to that, this slight
hack should work fine (Andrew has already merged my patch in -mm so I
guess he agrees ;).

Honza
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/