I found a race condition in the 2.4.20 kernel that involves
iput() and the syncing of inodes to disk when their reference
counts drop to 0. The problem involves the temporary releasing
of inode_lock while the inode is being written to disk. Once
the write is finished, inode_lock is again acquired with the
assumption that no other task has removed the inode from the
inode_unused list and started to destroy it. However, this
assumption may be violated if another task is executing
kill_super() while the write is in progress. Here is a scenario
that demonstrates the problem:
1. Task A is inside kill_super(). It clears the MS_ACTIVE
flag of the s_flags field of the super_block struct and
calls invalidate_inodes(). In invalidate_inodes(), it
attempts to acquire inode_lock and spins because task B,
executing inside iput(), just decremented the reference
count of some inode i to 0 and acquired inode_lock.
2. Then the "if (!inode->i_nlink)" test condition evaluates
to false for task B so it executes the following chunk
of code:
01 } else {
02 if (!list_empty(&inode->i_hash)) {
03 if (!(inode->i_state & (I_DIRTY|I_LOCK))) {
04 list_del(&inode->i_list);
05 list_add(&inode->i_list, &inode_unused);
06 }
07 inodes_stat.nr_unused++;
08 spin_unlock(&inode_lock);
09 if (!sb || (sb->s_flags & MS_ACTIVE))
10 return;
11 write_inode_now(inode, 1);
12 spin_lock(&inode_lock);
13 inodes_stat.nr_unused--;
14 list_del_init(&inode->i_hash);
15 }
16 list_del_init(&inode->i_list);
17 inode->i_state|=I_FREEING;
18 inodes_stat.nr_inodes--;
19 spin_unlock(&inode_lock);
20 if (inode->i_data.nrpages)
21 truncate_inode_pages(&inode->i_data, 0);
22 clear_inode(inode);
23 }
24 destroy_inode(inode);
Now the test condition on line 02 evaluates to true, so
task B adds the inode to the inode_unused list and then
releases inode_lock on line 08.
3. Now A acquires inode_lock and B spins on inode_lock inside
write_inode_now();
4. Task A calls invalidate_list(), transferring inode i from
the inode_unused list to its own private throw_away list.
5. Task A releases inode_lock, allowing B to acquire inode_lock
and continue executing.
6. A attempts to destroy inode i inside dispose_list() while B
simultaneously attempts to destroy i, potentially causing
all sorts of bad things to happen.
To fix the problem, I have appended to this message a 2.4.20 kernel
patch for fs/inode.c. It alters the behavior of iput() as follows:
Move the "if (!sb || (sb->s_flags & MS_ACTIVE))" so that is
evaluated while still holding inode_lock. Then add the inode to
the inode_unused list (provided that that the inode is neither
locked nor dirty), release inode_lock, and return only if the
test condition evaluates to true. Otherwise do not add the
inode to the inode_unused list. Instead, remove the inode from
its hash list so no other tasks can obtain references to it,
call __iget() to increment the reference count back to 1 and
add the inode to nodes_in_use, and then release inode_lock and
write the inode to disk. Since we have a reference to the inode,
no other task can attempt to destroy it while the write is in
progress. Once the write finished, reacquire inode_lock, set the
inode's reference count to 0, remove the inode from the
nodes_in_use list, and destroy the inode.
In addition, the patch fixes a minor problem in which the bookkeeping
for inodes_stat.nr_unused was being done incorrectly. It also
cleans up the function sync_one() which is shown here in its original
form:
static inline void sync_one(struct inode *inode, int sync)
{
while (inode->i_state & I_LOCK) {
__iget(inode);
spin_unlock(&inode_lock);
__wait_on_inode(inode);
iput(inode);
spin_lock(&inode_lock);
}
__sync_one(inode, sync);
}
The logic of this function is inherently flawed in the way that it
manipulates the reference count on the inode. Presumably, __iget() is
called to prevent the inode from being destroyed after inode_lock is
released. After __wait_on_inode() returns, iput() is called to release
the reference that was just acquired. Then, inode_lock is acquired
again. This is problematic because the call to iput() could cause the
reference count on the inode to drop to 0, resulting in a call to
__sync_one() on an inode that has been destroyed. The assumption
should be that prior to calling sync_one(), the caller has obtained its
own reference to the inode. After examining all callers of sync_one()
and verifying that this assumption does in fact hold, I have removed
the calls to __iget() and iput() from sync_one().
-Dave Peterson
dsp@llnl.gov
P.S. Please CC my email address when responding to this message.
----- START OF PATCH FOR fs/inode.c (kernel 2.4.20) --------------------------
--- inode.c.original Fri Apr 25 16:49:08 2003
+++ inode.c Fri Apr 25 17:04:49 2003
@@ -202,7 +202,6 @@
list_del(&inode->i_list);
list_add(&inode->i_list, &inode_in_use);
}
- inodes_stat.nr_unused--;
}
static inline void __sync_one(struct inode *inode, int sync)
@@ -237,8 +236,10 @@
to = &inode->i_sb->s_dirty;
else if (atomic_read(&inode->i_count))
to = &inode_in_use;
- else
+ else {
to = &inode_unused;
+ inodes_stat.nr_unused++;
+ }
list_del(&inode->i_list);
list_add(&inode->i_list, to);
}
@@ -248,10 +249,8 @@
static inline void sync_one(struct inode *inode, int sync)
{
while (inode->i_state & I_LOCK) {
- __iget(inode);
spin_unlock(&inode_lock);
__wait_on_inode(inode);
- iput(inode);
spin_lock(&inode_lock);
}
@@ -1065,18 +1064,25 @@
BUG();
} else {
if (!list_empty(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_LOCK))) {
- list_del(&inode->i_list);
- list_add(&inode->i_list, &inode_unused);
+ if (!sb || (sb->s_flags & MS_ACTIVE)) {
+ if (!(inode->i_state &
+ (I_DIRTY|I_LOCK))) {
+ list_del(&inode->i_list);
+ list_add(&inode->i_list,
+ &inode_unused);
+ inodes_stat.nr_unused++;
+ }
+
+ spin_unlock(&inode_lock);
+ return;
}
- inodes_stat.nr_unused++;
+
+ list_del_init(&inode->i_hash);
+ __iget(inode);
spin_unlock(&inode_lock);
- if (!sb || (sb->s_flags & MS_ACTIVE))
- return;
write_inode_now(inode, 1);
spin_lock(&inode_lock);
- inodes_stat.nr_unused--;
- list_del_init(&inode->i_hash);
+ atomic_set(&inode->i_count, 0);
}
list_del_init(&inode->i_list);
inode->i_state|=I_FREEING;
----- END OF PATCH FOR fs/inode.c --------------------------------------------
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Wed Apr 30 2003 - 22:00:30 EST