Re: 2.0.32 items (was NON IRQ DEADLOCK in 2.0.31)

Doug Ledford (dledford@dialnet.net)
Fri, 14 Nov 1997 18:45:55 -0600 (CST)


On 14-Nov-97 Rob Hagopian wrote:
>> Stephen Tweedie's inode fix (the stuff about i_mmap ring corrupted, this
>> is done differently than the approach Ben LaHaise took to the problem)
>
>Thx! Interesting... where did this enter? Any comments on 2.0.32-2
>w/ben-3-inode vs 2.0.32-4?
> -Rob H.

Stephen posted an email, including the patch, to the LMP group. In short,
the two methods differ in how they solve the same problem. The problem can
be summed up as a change in the inode interface in the VFS. The process of
writing and destroying an inode at the VFS level was no longer atomic. This
created a window where an individual FS driver could start tearing down an
inode it was using, then block, and another process could end up grabbing that
same inode while it was halfway through being destroyed, causing corruption.
The approach of the ben-3-inode patch was to fix the individual file systems
such that they no longer expected the operation to be atomic and could
handle the possibility of an inode getting grabbed by something else. The
approach Stephen took was to change the VFS to restore the guarantee of
atomicity (actually, it doesn't make the action atomic again; instead, it
adds a new flag on inodes, "condemned", so that we can tell when an inode is
in the process of being torn down). This way, one change to the VFS fixed
all of the drivers back to what they once were. That, in my opinion, is what
makes Stephen's approach the correct one to use.

As a reference, I include Stephen's explanation from his email here (he can
say it better than I can):

From: "Stephen C. Tweedie" <sct@dcs.ed.ac.uk>
Subject: [LMP-CORE] Fix for 2.0.31 VFS

Hi folks,

We have a problem in 2.0's inode.c. I know, I know, we've had a lot
of them... and most of them boil down to the fact that clear_inode()
is no longer atomic --- we need to pull down inode mappings during
clear_inode(), and that necessarily blocks if there is IO in progress
(for example, readahead).

We have already added the fix to increment i_count while we do this,
and that prevents get_empty_inode() from reusing the inode while it is
being pulled down. However, that doesn't help in the scenario where
we have two iget()s going on at once on separate inodes:

Process A does an iget() on inode X. The inode is not loaded, so we
look for an existing cached inode to reuse. Choose inode Y, and start
to destroy it so we can reuse the struct inode. This blocks for some
reason...

Process B does an iget() on inode Y. Oh dear --- that succeeds, but
the inode is shortly going to get blown apart.

Once we are committed to destroying a cached inode, we must prevent that
inode from being reused. We have two choices: we can either defer the
final decision to destroy the inode until after all of the blocking
activity has occurred, or we can prevent the inode's reuse once we have
started to destroy it. Doing the former essentially makes clear_inode()
return a success or failure status, and since every filesystem must call
clear_inode() during put_inode(), that's more of a destabilising risk
than I'd like to see on 2.0. So, that leaves us with the alternative of
locking the inode against reuse while the inode is dying.

However, we cannot lock the inode, since truncate_inode_pages() may
want to lock the inode if it needs to write out dirty mapped pages. The
patch below introduces a new "condemned" flag which, when set, allows
normal operations on the inode to proceed correctly but prevents any
other reuse of the inode.

----------------------------------
E-Mail: Doug Ledford <dledford@dialnet.net>
Date: 14-Nov-97
Time: 18:45:57
----------------------------------