Re: [PATCH] Fix a race condition between ->i_mapping and iput()

From: OGAWA Hirofumi
Date: Fri Mar 17 2006 - 09:10:28 EST


Andrew Morton <akpm@xxxxxxxx> writes:

>> This patch takes ref-count of bdev's inode for fs's inode before
>> setting a ->i_mapping, and the clear_inode() of fs's inode does iput().
>> So, if fs's inode is still living, bdev's inode shouldn't be freed.
>>
>
> OK, so to rephrase:
>
> Whenever /dev/sda's inode->i_mapping points at a kernel-internal blockdev
> inode's i_data, we will hold a ref on that blockdev inode, to pin its
> i_data. That sounds sane.

Yes, thanks.

This approach may have one issue. If user is using a on-memory fs as
/dev (i.e. udev), the inode is never freed, so blockdev's inode is
also never freed.

And if user is creating the many block special file, blockdev's inode
is also pinned additionally.

Probably we can call a igrab() for ->i_mapping instead of this approach,
however it's "run-time overhead vs memory space" trade-off.

>> diff -puN fs/block_dev.c~i_mapping-race-fix-2 fs/block_dev.c
>> --- linux-2.6/fs/block_dev.c~i_mapping-race-fix-2 2006-03-11 16:19:07.000000000 +0900
>> +++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-11 17:52:11.000000000 +0900
>> @@ -418,21 +418,31 @@ EXPORT_SYMBOL(bdput);
>> static struct block_device *bd_acquire(struct inode *inode)
>> {
>> struct block_device *bdev;
>> +
>> spin_lock(&bdev_lock);
>> bdev = inode->i_bdev;
>> - if (bdev && igrab(bdev->bd_inode)) {
>> + if (bdev) {
>> + atomic_inc(&bdev->bd_inode->i_count);
>
> No longer checking (I_FREEING|I_WILL_FREE). Why was this change included?

With this patch, inode->i_bdev is freed in clear_inode() after the
inode became the I_FREEING.

The caller is holding a inode's ref, and if the inode->i_bdev != NULL,
->i_bdev->bd_inode never have the I_FREEING/I_WILL_FREE. So we don't
need to check I_FREEING/I_WILL_FREE anymore.

>> - if (inode->i_bdev)
>> - __bd_forget(inode);
>> - inode->i_bdev = bdev;
>> - inode->i_mapping = bdev->bd_inode->i_mapping;
>> - list_add(&inode->i_devices, &bdev->bd_inodes);
>> + if (!inode->i_bdev) {
>> + /*
>> + * We take an additional bd_inode->i_count for inode,
>> + * and it's released in clear_inode() of inode.
>> + * So, we can access it via ->i_mapping always
>> + * without igrab().
>> + */
>> + atomic_inc(&bdev->bd_inode->i_count);
>> + inode->i_bdev = bdev;
>> + inode->i_mapping = bdev->bd_inode->i_mapping;
>> + list_add(&inode->i_devices, &bdev->bd_inodes);
>> + }
>
> And why this change? The removal of the __bd_forget() and the changed
> behaviour if (inode->i_bdev != NULL)?

With the above change, we don't need to care the case of igrab()
returns NULL anymore. Since inode->i_bdev->bd_inode never has the
I_FREEING, we don't need to call __bd_forget() for that case.

I think we need to address an only following simple race here.

cpu0 cpu1
-----------------------------------+-----------------------------------------
bdev = bdget(inode->i_rdev);
if (bdev) {
bdev = bdget(inode->i_rdev);
if (bdev) {
spin_lock(&bdev_lock);
if (!inode->i_bdev) {
[...]
inode->i_bdev = bdev;
spin_lock(&bdev_lock);

Thanks.
--
OGAWA Hirofumi <hirofumi@xxxxxxxxxxxxxxxxxx>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/