Re: 2.6.33-rc1: kernel BUG at fs/ext4/inode.c:1063 (sparc)

From: Dmitry Monakhov
Date: Wed Dec 30 2009 - 08:18:22 EST


tytso@xxxxxxx writes:

> On Sun, Dec 27, 2009 at 10:51:59PM -0500, tytso@xxxxxxx wrote:
>> OK, i've been able to reproduce the problem using xfsqa test #74
>> (fstest) when an ext3 file system is mounted the ext4 file system
>> driver. I was then able to bisect it down to commit d21cd8f6, which
>> was introduced between 2.6.33-rc1 and 2.6.33-rc2, as part of
>> quota/ext4 patch series pushed by Jan.
>
> OK, here's a patch which I think should avoid the BUG in
> fs/ext4/inode.c. It should fix the regression, but in the long run we
> need to pretty seriously rethink how we account for the need for
> potentially new meta-data blocks when doing delayed allocation.
>
> The remaining problem with this machinery is that
> ext4_da_update_reserve_space() and ext4_da_release_space() is that
> they both try to calculate how many metadata blocks will potentially
> required by calculating ext4_calc_metadata_amount() based on the
> number of delayed allocation blocks found in i_reserved_data_blocks.
> The problem is that ext4_calc_metadata_amount() assumes that the
> number of blocks passed to it is contiguous, and what might be left
> remaining to be written in the page cache could be anything but
> contiguous. This is a problem which has always been there, so it's
> not a regression per se; just a design flaw.
Hello, I've finally able to reproduce the issue. I'm agree with your
diagnose. But while looking in to code i've found some questions
see late in the message.
>
> The patch below should fixes the regression caused by commit d21cd8f,
> but we need to look much more closely to find a better way of
> accounting for the potential need for metadata for inodes facing
> delayed allocation. Could people who are having problems with the BUG
> in line 1063 of fs/ext4/inode.c try this patch?
>
> Thanks!!
>
> - Ted
>
>
> commit 48b71e562ecd35ab12f6b6420a92fb3c9145da92
> Author: Theodore Ts'o <tytso@xxxxxxx>
> Date: Wed Dec 30 00:04:04 2009 -0500
>
> ext4: Patch up how we claim metadata blocks for quota purposes
>
> Commit d21cd8f triggered a BUG in the function
> ext4_da_update_reserve_space() found in fs/ext4/inode.c, which was
> caused by fact that ext4_calc_metadata_amount() can over-estimate how
> many metadata blocks will be needed, especially when using direct
> block-mapped files. Work around this by not claiming any excess
> metadata blocks than we are prepared to claim at this point.
>
> Signed-off-by: "Theodore Ts'o" <tytso@xxxxxxx>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3e3b454..d6e84b4 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1058,14 +1058,23 @@ static void ext4_da_update_reserve_space(struct inode *inode, int used)
> mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
>
> if (mdb_free) {
> - /* Account for allocated meta_blocks */
> + /*
> + * Account for allocated meta_blocks; it is possible
> + * for us to have allocated more meta blocks than we
> + * are prepared to free at this point. This is
> + * because ext4_calc_metadata_amount can over-estimate
> + * how many blocks are still needed. So we may not be
> + * able to claim all of the allocated meta blocks
> + * right away. The accounting will work out in the end...
> + */
> mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
> - BUG_ON(mdb_free < mdb_claim);
> + if (mdb_free < mdb_claim)
> + mdb_claim = mdb_free;
> mdb_free -= mdb_claim;
>
> /* update fs dirty blocks counter */
> percpu_counter_sub(&sbi->s_dirtyblocks_counter, mdb_free);
> - EXT4_I(inode)->i_allocated_meta_blocks = 0;
> + EXT4_I(inode)->i_allocated_meta_blocks -= mdb_claim;
> EXT4_I(inode)->i_reserved_meta_blocks = mdb;
> }
>
> @@ -1845,7 +1854,7 @@ repeat:
> static void ext4_da_release_space(struct inode *inode, int to_free)
> {
> struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> - int total, mdb, mdb_free, release;
> + int total, mdb, mdb_free, mdb_claim, release;
>
> if (!to_free)
> return; /* Nothing to release, exit */
> @@ -1874,6 +1883,16 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
> BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
> mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
>
> + if (mdb_free) {
> + /* Account for allocated meta_blocks */
> + mdb_claim = EXT4_I(inode)->i_allocated_meta_blocks;
> + if (mdb_free < mdb_claim)
> + mdb_claim = mdb_free;
> + mdb_free -= mdb_claim;
> +
> + EXT4_I(inode)->i_allocated_meta_blocks -= mdb_claim;
> + }
> +
Seems what this is not enough.
Just imagine, we may have following call-trace:

userspace pwrite(fd, d, 1000, off)
->ext4_da_reserve_space(inode, 1000)
->dq_reserve_space(1000 + md_needed)
userspace ftruncate(fd, off) /* "off" is the same as in pwrite call */
->ext4_da_invalidatepage()
->ext4_da_page_release_reservation()
->ext4_da_release_space()
<<< And we decrease ->i_allocated_meta_blocks only if (mdb_free > 0)
userspace close(fd)
So reserved metadata blocks will leak. I'm able to reproduce it like this:
quotacheck -cu /mnt
quotaon /mnt
fsstres -p 16 -d /mnt -l999999999 -n99999999&
sleep 180
killall -9 fsstress
sync; sync;
cp /mnt/aquota.user > q1
quotaoff /mnt
quotacheck -cu /mnt/ # recaculate real quota usage.
cp /mnt/aquota.user > q2
diff -up q1 q2 # in my case i've found 1 block leaked.

IMHO we may drop i_allocated_meta_block in ext4_release_file()
But while looking in to this function i've found another question
about locking
static int ext4_release_file(struct inode *inode, struct file *filp)
{
if (EXT4_I(inode)->i_state & EXT4_STATE_DA_ALLOC_CLOSE) {
ext4_alloc_da_blocks(inode);
EXT4_I(inode)->i_state &= ~EXT4_STATE_DA_ALLOC_CLOSE;
<<< Seems what i_state modification must being protected by i_mutex,
but currently caller don't have to hold it.
.....
}

> release = to_free + mdb_free;
>
> /* update fs dirty blocks counter for truncate case */
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/