Re: Ext4: deadlock occurs when running fsstress and ENOSPC errors are seen.

From: Amit Sahrawat
Date: Wed Apr 16 2014 - 01:00:34 EST


Thanks Ted, for the detailed reply.

We could not retain the original HDD â on which we got the issue. In
order to replicate the problem we did steps like this:
1) Make 250MB partition
2) Format the partition with blocksize â4Kâ
3) Using dumpe2fs â get the block group information
Group 0: (Blocks 0-32767)
Checksum 0x89ae, unused inodes 29109
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-15
Block bitmap at 16 (+16), Inode bitmap at 32 (+32)
Inode table at 48-957 (+48)
26826 free blocks, 29109 free inodes, 2 directories, 29109 unused inodes
Free blocks: 5942-32767
Free inodes: 12-29120
Group 1: (Blocks 32768-58226) [INODE_UNINIT]
Checksum 0xb43a, unused inodes 29120
Backup superblock at 32768, Group descriptors at 32769-32769
Reserved GDT blocks at 32770-32783
Block bitmap at 17 (bg #0 + 17), Inode bitmap at 33 (bg #0 + 33)
Inode table at 958-1867 (bg #0 + 958)
25443 free blocks, 29120 free inodes, 0 directories, 29120 unused inodes
Free blocks: 32784-58226
Free inodes: 29121-58240
4) Corrupt the block group â1â by writing all â1â, we had one file
with all 1âs, so using âddâ â
dd if=i_file of=/dev/sdb1 bs=4096 seek=17 count=1
After this mount the partition â create few random size files and then
ran âfsstress,

fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir

and we get logs like before the hang:
#fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
seed = 582332
EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
a risk of filesystem corruption in case of system crash.

EXT4-fs (sdb1): delayed block allocation failed for inode 26 at
logical offset 181 with max blocks 2 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost

EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=51
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=9
EXT4-fs (sdb1): i_reserved_meta_blocks=1

EXT4-fs (sdb1): delayed block allocation failed for inode 101 at
logical offset 198 with max blocks 1 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost

EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=43
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=1
EXT4-fs (sdb1): i_reserved_meta_blocks=1

EXT4-fs (sdb1): delayed block allocation failed for inode 94 at
logical offset 450 with max blocks 1 with error -28
EXT4-fs (sdb1): This should not happen!! Data will be lost

EXT4-fs (sdb1): Total free blocks count 0
EXT4-fs (sdb1): Free/Dirty block details
EXT4-fs (sdb1): free_blocks=25443
EXT4-fs (sdb1): dirty_blocks=36
EXT4-fs (sdb1): Block reservation details
EXT4-fs (sdb1): i_reserved_data_blocks=12
EXT4-fs (sdb1): i_reserved_meta_blocks=2


â
EXT4-fs (sdb1): error count: 3
EXT4-fs (sdb1): initial error at 545: ext4_mb_generate_buddy:743
EXT4-fs (sdb1): last error at 42: ext4_mb_generate_buddy:743
â
Yes, we are running this on ARM target and we did not set the time
before running these operations. So, the time actually corresponds to
date -d @545
Thu Jan 1 05:39:05 IST 1970
date -d @42
Thu Jan 1 05:30:42 IST 1970


Yes, we are running kernel version 3.8 and applied the patches from
âDarrickâ to fix the first looping around in ext4_da_writepages.

As you suggested, we also made change to mark the FS READ ONLY in case
of corruption. The changes are;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c0fbd96..04f3a66 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1641,8 +1641,10 @@ static void mpage_da_map_and_submit(struct
mpage_da_data *mpd)
mpd->b_size >> mpd->inode->i_blkbits, err);
ext4_msg(sb, KERN_CRIT,
"This should not happen!! Data will be lost\n");
- if (err == -ENOSPC)
+ if (err == -ENOSPC) {
ext4_print_free_blocks(mpd->inode);
+ mpd->inode->i_sb->s_flags |= MS_RDONLY;
+ }
}
/* invalidate all the pages */
ext4_da_block_invalidatepages(mpd);
@@ -2303,6 +2305,36 @@ out:
return ret;
}

+static void ext4_invalidate_mapping(struct address_space *mapping)
+{
+ struct pagevec pvec;
+ unsigned int i;
+ pgoff_t index = 0;
+ struct inode* inode = mapping->host;
+
+ pagevec_init(&pvec, 0);
+ while (pagevec_lookup_tag(&pvec, mapping, &index, PAGECACHE_TAG_DIRTY,
+ PAGEVEC_SIZE)) {
+ for (i = 0; i < pagevec_count(&pvec); i++) {
+ struct page *page = pvec.pages[i];
+ lock_page(page);
+ spin_lock_irq(&mapping->tree_lock);
+ if (test_bit(PG_dirty, &page->flags)) {
+ printk(KERN_ERR"INO[%lu], i(%d)\n",
inode->i_ino, i);
+ radix_tree_tag_clear(&mapping->page_tree,
+ page_index(page),
+ PAGECACHE_TAG_DIRTY);
+ spin_unlock_irq(&mapping->tree_lock);
+ clear_page_dirty_for_io(page);
+ goto unlock_page;
+ }
+ spin_unlock_irq(&mapping->tree_lock);
+unlock_page:
+ unlock_page(page);
+ }
+ pagevec_release(&pvec);
+ }
+}
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
@@ -2419,6 +2451,14 @@ retry:
ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
"%ld pages, ino %lu; err %d", __func__,
wbc->nr_to_write, inode->i_ino, ret);
+
+ if(ret == -EROFS) {
+ ext4_msg(inode->i_sb, KERN_CRIT,"This
is probably result of corruption"
+ "that FS is marked RO
in between writepages\n");
+ /*Invalidate mapping if RO encountered
during writepages */
+ ext4_invalidate_mapping(mapping);
+ invalidate_inode_pages2(mapping);
+ }
blk_finish_plug(&plug);
goto out_writepages;
}

After making these changes, as soon as there is corruption â the FS is
marked RO and all pending DIRTY writes are cleared. So, the hang is
not seen.

But we have a doubt about the possible side-effects of this code. Can
you please provide your inputs on this? It will be of immense help.

Thanks & Regards,
Amit Sahrawat


On Tue, Apr 15, 2014 at 6:17 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> On Tue, Apr 15, 2014 at 01:24:26PM +0530, Amit Sahrawat wrote:
>> Initially in normal write path, when the disk was almost full â we got
>> hung for the âsyncâ because the flusher (which is busy in the
>> writepages is not responding). Before the hung task, we also found the
>> logs like:
>>
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1493, 0
>> clusters in bitmap, 58339 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1000, 0
>> clusters in bitmap, 3 in gd
>> EXT4-fs error (device sda1): ext4_mb_generate_buddy:742: group 1425, 0
>> clusters in bitmap, 1 in gd
>
> These errors indicate that the several block groups contain have
> corrupt block bitmap: block group #1493, #1000, and #1425. The fact
> that there are 0 free blocks/clusters in the bitmap indicate that the
> bitmap was all zero's, which could be the potential cause of the
> corruption.
>
> The other thing which is funny is the number of free blocks/clusters
> being greater than 32768 in block group #1493. Assuming a 4k page
> size, that shouldn't be possible. Can you send the output of
> "dumpe2fs -h /dev/sdXX" so we can take a look at the file system parameters?
>
> How much before the hung task did you see these messages? I normally
> recommend that the file system be set up to either panic the system,
> or force the file system to be remounted read/only when EXT4-fs error
> messages are reported, since that means that the file system is
> corrupted, and further operaion can cause more data to be lost.
>
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>> JBD2: Spotted dirty metadata buffer (dev = sda1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>>
>> EXT4-fs (sda1): error count: 58
>> EXT4-fs (sda1): initial error at 607: ext4_mb_generate_buddy:742
>> EXT4-fs (sda1): last error at 58: ext4_mb_generate_buddy:742
>
> The "607" and "58" in the "at 607" and "at 58" are normally supposed
> to be a unix time_t value. That is, it's normally a number like:
> 1397564866, and it can be decoded via:
>
> % date -d @1397564866
> Tue Apr 15 08:27:46 EDT 2014
>
> The fact that these numbers are numerically so small means that the
> time wasn't set correctly on your system. Was this a test system
> running under kvm without a proper real-time clock?
>
>> When we analysed the problem, it occurred from the writepages path in ext4.
>> This is because of the difference in the free blocks reported by
>> cluster bitmap and the number of free blocks reported by group
>> descriptor.
>
> Yes, indicating that the file system was corrupt.
>
>> During ext4_fill_super, ext4 calculates the number of free blocks by
>> reading all the descriptors in function ext4_count_free_clusters and
>> store it in percpu counter s_freeclusters_counter.
>> ext4_count_free_clusters:
>> desc_count = 0;
>> for (i = 0; i < ngroups; i++) {
>> gdp = ext4_get_group_desc(sb, i, NULL);
>> if (!gdp)
>> continue;
>> desc_count += ext4_free_group_clusters(sb, gdp);
>> }
>> return desc_count;
>>
>> During writebegin call, ext4 checks this s_freeclusters_counter
>> counter to know if there are free blocks present or not.
>> When the free blocks reported by group descriptor are greater than the
>> actual free blocks reported by bitmap, a call to writebegin could
>> still succeed even if the free blocks represented by bitmaps are 0.
>
> Yes. We used to have code that would optionally read every single
> bitmap, and verify that the descriptor counts match the values in the
> bitmap. However, that was expensive, and wasn't a full check of all
> possible file system inconsistencies that could lead to data loss. So
> we ultimately removed this code. If the file system is potentially
> corrupt, it is the system administrator's responsibility to force an
> fsck run to make sure the file system data structures are consistent.
>
>> When searching for the relevant problem which occurs in this path. We
>> got the patch-set from âDarrickâ which revolves around this problem.
>> ext4: error out if verifying the block bitmap fails
>> ext4: fix type declaration of ext4_validate_block_bitmap
>> ext4: mark block group as corrupt on block bitmap error
>> ext4: mark block group as corrupt on inode bitmap error
>> ext4: mark group corrupt on group descriptor checksum
>> ext4: don't count free clusters from a corrupt block group
>>
>> After adopting the patch-set and performing verification on the
>> similar setup, we ran âfsstressâ. But now it is resulting in hang at
>> different points.
>>
>> In the current logs we got:
>> EXT4-fs error (device sdb1): ext4_mb_generate_buddy:743: group 1,
>> 20480 clusters in bitmap, 25443 in gd; block bitmap corrupt.
>> JBD2: Spotted dirty metadata buffer (dev = sdb1, blocknr = 0). There's
>> a risk of filesystem corruption in case of system crash.
>
> OK, what version of the kernel are you using? The patches that you
> reference above have been in the upstream kernel since 3.12, so I'm
> assuming you're not using the latest upstream kernel, but rather an
> older kernel with some patches applied. Hmmm, skipping ahead:
>
>> Kernel Version: 3.8
>> Test command:
>> fsstress -p 10 -n 100 -l 100 -d /mnt/test_dir
>
> There is clearly either some kernel bug or hardware problem which is
> causing the file system corruption. Given that you are using a much
> older kernel, it's quite likely that there is some bug that has been
> fixed in a later version of the kernel (although we can't really rule
> out a hardware problem without know much more about your setup).
>
> Unfortunately, there has been a *large* number of changes since
> version 3.8, and I can't remember all of the changes and bug fixes
> that we might have made in the past year or more (v3.8 dates from
> March 2013).
>
> Something that might be helpful is for you to use xfstests. That's a
> much more thorough set of tests which we've been using so if you must
> use an antique version of the kernel, that will probably be a much
> better set of tests. It includes fsstress, and much more besides.
> More importantly, there are times when fixes are identified by the
> xfstest failure that has gotten fixed up in the commit logs. So that
> might help you find the bug fix that you need to backport.
>
> For your convenience, there is a simple test framework that makes it
> relatively easy to build and run xfstests under KVM. You can find it
> here:
>
> git://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git
>
> See the documentation found at:
>
> https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/README
>
> for more details.
>
> I hope this helps,
>
> - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/