Re: [RFC PATCH] ext4: increase the protection of drop nlink and ext4 inode destroy

From: zhangyi (F)
Date: Sun Jan 15 2017 - 22:25:22 EST



on 2017/1/11 23:34, Theodore Ts'o wrote:
> On Wed, Jan 11, 2017 at 05:07:29PM +0800, zhangyi (F) wrote:
>>
>> (1) The file we want to unlink have many hard links, but only one dcache entry in memory.
>> (2) open this file, but it's inode->i_nlink read from disk was 1 (too low).
>> (3) some one call rename and drop it's i_nlink to zero.
>> (4) it's inode is still in use and do not destroy (not closed), at the same time,
>> some others open it's hard link and create a dcache entry.
>> (5) call rename again and it's i_nlink will still underflow and cause memory corruption.
>
> Do you have reproducers that make it easy to reproduce situations like
> this? (It shouldn't be hard to write, but if you have them already
> will save me some effort. :-)
>

I made a reproducer. The following steps reproduce this problem easily:
1) mount an ext4 file system, and create 3 files and 1 hard link,

#mount /dev/sdax /mnt
#cd /mnt
#touch old_file1 old_file2 new_file
#ln new_file new_link1

2) umount the file system and use debugfs to change new_file's
links_count value to 1, which simulates the fs inconsistency,

#umount /mnt
#debugfs /dev/sdax -w
set_inode_field new_file links_count 1

3) mount the fs again, and then execute the following program (Note:
do not run ls first, because it would create the second dcache entry),

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define RENAME_OLD_FILE_1 "old_file1"
#define RENAME_OLD_FILE_2 "old_file2"
#define RENAME_NEW_FILE "new_file"
#define NEW_FILE_LINK_1 "new_link1"

int main(int argc, char *argv[])
{
	int fd = 0;
	int err = 0;

	/* hold the inode open so it stays in use after i_nlink drops to zero */
	fd = open(RENAME_NEW_FILE, O_RDONLY);
	if (fd < 0) {
		printf("open error:%d\n", errno);
		return -1;
	}

	/* first rename over new_file: i_nlink goes from 1 to 0 */
	err = rename(RENAME_OLD_FILE_1, RENAME_NEW_FILE);
	if (err < 0) {
		printf("rename error:%d\n", errno);
		close(fd);
		return -1;
	}

	/* second rename over the other hard link: i_nlink is dropped again */
	err = rename(RENAME_OLD_FILE_2, NEW_FILE_LINK_1);
	if (err < 0) {
		printf("rename error:%d\n", errno);
		close(fd);
		return -1;
	}

	close(fd);
	return 0;
}
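
To watch the link count while the program above runs, a small helper along
these lines could be added. This is only a sketch: fd_link_count() is a
hypothetical name, not part of the original reproducer. It uses fstat() on
the already-open descriptor, so it does no path lookup of its own and should
not add another dcache entry.

```c
#include <sys/stat.h>

/*
 * Hypothetical helper (not in the original reproducer): return the link
 * count of the inode behind an open file descriptor, or -1 on error.
 */
static long long fd_link_count(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	return (long long)st.st_nlink;
}
```

Calling printf("nlink=%lld\n", fd_link_count(fd)); after the open() and after
each rename() would show how the in-memory count changes.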

4) after this, new_file's inode->i_nlink has underflowed and the inode has
been added to the orphan list. The kernel dump looks like this:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 1814 at fs/inode.c:282 drop_nlink+0x3e/0x50
...
Call Trace:
dump_stack+0x63/0x86
__warn+0xcb/0xf0
warn_slowpath_null+0x1d/0x20
drop_nlink+0x3e/0x50
ext4_rename+0x532/0x8c0
ext4_rename2+0x1d/0x30
vfs_rename+0x728/0x940
? __lookup_hash+0x20/0xa0
SyS_rename+0x3ba/0x3e0
entry_SYSCALL_64_fastpath+0x1a/0xa9
...
---[ end trace b157dacbc891e6e8 ]---
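
The arithmetic behind the warning above can be sketched in userspace. This is
a simplified model under stated assumptions, not the kernel code:
struct model_inode, model_drop_nlink() and model_replay() are hypothetical
names, and only the unsigned link counter is modeled (no orphan list, no
locking).

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical stand-in for the in-memory inode; only the link count is modeled. */
struct model_inode {
	unsigned int i_nlink;
};

/*
 * Userspace approximation of drop_nlink(): report when the count is already
 * zero (the point where the kernel warns), then decrement anyway.
 * Returns 1 if the warning condition was hit, 0 otherwise.
 */
static int model_drop_nlink(struct model_inode *inode)
{
	int warned = (inode->i_nlink == 0);

	if (warned)
		fprintf(stderr, "WARNING: drop_nlink with i_nlink == 0\n");
	inode->i_nlink--;	/* unsigned arithmetic wraps to UINT_MAX */
	return warned;
}

/* Replay the two renames from the reproducer; return the final link count. */
static unsigned int model_replay(unsigned int on_disk_links)
{
	struct model_inode inode = { .i_nlink = on_disk_links };

	model_drop_nlink(&inode);	/* rename(old_file1, new_file) */
	model_drop_nlink(&inode);	/* rename(old_file2, new_link1) */
	return inode.i_nlink;
}
```

With the correct on-disk value of 2, model_replay(2) ends at 0. With the
corrupted value of 1, model_replay(1) hits the warning on the second drop and
ends at UINT_MAX.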

5) then we trigger memory shrinking; the inode is destroyed while it is
still on the orphan list,

#echo 3 > /proc/sys/vm/drop_caches

kernel dump:

EXT4-fs (sdb1): Inode 16 (ffff98f4b3285c20): orphan list check failed!
...
ffff98f4b3285d30: fa87e800 ffff98f4 b3285e80 ffff98f4 .........^(.....
ffff98f4b3285d40: b20829d8 ffff98f4 00000010 00000000 .)..............
ffff98f4b3285d50: ffffffff 00000000 00000000 00000000 ................
...
Call Trace:
dump_stack+0x63/0x86
ext4_destroy_inode+0xa0/0xb0
destroy_inode+0x3b/0x60
evict+0x130/0x1c0
dispose_list+0x4d/0x70
prune_icache_sb+0x5a/0x80
super_cache_scan+0x14b/0x1a0
shrink_slab.part.40+0x1f5/0x420
shrink_slab+0x29/0x30
drop_slab_node+0x31/0x60
drop_slab+0x3f/0x70
drop_caches_sysctl_handler+0x71/0xc0
proc_sys_call_handler+0xea/0x110
proc_sys_write+0x14/0x20
__vfs_write+0x37/0x160
? selinux_file_permission+0xd7/0x110
? security_file_permission+0x3b/0xc0
vfs_write+0xb5/0x1a0
SyS_write+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xa9
...
bash (1594): drop_caches: 3

6) Some time later, if we change the orphan list, it will cause memory corruption.

Thanks.

zhangyi