NFS issue while copying huge file
From: Richard Weinberger
Date: Wed May 21 2014 - 17:33:40 EST
Hi!
My backup server runs CentOS6 with kernel 2.6.32-431.17.1.el6.x86_64.
Many servers copy theirs backups to it. This setups works perfectly
fine since ages.
Today I've configured a new server which runs Kernel 3.12.20 to store
it's backups to the NFS server too.
The backup file has around 400GiB, after 100GiB the client aborts the
copy job with EIO.
dmesg contains:
---cut---
RPC: AUTH_GSS upcall timed out.
Please check user daemon is running.
nfs: server 10.42.0.1 not responding, timed out
nfs: server 10.42.0.1 not responding, timed out
nfs: server 10.42.0.1 not responding, timed out
nfs: server 10.42.0.1 not responding, timed out
---cut---
Some times the client reports also a hung task too.
I'm using vers=4 and tcp.
(Tried also vers=3 with udp and other combinations)
And now the really strange part, on the NFS server the load increases up to 10.
I see lots of nfsd kernel threads consuming 100% CPU.
*sometimes* even the oom killer triggers on the NFS server.
Like:
---cut---
top invoked oom-killer: gfp_mask=0x800d0, order=0, oom_adj=0, oom_score_adj=0
top cpuset=/ mems_allowed=0
Pid: 1907, comm: top Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Call Trace:
[<ffffffff810d0211>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff811225c0>] ? dump_header+0x90/0x1b0
[<ffffffff8122761c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122a42>] ? oom_kill_process+0x82/0x2a0
[<ffffffff8112293e>] ? select_bad_process+0x9e/0x120
[<ffffffff81122e80>] ? out_of_memory+0x220/0x3c0
[<ffffffff8112f79f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff8116e082>] ? kmem_getpages+0x62/0x170
[<ffffffff8116ec9a>] ? fallback_alloc+0x1ba/0x270
[<ffffffff8116e6ef>] ? cache_grow+0x2cf/0x320
[<ffffffff8116ea19>] ? ____cache_alloc_node+0x99/0x160
[<ffffffff8116f99b>] ? kmem_cache_alloc+0x11b/0x190
[<ffffffff811f2910>] ? proc_alloc_inode+0x20/0xb0
[<ffffffff811a5747>] ? alloc_inode+0x27/0xa0
[<ffffffff811a683b>] ? new_inode+0x2b/0xb0
[<ffffffff811f45b4>] ? proc_pid_make_inode+0x14/0xb0
[<ffffffff811f5f11>] ? proc_pident_instantiate+0x21/0xc0
[<ffffffff811f610e>] ? proc_pident_lookup+0x8e/0xd0
[<ffffffff811f61aa>] ? proc_tgid_base_lookup+0x1a/0x20
[<ffffffff811986f5>] ? do_lookup+0x1a5/0x230
[<ffffffff81199014>] ? __link_path_walk+0x794/0xff0
[<ffffffff81199b2a>] ? path_walk+0x6a/0xe0
[<ffffffff81199d3b>] ? filename_lookup+0x6b/0xc0
[<ffffffff81226076>] ? security_file_alloc+0x16/0x20
[<ffffffff8119b214>] ? do_filp_open+0x104/0xd20
[<ffffffff81060aa3>] ? perf_event_task_sched_out+0x33/0x70
[<ffffffff812824f5>] ? _atomic_dec_and_lock+0x55/0x80
[<ffffffff811a7ea2>] ? alloc_fd+0x92/0x160
[<ffffffff811859e9>] ? do_sys_open+0x69/0x140
[<ffffffff8100c715>] ? math_state_restore+0x45/0x60
[<ffffffff81185b00>] ? sys_open+0x20/0x30
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 30
active_anon:2091 inactive_anon:56 isolated_anon:0
active_file:208587 inactive_file:215722 isolated_file:32
unevictable:0 dirty:213512 writeback:481 unstable:0
free:13225 slab_reclaimable:14428 slab_unreclaimable:10804
mapped:299 shmem:63 pagetables:444 bounce:0
Node 0 DMA free:8220kB min:336kB low:420kB high:504kB active_anon:0kB
inactive_anon:0kB active_file:0kB inactive_file:7128kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15276kB mlocked:0kB
dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:332kB
slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:42272 all_unreclaimable?
yes
lowmem_reserve[]: 0 1976 1976 1976
Node 0 DMA32 free:44680kB min:44716kB low:55892kB high:67072kB
active_anon:8364kB inactive_anon:224kB active_file:834348kB
inactive_file:855676kB unevictable:0kB isolated(anon):0kB
isolated(file):256kB present:2024312kB mlocked:0kB dirty:854048kB
writeback:1924kB mapped:1196kB shmem:252kB slab_reclaimable:57380kB
slab_unreclaimable:43216kB kernel_stack:1384kB pagetables:1776kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:9952
all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 5*4kB 3*8kB 3*16kB 4*32kB 1*64kB 2*128kB 0*256kB 1*512kB
3*1024kB 2*2048kB 0*4096kB = 8220kB
Node 0 DMA32: 346*4kB 106*8kB 225*16kB 120*32kB 55*64kB 14*128kB
10*256kB 5*512kB 2*1024kB 1*2048kB 5*4096kB = 44680kB
424434 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
522224 pages RAM
48677 pages reserved
426546 pages shared
31791 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 472] 0 472 2771 274 1 -17 -1000 udevd
[ 1272] 0 1272 6910 148 1 -17 -1000 auditd
[ 1312] 32 1312 4744 137 1 0 0 rpcbind
[ 1338] 29 1338 5837 196 0 0 0 rpc.statd
[ 1350] 0 1350 1173 101 0 0 0 mdadm
[ 1423] 0 1423 6996 254 1 0 0 rpc.mountd
[ 1459] 0 1459 6291 148 0 0 0 rpc.idmapd
[ 1471] 0 1471 126650 252 0 -17 -1000 tgtd
[ 1473] 0 1473 3722 90 0 -17 -1000 tgtd
[ 1557] 0 1557 16652 276 1 -17 -1000 sshd
[ 1565] 38 1565 7680 318 0 0 0 ntpd
[ 1606] 0 1606 29324 235 1 0 0 crond
[ 1621] 0 1621 1016 74 0 0 0 mingetty
[ 1623] 0 1623 1016 74 1 0 0 mingetty
[ 1625] 0 1625 1016 73 0 0 0 mingetty
[ 1627] 0 1627 1016 73 0 0 0 mingetty
[ 1629] 0 1629 1016 73 1 0 0 mingetty
[ 1631] 0 1631 1016 74 1 0 0 mingetty
[ 1638] 0 1638 2770 264 0 -17 -1000 udevd
[ 1639] 0 1639 2770 264 0 -17 -1000 udevd
[ 1889] 0 1889 25090 389 0 0 0 sshd
[ 1891] 500 1891 25090 368 1 0 0 sshd
[ 1892] 500 1892 27076 184 1 0 0 bash
[ 1907] 500 1907 3757 219 0 0 0 top
Out of memory: Kill process 1312 (rpcbind) score 1 or sacrifice child
Killed process 1312, UID 32, (rpcbind) total-vm:18976kB,
anon-rss:248kB, file-rss:300kB
---cut---
or
---cut---
collectd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0,
oom_score_adj=0
collectd cpuset=/ mems_allowed=0
Pid: 1596, comm: collectd Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Call Trace:
[<ffffffff810d0211>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff811225c0>] ? dump_header+0x90/0x1b0
[<ffffffff8122761c>] ? security_real_capable_noaudit+0x3c/0x70
[<ffffffff81122a42>] ? oom_kill_process+0x82/0x2a0
[<ffffffff81122981>] ? select_bad_process+0xe1/0x120
[<ffffffff81122e80>] ? out_of_memory+0x220/0x3c0
[<ffffffff81122438>] ? try_set_zonelist_oom+0xa8/0xe0
[<ffffffff8112f79f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff8116769a>] ? alloc_pages_current+0xaa/0x110
[<ffffffff8111f9b7>] ? __page_cache_alloc+0x87/0x90
[<ffffffff8111f39e>] ? find_get_page+0x1e/0xa0
[<ffffffff81120957>] ? filemap_fault+0x1a7/0x500
[<ffffffff811bea76>] ? __set_page_dirty_buffers+0x46/0xc0
[<ffffffff81149c84>] ? __do_fault+0x54/0x530
[<ffffffff8114a257>] ? handle_pte_fault+0xf7/0xb00
[<ffffffffa0072360>] ? ext4_get_block+0x0/0x120 [ext4]
[<ffffffff8114ae8a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff811da055>] ? fcntl_setlk+0x75/0x320
[<ffffffff8152d73e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152aaf5>] ? page_fault+0x25/0x30
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 177
CPU 1: hi: 186, btch: 31 usd: 40
active_anon:3664 inactive_anon:56 isolated_anon:0
active_file:204351 inactive_file:217741 isolated_file:64
unevictable:8 dirty:215292 writeback:777 unstable:0
free:13232 slab_reclaimable:14527 slab_unreclaimable:10875
mapped:631 shmem:63 pagetables:595 bounce:0
Node 0 DMA free:8220kB min:336kB low:420kB high:504kB active_anon:0kB
inactive_anon:0kB active_file:284kB inactive_file:6844kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15276kB
mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
slab_reclaimable:332kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:59200 all_unreclaimable? yes
lowmem_reserve[]: 0 1976 1976 1976
Node 0 DMA32 free:44708kB min:44716kB low:55892kB high:67072kB
active_anon:14656kB inactive_anon:224kB active_file:817120kB
inactive_file:864120kB unevictable:32kB isolated(anon):0kB
isolated(file):256kB present:2024312kB mlocked:32kB dirty:861168kB
writeback:3108kB mapped:2524kB shmem:252kB slab_reclaimable:57776kB
slab_unreclaimable:43500kB kernel_stack:1552kB pagetables:2380kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:359776
all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 5*4kB 3*8kB 3*16kB 4*32kB 1*64kB 2*128kB 2*256kB 2*512kB
2*1024kB 2*2048kB 0*4096kB = 8220kB
Node 0 DMA32: 177*4kB 378*8kB 85*16kB 148*32kB 11*64kB 1*128kB 1*256kB
2*512kB 2*1024kB 1*2048kB 7*4096kB = 44708kB
422228 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
522224 pages RAM
48677 pages reserved
426430 pages shared
33618 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 472] 0 472 2771 278 1 -17 -1000 udevd
[ 1272] 0 1272 6910 175 1 -17 -1000 auditd
[ 1289] 0 1289 63855 312 0 0 0 rsyslogd
[ 1312] 32 1312 4744 173 0 0 0 rpcbind
[ 1338] 29 1338 5837 261 0 0 0 rpc.statd
[ 1350] 0 1350 1173 135 0 0 0 mdadm
[ 1423] 0 1423 6996 324 0 0 0 rpc.mountd
[ 1459] 0 1459 6291 170 0 0 0 rpc.idmapd
[ 1471] 0 1471 126650 294 0 -17 -1000 tgtd
[ 1473] 0 1473 3722 109 0 -17 -1000 tgtd
[ 1557] 0 1557 16652 279 1 -17 -1000 sshd
[ 1565] 38 1565 7680 383 0 0 0 ntpd
[ 1582] 0 1582 1530 79 1 0 0 collectdmon
[ 1583] 0 1583 271094 1445 1 0 0 collectd
[ 1606] 0 1606 29324 288 0 0 0 crond
[ 1621] 0 1621 1016 140 0 0 0 mingetty
[ 1623] 0 1623 1016 140 1 0 0 mingetty
[ 1625] 0 1625 1016 139 0 0 0 mingetty
[ 1627] 0 1627 1016 139 0 0 0 mingetty
[ 1629] 0 1629 1016 139 1 0 0 mingetty
[ 1631] 0 1631 1016 140 1 0 0 mingetty
[ 1638] 0 1638 2770 266 0 -17 -1000 udevd
[ 1639] 0 1639 2770 266 0 -17 -1000 udevd
[ 1640] 0 1640 25090 467 0 0 0 sshd
[ 1642] 500 1642 25090 384 0 0 0 sshd
[ 1643] 500 1643 27076 366 1 0 0 bash
[ 1659] 500 1659 36362 279 1 0 0 su
[ 1660] 0 1660 27107 424 1 0 0 bash
[ 1705] 0 1705 3757 329 0 0 0 top
Out of memory: Kill process 1289 (rsyslogd) score 1 or sacrifice child
Killed process 1289, UID 0, (rsyslogd) total-vm:255420kB,
anon-rss:580kB, file-rss:668kB
---cut---
If I use the CentOS6 kernel (2.6.32.x) on the NFS client too,
everything works fine.
No timeout, no oom killer and the load on the NFS server is all the
time below 1.
Looks like 3.12.x can DoS NFS servers. ;-)
Do you have any ideas?
I can reproduce the issue within an hour...
--
Thanks,
//richard
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/