Re: NFS trace new to 5.13.0 (GA)

From: Chuck Lever III
Date: Tue Jun 29 2021 - 14:38:35 EST


Hi Mike-

> On Jun 29, 2021, at 2:28 PM, Marciniszyn, Mike <mike.marciniszyn@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> During our continuous integration testing on 5.13.0 kernel our testing trips on NFS testing with the following trace on the client:
>
> [32936.156848] INFO: task kworker/9:1:519 blocked for more than 122 seconds.
> [32936.165201] Tainted: G S 5.13.0 #1
> [32936.171562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [32936.180773] task:kworker/9:1 state:D stack: 0 pid: 519 ppid: 2 flags:0x00004000
> [32936.190565] Workqueue: events xprt_destroy_cb [sunrpc]
> [32936.196854] Call Trace:
> [32936.200107] __schedule+0x38e/0x8b0
> [32936.204482] schedule+0x3c/0xa0
> [32936.208464] schedule_timeout+0x215/0x2b0
> [32936.213401] ? check_preempt_curr+0x3f/0x70
> [32936.218518] ? ttwu_do_wakeup+0x17/0x140
> [32936.223336] wait_for_completion+0x98/0xf0
> [32936.228396] __flush_work+0x128/0x1e0
> [32936.232942] ? worker_attach_to_pool+0xb0/0xb0
> [32936.238351] ? work_busy+0x80/0x80
> [32936.242555] __cancel_work_timer+0x110/0x1a0
> [32936.247726] ? xprt_rdma_bc_destroy+0xc6/0xe0 [rpcrdma]
> [32936.254034] xprt_rdma_destroy+0x15/0x50 [rpcrdma]
> [32936.259873] process_one_work+0x1cb/0x360
> [32936.264788] ? process_one_work+0x360/0x360
> [32936.269915] worker_thread+0x30/0x370
> [32936.274436] ? process_one_work+0x360/0x360
> [32936.279526] kthread+0x116/0x130
> [32936.283534] ? __kthread_cancel_work+0x40/0x40
> [32936.288924] ret_from_fork+0x22/0x30
>
> The same tests and same servers see no such issue from rc4 to rc7, so the failure seems new.
>
> Any thoughts?
>
> I'm currently rerunning rc7 just to be sure.

The NFS server in v5.13 is afflicted by a late-breaking bug-fix
to the alloc_pages_bulk_array() API. It's been fixed in Linus'
tree, but that tree is otherwise unstable for me.

Have a look at commit 66d9282523b for a one-liner fix, it should
apply cleanly to v5.13.

--
Chuck Lever