Process waiting on NFS transitions to uninterruptible sleep when receiving a signal with a custom signal handler

From: Stephan
Date: Mon Oct 28 2019 - 04:09:54 EST

Hello everyone,

I asked this question on Stack Overflow a while ago, but
unfortunately nobody had an answer.

I am currently doing some research on how we can extend the monitoring
solution for Linux in our datacenter in order to detect inaccessible
NFS mounts. My idea was to look for NFS mounts in /proc/self/mountinfo
and then, for each mount, call alarm(), issue a synchronous
interruptible call via stat()/fsstat() or similar, and, if the alarm
fires, report an error from the signal handler. However, I experienced
the following behaviour, which I am not sure how to explain or debug.
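A minimal sketch of such a check might look like the following. The names nfs_mount_point() and probe_path() are made up for illustration, the handler is installed without SA_RESTART so that an interrupted stat() would be expected to fail with EINTR, and mount points containing escaped characters (such as \040 for a space) are not decoded. Whether stat() actually returns on a hung NFS mount is exactly the open question of this mail.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 and copies the mount point into mnt if the given
 * /proc/self/mountinfo line describes an nfs or nfs4 mount, else 0. */
int nfs_mount_point(const char *line, char *mnt, size_t mntlen)
{
    char point[4096];
    char fstype[64];
    const char *sep;

    assert(line != NULL && mnt != NULL && mntlen > 0);

    /* Field 5 is the mount point; skip ID, parent ID, major:minor, root. */
    if (sscanf(line, "%*d %*d %*s %*s %4095s", point) != 1)
        return 0;
    /* The filesystem type follows the " - " separator. */
    sep = strstr(line, " - ");
    if (sep == NULL || sscanf(sep + 3, "%63s", fstype) != 1)
        return 0;
    if (strcmp(fstype, "nfs") != 0 && strcmp(fstype, "nfs4") != 0)
        return 0;
    snprintf(mnt, mntlen, "%s", point);
    return 1;
}

/* Empty handler: its only purpose is to interrupt the blocking call. */
static void on_alarm(int sig)
{
    (void)sig;
}

/* Calls stat() on path with an alarm() timeout (hypothetical helper).
 * Returns the result of stat(); on timeout, -1 with errno == EINTR is
 * the expected outcome if the call is interruptible. */
int probe_path(const char *path, unsigned timeout_sec)
{
    struct sigaction sa, old;
    struct stat st;
    int ret, saved;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                 /* deliberately no SA_RESTART */
    if (sigaction(SIGALRM, &sa, &old) != 0)
        return -1;

    alarm(timeout_sec);
    ret = stat(path, &st);
    saved = errno;
    alarm(0);                        /* cancel a still-pending alarm */
    sigaction(SIGALRM, &old, NULL);  /* restore the previous handler */
    errno = saved;
    return ret;
}
```

A caller would read /proc/self/mountinfo line by line, pass each line to nfs_mount_point(), and call probe_path() on every mount point it reports.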

It turned out that when a process is waiting in the stat() system call on
a mountpoint of a disconnected NFS server, it responds to signals as
expected as long as no custom signal handler is installed. For example,
one can terminate it by pressing Ctrl+C, or it displays "Alarm clock"
and ends when the alarm timer fires. The same applies e.g. to
SIGUSR1/2, which lead the program to display "User defined signal 1"
(or "2") and end. I suspect these messages come from a general
signal dispatcher inside glibc, but it would be nice to hear some
details on how this works.

In all cases in which a custom signal handler was registered, the
process transitions to an uninterruptible sleep state as soon as a
signal for this custom handler is delivered, and no further signals
are processed after that. Of course this applies to SIGALRM as well
when the alarm() timer sends the signal. All signals sent afterwards
show up as pending in /proc/PID/status, as below:

Threads: 1
SigQ: 4/31339
SigPnd: 0000000000000000
ShdPnd: 0000000000002a02
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000200
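For reading these masks: bit N-1 of each hexadecimal value corresponds to signal number N. A small sketch of a decoder follows; the function name decode_sigmask() is made up for illustration. With the x86-64 Linux signal numbering, ShdPnd 2a02 decodes to signals 2, 10, 12 and 14 (SIGINT, SIGUSR1, SIGUSR2, SIGALRM), and SigCgt 200 to signal 10 (SIGUSR1).

```c
#include <assert.h>
#include <stdlib.h>

/* Decodes a hexadecimal signal mask as shown in /proc/PID/status.
 * Stores the signal numbers whose bits are set into out (at most max
 * entries) and returns how many were stored. */
int decode_sigmask(const char *hex, int *out, int max)
{
    unsigned long long mask;
    int n = 0;

    assert(hex != NULL && out != NULL);

    mask = strtoull(hex, NULL, 16);
    for (int sig = 1; sig <= 64 && n < max; sig++) {
        /* Bit (sig - 1) represents signal number sig. */
        if (mask & (1ULL << (sig - 1)))
            out[n++] = sig;
    }
    return n;
}
```

Passing the ShdPnd string from above fills the output array with the four pending signal numbers.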

I looked at the information from "echo w > /proc/sysrq-trigger" but
there is nothing of help to me:

[26099350.815187] signal D 0000000000000000 0 49633 39989 0x00000084
[26099350.815193] ffff880001d27b88 0000000000000046 ffff88008a8184c0
[26099350.815199] ffff880001d27c18 ffffffff81e0a168 ffffffffa03d1df0
[26099350.815204] ffff880001d27ba0 ffffffff81619dd5 ffff88008a8184c0
[26099350.815209] Call Trace:
[26099350.815213] [<ffffffff81619dd5>] schedule+0x35/0x80
[26099350.815223] [<ffffffffa03d1e0e>] rpc_wait_bit_killable+0x1e/0xa0 [sunrpc]
[26099350.815227] [<ffffffff8161a1ea>] __wait_on_bit+0x5a/0x90
[26099350.815231] [<ffffffff8161a32e>] out_of_line_wait_on_bit+0x6e/0x80
[26099350.815242] [<ffffffffa03d2e7e>] __rpc_execute+0x14e/0x450 [sunrpc]
[26099350.815251] [<ffffffffa03ca089>] rpc_run_task+0x69/0x80 [sunrpc]
[26099350.815259] [<ffffffffa06dd166>] nfs4_call_sync_sequence+0x56/0x80 [nfsv4]
[26099350.815267] [<ffffffffa06ddc90>] _nfs4_proc_getattr+0xb0/0xc0 [nfsv4]
[26099350.815279] [<ffffffffa06e7c83>] nfs4_proc_getattr+0x53/0xd0 [nfsv4]
[26099350.815288] [<ffffffffa06a37c4>] __nfs_revalidate_inode+0x94/0x2a0 [nfs]
[26099350.815296] [<ffffffffa06a3d7e>] nfs_getattr+0x7e/0x250 [nfs]
[26099350.815303] [<ffffffff8121455a>] vfs_fstatat+0x5a/0x90
[26099350.815306] [<ffffffff812149ca>] SYSC_newstat+0x1a/0x40
[26099350.815312] [<ffffffff8161de61>] entry_SYSCALL_64_fastpath+0x20/0xe9
[26099350.817782] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x20/0xe9

It is also not possible to inspect anything in user mode, because a
debugger cannot be attached to the process.

The development happened on SLES 12 SP4, Kernel version 4.4.162-94.72-default.

I am attaching some C and bash code for reproduction. The issue can be
triggered with SIGUSR1 (kill -USR1 PID), or with any other signal after
changing the code. In C, it makes no difference whether signal() or
sigaction() is used to install the handler. The handlers are
deliberately left empty to be sure there is no "forbidden" function
called inside.




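#include <sys/stat.h>
#include <signal.h>

/* Deliberately empty, so no "forbidden" function is called inside. */
void sig_handler(int sig)
{
}

int main(void) {
    int ret;
    struct stat buf;

    signal(SIGUSR1, sig_handler);

    /* Blocks here while the NFS server is unreachable. */
    ret = stat("/a", &buf);

    return 0;
}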


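#!/bin/bash

# Handler body is kept trivial on purpose.
sighandler() {
    declare unused
}

trap sighandler USR1

# The -d test issues the stat() call on the mountpoint.
[[ -d /a ]] && echo "stat() returned"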