Re: NFS / FuseFS Kernel Hangs Bug

From: sascha a.
Date: Thu Oct 01 2015 - 10:45:55 EST


Hello,

Okay, I was wrong about FUSE and NFS; thanks for the hint.

About the problem:
Without digging deep into the kernel sources, your explanation is
more or less what I thought was happening.
Anyway, the reason I am reporting the problem is that during these 120
seconds (until the kernel resolves the issue by killing (?) the
process) the system is unusable.

What I mean by that:
It is not even possible to ssh into the server, even though /root and
/home are local and should not be affected by the slow NFS servers.
It also seems that a lot of network connections drop or freeze (?)
during this period.
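
To check whether local paths really stop answering while the NFS mount
is stuck, a small probe along these lines could be run during the next
hang. This is only a sketch: probe() is a made-up helper, the 5 second
timeout is arbitrary, and /dest is simply the mount point from my
original mail. The stat() call runs in a helper thread so the caller
does not itself end up waiting on the server.

```python
import os
import threading
import time


def probe(path, timeout=5.0):
    """Return the seconds os.stat(path) took, or None if it did not
    answer within `timeout`. An OSError (e.g. missing path) still
    counts as an answer, since the filesystem did respond."""
    result = {}

    def worker():
        start = time.monotonic()
        try:
            os.stat(path)
        except OSError:
            pass
        result["elapsed"] = time.monotonic() - start

    # Daemon thread: a worker stuck on a hung mount must not keep the
    # interpreter from exiting.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    return result.get("elapsed")


if __name__ == "__main__":
    # /root and /home are local here; /dest is the NFS mount (assumed).
    for path in ("/root", "/home", "/dest"):
        elapsed = probe(path)
        if elapsed is None:
            print(f"{path}: no answer within timeout")
        else:
            print(f"{path}: answered in {elapsed:.3f}s")
```

If /root and /home answer quickly while /dest times out, the local
filesystems are fine and the ssh problem has to come from somewhere
else.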

You're completely right when you say there is no other way and that it
is by design to wait for the NFS response. But from my point of view
this 'wait' is happening at the wrong privilege level. If I'm not wrong,
the current implementation blocks/hangs tasks in kernel space, or at
least blocks the scheduler during this period.

2015-10-01 16:24 GMT+02:00 Austin S Hemmelgarn <ahferroin7@xxxxxxxxx>:
> On 2015-10-01 09:06, sascha a. wrote:
>>
>> Hello,
>>
>>
>> I want to report a Bug with NFS / FuseFS.
>>
>> There's trouble with mounting an NFS FS with FuseFS if the NFS server
>> is responding slowly.
>>
>> The problem occurs if you mount an NFS FS with the FuseFS driver, for
>> example with this command:
>>
>> mount -t nfs -o vers=3,nfsvers=3,hard,intr,tcp server /dest
>>
>> Working on this nfs overlay works like a charm, as long as the NFS
>> Server is not under heavy load. If it gets under HEAVY load from time
>> to time the kernel hangs (which should in my opinion never ever
>> occur).
>
> OK, before I start on an explanation of why what is happening is happening,
> I should note that unless you're using some special FUSE driver instead of
> the regular NFS tools, you're not using FUSE to mount the NFS share, you're
> using a regular kernel driver.
>
> Now, on to the explanation:
> This behavior is expected and unavoidable for any network filesystem under
> the described conditions. Sync (or any other command that causes access to
> the filesystem that isn't served by the local cache) requires sending a
> command to the server. Sync in particular is _synchronous_ (and it should
> be, otherwise you break the implied data safety from using it), which means
> that it will wait until it gets a reply from the server before it returns,
> which means that if the server is heavily loaded (or just ridiculously
> slow), it will be a while before it returns. On top of this, depending on
> how the server is caching data, it may take a long time to return even on a
> really fast server with no other load.
>
> The stacktrace you posted indicates simply that the kernel noticed that
> 'sync' was in an I/O sleep state (the 'D state' it refers to) for more than
> 120 seconds, which is the default detection timeout for this.
>
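
To see which tasks are sitting in D state during such a hang, and what
the detection timeout is set to, something like the following could be
used. It is a rough sketch, not a finished tool: the function names are
my own, it only reads /proc/<pid>/stat for each process (not every
thread), and /proc/sys/kernel/hung_task_timeout_secs only exists when
the kernel was built with hung task detection.

```python
import os

HUNG_TASK_TIMEOUT = "/proc/sys/kernel/hung_task_timeout_secs"


def parse_stat(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # ')', so split on the last ')' instead of on whitespace.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep ('D')."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the stat file was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


def hung_task_timeout(path=HUNG_TASK_TIMEOUT):
    """Return the detection timeout in seconds, or None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("hung task timeout (s):", hung_task_timeout())
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Run during a hang, this should list 'sync' (and anything else waiting
on the NFS server) if the explanation above applies.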