Re: [BUG BISECT] NFSv4 client fails on Flush Journal to Persistent Storage

From: Krzysztof Kozlowski
Date: Fri Jun 15 2018 - 10:28:51 EST


On Fri, Jun 15, 2018 at 4:23 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>
>> On Jun 15, 2018, at 10:07 AM, Krzysztof Kozlowski <krzk@xxxxxxxxxx> wrote:
>>
>> On Fri, Jun 15, 2018 at 2:53 PM, Sudeep Holla <sudeep.holla@xxxxxxx> wrote:
>>> Hi,
>>>
>>> On Thu, Jun 7, 2018 at 12:19 PM, Krzysztof Kozlowski <krzk@xxxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> When booting my boards under recent linux-next, I see failures of systemd:
>>>>
>>>> [FAILED] Failed to start Flush Journal to Persistent Storage.
>>>> See 'systemctl status systemd-journal-flush.service' for details.
>>>> Starting Create Volatile Files and Directories...
>>>> [** ] A start job is running for Create V… [ 223.209289] nfs:
>>>> server 192.168.1.10 not responding, still trying
>>>> [ 223.209377] nfs: server 192.168.1.10 not responding, still trying
>>>>
>>>> Effectively the boards fail to boot. Example is here:
>>>> https://krzk.eu/#/builders/1/builds/2157
>>>>
>>>
>>> I too encountered the same issue.
>>>
>>>> This was bisected to:
>>>> commit 37ac86c3a76c113619b7d9afe0251bbfc04cb80a
>>>> Author: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> Date: Fri May 4 15:34:53 2018 -0400
>>>>
>>>> SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock
>>>>
>>>> alloc_slot is a transport-specific op, but initializing an rpc_rqst
>>>> is common to all transports. In addition, the only part of initial-
>>>> izing an rpc_rqst that needs serialization is getting a fresh XID.
>>>>
>>>> Move rpc_rqst initialization to common code in preparation for
>>>> adding a transport-specific alloc_slot to xprtrdma.
>>>>
>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> Signed-off-by: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
>>>>
>>>
>>> Unfortunately, I spent time bisecting independently without seeing this
>>> report and got the same culprit.
>>>
>>>>
>>>> Bisect log attached. Full configuration:
>>>> 1. exynos_defconfig
>>>> 2. ARMv7, octa-core, Exynos5422 and Exynos4412 (Odroid XU3, U3 and others)
>>>> 3. NFSv4 client (from Raspberry Pi)
>>>>
>>>
>>> Yes, the issue is seen only with the NFSv4 client and, I think, with the
>>> latest systemd. My Ubuntu 16.04 (32-bit FS) boots fine while 18.04 has the
>>> above issue. Passing nfsv3 on the kernel command line makes it work again.
>>
>> Thanks for the reply!
>>
>> I tested it on systemd versions 236 and 238... and it fails on both.
>> However, one board always passes: the Odroid HC1, with the same core
>> configuration as described before. Probably it has some different
>> software package on it.
>>
>>>> Let me know if you need any more information.
>>>>
>>>
>>> Also, I was observing this issue with Linus' master branch from
>>> the time the above patch was merged until today. The issue
>>> is no longer seen since this morning; however, I just enabled lockdep
>>> and got these messages.
>>
>> All recent linux-next trees fail. Today's Linus' tree (4c5e8fc62d6a ("Merge
>> tag 'linux-kselftest-4.18-rc1-2' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest"))
>> managed to get up on one board but got stuck on a different board with
>> the same issue.
>>
>> I am quite surprised that there is no response from the author of the
>> commit and this was just moved from next (while failing) to Linus'
>> tree... bringing the issue to mainline now.
>
> Sorry. This morning is the first time I've seen this report, which was
> not To: or Cc'd to me.

D'oh! That's my mistake. Apparently I forgot to put you on the CC list.
Sorry for that.


> Since I don't have access to this kind of hardware, I will have to ask
> for your help to perform basic troubleshooting.
>
> Can we start by capturing the network traffic that occurs while you
> reproduce the problem? Use tshark or tcpdump on your NFS server, filter
> on the IP of the client, and send me (or the list) the raw pcap file.

Sure, I'll send you the tcpdump capture without Cc-ing the list.
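
For reference, a capture along the lines requested above (run on the NFS
server, filtered on the client's IP, written as a raw pcap) could be scripted
roughly as below. This is only a minimal sketch, not the exact command used
for this report: the helper name build_capture_cmd, the interface name eth0,
the output file name, and the client address 192.168.1.20 are all
assumptions made for illustration.

```python
import shlex


def build_capture_cmd(client_ip, iface="eth0", outfile="nfs-client.pcap"):
    """Build a tcpdump argv that saves raw packets to/from client_ip.

    iface and outfile are hypothetical defaults; adjust them to the server.
    """
    return [
        "tcpdump",
        "-i", iface,        # interface on the NFS server (assumed name)
        "-s", "0",          # 0 selects tcpdump's default (large) snapshot
                            # length, so packets are not truncated
        "-w", outfile,      # write a raw pcap file instead of decoded text
        "host", client_ip,  # keep only traffic to or from the client
    ]


if __name__ == "__main__":
    # 192.168.1.20 is a placeholder client address, not taken from this thread.
    print(shlex.join(build_capture_cmd("192.168.1.20")))
```

The printed command would normally be run as root on the server, started
before the client boots and stopped with Ctrl-C once the "not responding"
messages appear.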

Best regards,
Krzysztof