Re: [PATCH] fs/ceph/mds_client: ignore responses for waiting requests
From: Max Kellermann
Date: Wed Mar 08 2023 - 10:17:34 EST
On Wed, Mar 8, 2023 at 4:42 AM Xiubo Li <xiubli@xxxxxxxxxx> wrote:
> How could this happen ?
>
> Since the req hasn't been submitted yet, how could it receive a reply
> normally ?
I have no idea. We frequently have problems with the MDS closing the
connection (once or twice a week), and sometimes this leads to the
WARNING, which leaves the server hanging. It looks like a timing
issue, but the MDS connection trouble itself is a separate problem.
My patch just attempts to address the WARNING; not knowing much about
Ceph internals, my idea was that even if the server sends bad reply
packets, the client shouldn't panic.
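For context, the idea is roughly the following (a minimal sketch, not
the actual diff I posted; using list_empty(&req->r_wait) as the
"still waiting" test is an assumption for illustration):

static void handle_reply(struct ceph_mds_session *session,
			 struct ceph_msg *msg)
{
	struct ceph_mds_client *mdsc = session->s_mdsc;
	u64 tid = le64_to_cpu(msg->hdr.tid);
	struct ceph_mds_request *req;

	mutex_lock(&mdsc->mutex);
	req = lookup_get_request(mdsc, tid);
	if (!req) {
		mutex_unlock(&mdsc->mutex);
		return;
	}

	if (!list_empty(&req->r_wait)) {
		/* reply for a request that was never sent to the MDS:
		 * log it and drop it instead of running into the
		 * WARNING further down */
		pr_warn("mdsc_handle_reply on waiting request tid %llu\n", tid);
		mutex_unlock(&mdsc->mutex);
		ceph_mdsc_put_request(req);
		return;
	}

	/* ... normal reply processing continues here ... */
}

The point is only the early bail-out: a reply for a request that was
never submitted gets logged and ignored rather than processed.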
> It should be a corrupted reply, and it led us to get an incorrect req,
> which hasn't been submitted yet.
>
> BTW, do you have the dump of the corrupted msg from 'ceph_msg_dump(msg)'?
Unfortunately not - we have already scrubbed the server that had this
problem and rebooted it with a fresh image including my patch. It
seems I don't have a full copy of the kernel log anymore.
Coincidentally, the patch prevented another kernel hang just a few
minutes ago:
Mar 08 15:48:53 sweb1 kernel: ceph: mds0 caps stale
Mar 08 15:49:13 sweb1 kernel: ceph: mds0 caps stale
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 caps went stale, renewing
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 caps stale
Mar 08 15:49:35 sweb1 kernel: libceph: mds0 (1)10.41.2.11:6801 socket error on write
Mar 08 15:49:35 sweb1 kernel: libceph: mds0 (1)10.41.2.11:6801 session reset
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 closed our session
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 reconnect start
Mar 08 15:49:36 sweb1 kernel: ceph: mds0 reconnect success
Mar 08 15:49:36 sweb1 kernel: ceph: dropping dirty+flushing Fx state for 0000000064778286 2199046848012
Mar 08 15:49:40 sweb1 kernel: ceph: mdsc_handle_reply on waiting request tid 1106187
Mar 08 15:49:53 sweb1 kernel: ceph: mds0 caps renewed
Since my patch is already in place, the kernel no longer treated the
unexpected reply as an error, and therefore didn't dump it...
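If it would help, that ignore path could also dump the message so the
next occurrence captures what you asked for; roughly (again only a
sketch, placement inside handle_reply() assumed):

	pr_warn("mdsc_handle_reply on waiting request tid %llu\n", tid);
	ceph_msg_dump(msg);	/* dump the unexpected reply before dropping it */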
If you need more information and have a patch with more logging, I
could easily boot those servers with your patch and post that data
next time it happens.
Max