Re: [PATCH] NLM: Fix "kernel BUG at fs/lockd/host.c:417!" or".../host.c:283!"

From: J. Bruce Fields
Date: Mon Jan 24 2011 - 17:40:41 EST


On Mon, Jan 24, 2011 at 03:50:26PM -0500, Chuck Lever wrote:
> Nick Bowler <nbowler@xxxxxxxxxxxxxxxx> reports:
>
> > We were just having some NFS server troubles, and my client machine
> > running 2.6.38-rc1+ (specifically, commit 2b1caf6ed7b888c95) crashed
> > hard (syslog output appended to this mail).
> >
> > I'm not sure what the exact timeline was or how to reproduce this,
> > but the server was rebooted during all this. Since I've never seen
> > this happen before, it is possibly a regression from previous kernel
> > releases. However, I recently updated my nfs-utils (on the client) to
> > version 1.2.3, so that might be related as well.
>
> [ BUG output redacted ]
>
> When done searching, the for_each_host loop in next_host_state() falls
> through and returns the final host on the host chain without bumping
> it's reference count.
>
> Since the host's ref count is only one at that point, releasing the
> host in nlm_host_rebooted() attempts to destroy the host prematurely,
> and therefore hits a BUG().
>
> Likely, the original intent of the for_each_host behavior in
> next_host_state() was to handle the case when the host chain is empty.
> Searching the chain and finding no suitable host to return needs to be
> handled as well.
>
> Defensively restructure next_host_state() always to return NULL when
> the loop falls through.
>
> Introduced by commit b10e30f6 "lockd: reorganize nlm_host_rebooted".

Whoops, thanks for finding that. The fix looks right to me.

--b.

>
> Cc: J. Bruce Fields <bfields@xxxxxxxxxxxx>
> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> ---
>
> I was able to reproduce this BUG on my client running 2.6.38-rc2
> from earlier today. Here is a proposed fix. I can't reproduce the
> BUG with this patch applied.
>
> However, my reproducer hit the BUG in nlmsvc_release_host(). Nick hit
> roughly the same BUG in nlmclnt_release_host(), which suggests his
> "client" was also acting as a server. The symptoms are similar enough
> that I believe this patch should be sufficient to address both cases.
>
>
> fs/lockd/host.c | 9 +++++----
> 1 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/lockd/host.c b/fs/lockd/host.c
> index 5f1bcb2..b7c99bf 100644
> --- a/fs/lockd/host.c
> +++ b/fs/lockd/host.c
> @@ -520,7 +520,7 @@ static struct nlm_host *next_host_state(struct hlist_head *cache,
> struct nsm_handle *nsm,
> const struct nlm_reboot *info)
> {
> - struct nlm_host *host = NULL;
> + struct nlm_host *host;
> struct hlist_head *chain;
> struct hlist_node *pos;
>
> @@ -532,12 +532,13 @@ static struct nlm_host *next_host_state(struct hlist_head *cache,
> host->h_state++;
>
> nlm_get_host(host);
> - goto out;
> + mutex_unlock(&nlm_host_mutex);
> + return host;
> }
> }
> -out:
> +
> mutex_unlock(&nlm_host_mutex);
> - return host;
> + return NULL;
> }
>
> /**
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/