Re: Kernel 3.4.X NFS server regression

From: bfields
Date: Mon Jun 11 2012 - 11:15:17 EST


On Mon, Jun 11, 2012 at 06:01:13PM +0300, Boaz Harrosh wrote:
> On 06/11/2012 05:29 PM, Jeff Layton wrote:
>
> > On Mon, 11 Jun 2012 16:44:09 +0300
> > Boaz Harrosh <bharrosh@xxxxxxxxxxx> wrote:
> >
> >> On 06/11/2012 04:32 PM, Boaz Harrosh wrote:
> >>
> >>> On 06/11/2012 03:39 PM, Jeff Layton wrote:
> >>>
> >>>>>
> >>>>> But I'm guessing we were wrong to assume that existing setups that
> >>>>> people perceived as working would have that path, because the failures
> >>>>> in the absence of that path were probably less obvious.
> >>>>>
> >>
> >>
> >> One more thing, the most important one. We have already fixed that in the
> >> past and I was hoping the lesson was learned. Apparently it was not, and
> >> we are doomed to make this mistake forever!!
> >>
> >> Whatever crap fails, times out, or crashes in the recovery code, we don't
> >> give a damn. It should never affect any server-client communication.
> >>
> >> When the grace period ends, the client gates open, period. *Any* error
> >> return from state recovery code must be carefully ignored and normal
> >> operations resumed. At most, on error, we move into a mode where any
> >> recovery request from a client is accepted, since we don't have any better
> >> data to verify it.
> >>
> >> Please comb the recovery code to make sure any catastrophe is safely ignored.
> >> We already did that before and it used to work.
> >>
> >
> > That's not the case, and hasn't ever been AFAICT. The code has changed
> > a bit recently, but the existing behavior in this regard was preserved.
> > From nfs4_check_open_reclaim:
> >
> > return nfsd4_client_record_check(clp) ? nfserr_reclaim_bad : nfs_ok;
> >
> > ...if there is no client record, then the reclaim request fails. Doesn't
> > the RFC mandate that?
> >
>
>
> Regardless of what the RFC mandates and what is returned to the client (which
> does not sound very robust to me), I'm sure the client handles
> nfserr_reclaim_bad just fine.
>
> It's the server that trips over its own feet and stops responding.
> That's what I meant. We should always resume normal operations after
> the grace period ends.
>
> I did not see any reports of clients getting into trouble because of an
> unexpected nfserr_reclaim_bad, did you?

We did have a few bugs in that area, and as far as I know they're fixed
(and have stayed fixed!).

The one other thing we've seen at testing events is clients not sending
reclaim_complete. Not only is it mandatory for the client (whether or not
it has state to reclaim), it's actually mandatory for servers to fail
further operations until it's sent. However, the problems were all seen
with unreleased client code that the implementors said they'd fix.
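
To make the two checks discussed in this thread concrete, here is a rough
userspace sketch in C. It is not the nfsd code: the struct, its fields, and
the function names are made up for illustration, and only the status code
values are the ones assigned in the NFSv4 RFCs. It also ignores the grace
period itself and everything else a real server tracks.

```c
#include <stdbool.h>

/* NFSv4 status codes; numeric values as assigned in RFC 5661. */
enum nfsstat4 {
    NFS4_OK             = 0,
    NFS4ERR_GRACE       = 10013,
    NFS4ERR_RECLAIM_BAD = 10034,
};

/* Hypothetical, heavily simplified client state (not struct nfs4_client). */
struct client {
    bool has_stable_record;      /* server has a record of this client from before its reboot */
    bool sent_reclaim_complete;  /* client has sent reclaim_complete */
};

/* Reclaim open: fails if the server has no record of the client. */
enum nfsstat4 check_open_reclaim(const struct client *clp)
{
    return clp->has_stable_record ? NFS4_OK : NFS4ERR_RECLAIM_BAD;
}

/* Non-reclaim open: the server fails it until reclaim_complete is seen. */
enum nfsstat4 check_new_open(const struct client *clp)
{
    return clp->sent_reclaim_complete ? NFS4_OK : NFS4ERR_GRACE;
}

/* reclaim_complete is accepted whether or not there was state to reclaim. */
void do_reclaim_complete(struct client *clp)
{
    clp->sent_reclaim_complete = true;
}
```

In this model a client with no record gets NFS4ERR_RECLAIM_BAD on its
reclaim, but once it sends reclaim_complete its ordinary opens go through,
so a failed reclaim on its own does not keep the client locked out.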

--b.