Re: [PATCH] CIFS: Decrease reconnection delay when switching nics

From: Steve French
Date: Thu Feb 28 2013 - 12:46:08 EST


On Thu, Feb 28, 2013 at 11:31 AM, Dave Chiluk <dave.chiluk@xxxxxxxxxxxxx> wrote:
> On 02/28/2013 10:47 AM, Jeff Layton wrote:
>> On Thu, 28 Feb 2013 10:04:36 -0600
>> Steve French <smfrench@xxxxxxxxx> wrote:
>>
>>> On Thu, Feb 28, 2013 at 9:26 AM, Jeff Layton <jlayton@xxxxxxxxx> wrote:
>>>> On Wed, 27 Feb 2013 16:24:07 -0600
>>>> Dave Chiluk <dave.chiluk@xxxxxxxxxxxxx> wrote:
>>>>
>>>>> On 02/27/2013 10:34 AM, Jeff Layton wrote:
>>>>>> On Wed, 27 Feb 2013 12:06:14 +0100
>>>>>> "Stefan (metze) Metzmacher" <metze@xxxxxxxxx> wrote:
>>>>>>
>>>>>>> Hi Dave,
>>>>>>>
>>>>>>>> When messages are currently in queue awaiting a response, decrease amount of
>>>>>>>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
>>>>>>>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
>>>>>>>> seconds) since the last response was recieved. This does not take into account
>>>>>>>> the fact that messages waiting for a response should be serviced within a
>>>>>>>> reasonable round trip time.
>>>>>>>
>>>>>>> Wouldn't that mean that the client will disconnect a good connection,
>>>>>>> if the server doesn't response within 10 seconds?
>>>>>>> Reads and Writes can take longer than 10 seconds...
>>>>>>>
>>>>>>
>>>>>> Where does this magic value of 10s come from? Note that a slow server
>>>>>> can take *minutes* to respond to writes that are long past the EOF.
>>>>> It comes from the desire to decrease the reconnection delay to something
>>>>> better than a random number between 60 and 120 seconds. I am not
>>>>> committed to this number, and it is open for discussion. Additionally
>>>>> if you look closely at the logic it's not 10 seconds per request, but
>>>>> actually when requests have been in flight for more than 10 seconds make
>>>>> sure we've heard from the server in the last 10 seconds.
>>>>>
>>>>> Can you explain more fully your use case of writes that are long past
>>>>> the EOF? Perhaps with a test-case or script that I can test? As far as
>>>>> I know writes long past EOF will just result in a sparse file, and
>>>>> return in a reasonable round trip time *(that's at least what I'm seeing
>>>>> with my testing). dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
>>>>> seek=100000, starts receiving responses from the server in about .05
>>>>> seconds with subsequent responses following at roughly .002-.01 second
>>>>> intervals. This is well within my 10 second value. Even adding the
>>>>> latency of AT&T's 2g cell network brings it up to only 1s. Still 10x
>>>>> less than my 10 second value.
>>>>>
>>>>> The new logic goes like this
>>>>> if( we've been expecting a response from the server (in_flight), and
>>>>> message has been in_flight for more than 10 seconds and
>>>>> we haven't had any other contact from the server in that time
>>>>> reconnect
>>>>>
>>>>
>>>> That will break writes long past the EOF. Note too that reconnects on
>>>> CIFS are horrifically expensive and problematic. Much of the state on a
>>>> CIFS mount is tied to the connection. When that drops, open files are
>>>> closed and things like locks are dropped. SMB1 has no real mechanism
>>>> for state recovery, so that can really be a problem.
>>>>
>>>>> On a side note, I discovered a small race condition in the previous
>>>>> logic while working on this, that my new patch also fixes.
>>>>> 1s request
>>>>> 2s response
>>>>> 61.995 echo job pops
>>>>> 121.995 echo job pops and sends echo
>>>>> 122 server_unresponsive called. Finds no response and attempts to
>>>>> reconnect
>>>>> 122.95 response to echo received
>>>>>
>>>>
>>>> Sure, here's a reproducer. Do this against a windows server, preferably
>>>> one exporting NTFS on relatively slow storage. Make sure that
>>>> "testfile" doesn't exist first:
>>>>
>>>> $ dd if=/dev/zero of=/path/to/cifs/share/testfile bs=1M count=1 seek=3192
>>>>
>>>> NTFS doesn't support sparse files, so the OS has to zero-fill up to the
>>>> point where you're writing. That can take a looooong time on slow
>>>> storage (minutes even). What we do now is periodically send a SMB echo
>>>> to make sure the server is alive rather than trying to time out a
>>>> particular call.
>>>
>>> Writing past end of file in Windows can be very slow, but note that it
>>> is possible for a windows to set as sparse a file on an NTFS
>>> partition. Quoting from
>>> http://msdn.microsoft.com/en-us/library/windows/desktop/aa365566%28v=vs.85%29.aspx
>>> Windows NTFS does support sparse files (and we could even send it over
>>> cifs if we want) but it has to be explicitly set by the app on the
>>> file:
>>>
>>> "To determine whether a file system supports sparse files, call the
>>> GetVolumeInformation function and examine the
>>> FILE_SUPPORTS_SPARSE_FILES bit flag returned through the
>>> lpFileSystemFlags parameter.
>>>
>>> Most applications are not aware of sparse files and will not create
>>> sparse files. The fact that an application is reading a sparse file is
>>> transparent to the application. An application that is aware of
>>> sparse-files should determine whether its data set is suitable to be
>>> kept in a sparse file. After that determination is made, the
>>> application must explicitly declare a file as sparse, using the
>>> FSCTL_SET_SPARSE control code."
>>>
>>>
>>
>> That's interesting. I didn't know about the fsctl.
>>
>> It doesn't really help us though. Not all servers support passthrough
>> infolevels, and there are other filesystems (e.g. FAT) that don't
>> support sparse files at all.
>>
>> In any case, the upshot of all of this is that we simply can't assume
>> that we'll get the response to a particular call in any given amount of
>> time, so we have to periodically check that the server is still
>> responding via echoes before giving up on it completely.
>>
>
> I just verified this by running the dd testcase against a windows 7
> server. I'm going to rewrite my patch to optimise the echo logic as
> Jeff suggested earlier. The only difference being that, I think we
> should still have regular echos when nothing else is happening, so that
> the connection can be rebuilt when nothing urgent is going on.
>
> It still makes more sense to me that we should be checking the status of
> the tcp socket, and it's underlying nic, but I'm still not completely
> clear on how that could be accomplished. Any pointers to that regard
> would be appreciated.

It is also worth checking if the witness protocol would help us (even
in a nonclustered environment) because it was designed to allow (at
least for smb3 mounts) a client to tell when a server is up or down


--
Thanks,

Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/