Red Hat needs this patch (was Re: handling NFSERR_JUKEBOX)

From: Bob Smart (smart@hpc.CSIRO.AU)
Date: Tue Nov 06 2001 - 00:21:14 EST


> Take a look at the nfs sourceforge mailing list. There's a long thread
> about this in the context of a Solaris HSM server. I wrote a patch to
> try to do the right thing on a linux client, but it is not perfect for
> many reasons: http://devrandom.net/lists/archives/2001/6/NFS/0135.html

Thanks again for this patch. I've appended our current version to
save others the trouble of minor modifications. I've just tested
it against 2.4.13-ac7.

Your patch may not be perfect but it is a million times better than
nothing. We have had no problems with it. Linux has a long tradition
of utilizing imperfect software till something better comes along.

This patch is ESSENTIAL to make Linux useable in large enterprises,
because they increasingly commonly utilize hirearchically managed
storage. I strongly urge Alan/Linus/Macello to include this patch
in the stable kernel tree (but marked experimental). Or at least
can the kernel maintainers at Red Hat and other distributions put
it in their offerings: surely you see that you need this for your
big customers.

I understand that some people think the problem should be addressed
in glibc (+ all other places that do open system calls). Well really
this is just a small part of a much wider problem. I've included a
rant below. The bottom line is that default "wait as long as necessary"
behaviour should be provided by the kernel, but there needs to be an
optional mechanism that lets the user know when things will take longer
than expected.

Cheers,

Bob

-----------------------------------------------------------------------------

Handling long latency events
----------------------------

The linux kernel potentially returns the error EJUKEBOX from
some operations. This error means: "This operation might well
succeed eventually, and I've started the task [e.g. get
file from tape] that will cause it to later succeed. However
since it will take a long time I'm returning this error so
that you can: inform the user; wait a while; then retry the op."
Now that NFS servers are starting to use NFSERR_JUKEBOX returns,
(which get turned into EJUKEBOX if you don't have the Wyckoff
patch) this is starting to happen and we find that linux user
mode software doesn't handle it.

Getting a file from tape is just one of many circumstances where
operations sometimes take an unexpectedly long time. Other cases
typcially revolve around network operations where remote servers
are not responding for one reason or another. As use of network
services increases this problem is getting worse.

We need to be able to handle these situations in 2 different ways.
If it is a batch or background operation then most of the time
we just want it to wait. Typical example: a telnet server shouldn't
drop a terminal session just because the client has been out of
touch for an hour because of network problems. On the other hand
if the software is interacting with the user and suddenly there
are delays, then we want to give the user useful feedback and
also allow the user to kill the operation. A typical example is
a telnet client [and the (old?) windows behaviour of locking up
is particularly annoying].

It seems to me that the right unixy solution to this problem is
through signals. The EJUKEBOX idea is regretable. However it has
happened and we need to handle it. I strongly urge that:

 - default behaviour is for the kernel to wait and never return
   EJUKEBOX.

 - however user programs that want to handle EJUKEBOX should
   be able to put themselves into a mode where EJUKEBOX is returned.

This approach has the advantage of minimizing impact on existing
user mode software. If you accept my argument then Pete Wyckoff's
patch is a step in the right direction.

-----------------------------------------------------------------------------

diff -r -u --exclude='*.orig' SAVE/linux-2.4.13-0.3/Documentation/Configure.help linux-2.4.13-0.3+jukebox/Documentation/Configure.help
--- SAVE/linux-2.4.13-0.3/Documentation/Configure.help Tue Oct 30 09:19:01 2001
+++ linux-2.4.13-0.3+jukebox/Documentation/Configure.help Sun Nov 4 14:03:59 2001
@@ -14094,6 +14094,23 @@
 
   If unsure, say N.
 
+Auto-retry jukebox error requests (EXPERIMENTAL)
+CONFIG_NFS_V3_JUKEBOX_RETRY
+ Say Y here if you want your NFS client to automatically retry requests
+ to which the server responded with EJUKEBOX (528). Some NFS servers
+ export filesystems which are hierarchical, in that some files may take
+ much longer than others to read or write. These files may be migrated
+ to a slow tape storage system which serves as the bulk capacity for the
+ file system; the fast disk becomes essentially a cache.
+
+ If you try to read a file which has been migrated to tape, the server
+ may return the error EJUKEBOX and begin the process which brings the
+ file from tape to disk. This option enables the kernel to retry
+ the access until the file is ready, rather than passing the error to
+ userspace applications, which is the default.
+
+ If unsure, say N.
+
 Root file system on NFS
 CONFIG_ROOT_NFS
   If you want your Linux box to mount its whole root file system (the
diff -r -u --exclude='*.orig' SAVE/linux-2.4.13-0.3/fs/Config.in linux-2.4.13-0.3+jukebox/fs/Config.in
--- SAVE/linux-2.4.13-0.3/fs/Config.in Tue Oct 30 09:18:42 2001
+++ linux-2.4.13-0.3+jukebox/fs/Config.in Sun Nov 4 14:03:59 2001
@@ -95,6 +95,7 @@
    dep_tristate 'InterMezzo file system support (experimental, replicating fs)' CONFIG_INTERMEZZO_FS $CONFIG_INET $CONFIG_EXPERIMENTAL
    dep_tristate 'NFS file system support' CONFIG_NFS_FS $CONFIG_INET
    dep_mbool ' Provide NFSv3 client support' CONFIG_NFS_V3 $CONFIG_NFS_FS
+ dep_mbool ' Auto-retry jukebox error requests' CONFIG_NFS_V3_JUKEBOX_RETRY $CONFIG_NFS_V3
    dep_bool ' Root file system on NFS' CONFIG_ROOT_NFS $CONFIG_NFS_FS $CONFIG_IP_PNP
 
    dep_tristate 'NFS server support' CONFIG_NFSD $CONFIG_INET
diff -r -u --exclude='*.orig' SAVE/linux-2.4.13-0.3/fs/nfs/file.c linux-2.4.13-0.3+jukebox/fs/nfs/file.c
--- SAVE/linux-2.4.13-0.3/fs/nfs/file.c Mon Sep 24 02:48:01 2001
+++ linux-2.4.13-0.3+jukebox/fs/nfs/file.c Sun Nov 4 15:17:14 2001
@@ -192,8 +192,19 @@
         if (!inode)
                 return 0;
 
+#ifdef CONFIG_NFS_V3_JUKEBOX_RETRY
+ /* perhaps nfsv3 server returned jukebox, go try again */
+ /* if (PageError(page) && inode->u.nfs_i.async_error == -EJUKEBOX) {
+ result = nfs_readpage_sync(inode->u.nfs_i.file, inode, page); */
+ if (PageError(page) && NFS_I(inode)->async_error == -EJUKEBOX) {
+ result = nfs_readpage_sync(NFS_I(inode)->file, inode, page);
+ } else
+#endif
+ {
         rpages = NFS_SERVER(inode)->rpages;
         result = nfs_pagein_inode(inode, index, rpages);
+ }
+
         if (result < 0)
                 return result;
         return 0;
diff -r -u --exclude='*.orig' SAVE/linux-2.4.13-0.3/fs/nfs/nfs3proc.c linux-2.4.13-0.3+jukebox/fs/nfs/nfs3proc.c
--- SAVE/linux-2.4.13-0.3/fs/nfs/nfs3proc.c Tue Oct 2 06:45:37 2001
+++ linux-2.4.13-0.3+jukebox/fs/nfs/nfs3proc.c Sun Nov 4 16:28:09 2001
@@ -4,6 +4,8 @@
  * Client-side NFSv3 procedures stubs.
  *
  * Copyright (C) 1997, Olaf Kirch
+ *
+ * 21 Jun 01 - Retry NFSv3 EJUKEBOX errors by Pete Wyckoff
  */
 
 #include <linux/mm.h>
@@ -56,10 +58,25 @@
         struct nfs3_sattrargs arg = { NFS_FH(inode), sattr, 0, 0 };
         int status;
 
+ for (;;) {
         dprintk("NFS call setattr\n");
         fattr->valid = 0;
- status = rpc_call(NFS_CLIENT(inode), NFS3PROC_SETATTR, &arg, fattr, 0);
- dprintk("NFS reply setattr\n");
+ status = rpc_call(NFS_CLIENT(inode), NFS3PROC_SETATTR, &arg,
+ fattr, 0);
+ dprintk("NFS reply setattr %d\n", status);
+#ifdef CONFIG_NFS_V3_JUKEBOX_RETRY
+ if (status == -EJUKEBOX) {
+ DECLARE_WAIT_QUEUE_HEAD(jukebox_wq);
+ tty_write_message(current->tty,
+ "NFSv3: write waiting for file unmigration\r\n");
+ if (interruptible_sleep_on_timeout(&jukebox_wq,
+ NFS_JUKEBOX_RETRY_TIMEO) == 0)
+ continue; /* but not if interrupted */
+ }
+#endif
+ break;
+ }
+
         return status;
 }
 
@@ -147,10 +164,24 @@
         struct rpc_message msg = { NFS3PROC_READ, &arg, &res, cred };
         int status;
 
+ for (;;) {
         dprintk("NFS call read %d @ %Ld\n", count, (long long)offset);
         fattr->valid = 0;
         status = rpc_call_sync(NFS_CLIENT(inode), &msg, flags);
         dprintk("NFS reply read: %d\n", status);
+#ifdef CONFIG_NFS_V3_JUKEBOX_RETRY
+ if (status == -EJUKEBOX) {
+ DECLARE_WAIT_QUEUE_HEAD(jukebox_wq);
+ tty_write_message(current->tty,
+ "NFSv3: read waiting for file unmigration\r\n");
+ if (interruptible_sleep_on_timeout(&jukebox_wq,
+ NFS_JUKEBOX_RETRY_TIMEO) == 0)
+ continue; /* but not if interrupted */
+ }
+#endif
+ break;
+ }
+
         *eofp = res.eof;
         return status;
 }
@@ -177,7 +208,7 @@
 
         status = rpc_call_sync(NFS_CLIENT(inode), &msg, rpcflags);
 
- dprintk("NFS reply read: %d\n", status);
+ dprintk("NFS reply write: %d\n", status);
         return status < 0? status : res.count;
 }
 
diff -r -u --exclude='*.orig' SAVE/linux-2.4.13-0.3/fs/nfs/read.c linux-2.4.13-0.3+jukebox/fs/nfs/read.c
--- SAVE/linux-2.4.13-0.3/fs/nfs/read.c Tue Oct 30 09:18:42 2001
+++ linux-2.4.13-0.3+jukebox/fs/nfs/read.c Sun Nov 4 15:28:38 2001
@@ -81,7 +81,7 @@
 /*
  * Read a page synchronously.
  */
-static int
+int
 nfs_readpage_sync(struct file *file, struct inode *inode, struct page *page)
 {
         struct rpc_cred *cred = NULL;
@@ -449,11 +449,27 @@
                 if (task->tk_status >= 0 && count >= 0) {
                         SetPageUptodate(page);
                         count -= PAGE_CACHE_SIZE;
- } else
+ flush_dcache_page(page);
+ kunmap(page);
+ UnlockPage(page);
+ } else {
                         SetPageError(page);
                 flush_dcache_page(page);
                 kunmap(page);
+#ifdef CONFIG_NFS_V3_JUKEBOX_RETRY
+ /* do not unlock page, let nfs_sync_page retry */
+ if (task->tk_status == -EJUKEBOX) {
+ NFS_I(inode)->async_error = task->tk_status;
+ if (waitqueue_active(&page->wait))
+ wake_up(&page->wait);
+ else /* nobody waiting, just unlock */
                 UnlockPage(page);
+ } else
+#endif
+ {
+ UnlockPage(page);
+ }
+ }
 
                 dprintk("NFS: read (%x/%Ld %d@%Ld)\n",
                         req->wb_inode->i_dev,
@@ -504,8 +520,13 @@
                 goto out_error;
 
         error = -1;
- if (!PageError(page) && NFS_SERVER(inode)->rsize >= PAGE_CACHE_SIZE)
+ if (!PageError(page) && NFS_SERVER(inode)->rsize >= PAGE_CACHE_SIZE) {
+#ifdef CONFIG_NFS_V3_JUKEBOX_RETRY
+ NFS_I(inode)->file = file;
+ NFS_I(inode)->async_error = 0;
+#endif
                 error = nfs_readpage_async(file, inode, page);
+ }
         if (error >= 0)
                 goto out;
 
diff -r -u --exclude='*.orig' SAVE/linux-2.4.13-0.3/include/linux/nfs_fs.h linux-2.4.13-0.3+jukebox/include/linux/nfs_fs.h
--- SAVE/linux-2.4.13-0.3/include/linux/nfs_fs.h Tue Oct 30 10:04:16 2001
+++ linux-2.4.13-0.3+jukebox/include/linux/nfs_fs.h Sun Nov 4 14:09:34 2001
@@ -104,6 +104,9 @@
 /* Inode Flags */
 #define NFS_USE_READDIRPLUS(inode) ((NFS_FLAGS(inode) & NFS_INO_ADVISE_RDPLUS) ? 1 : 0)
 
+/* Jukebox retry interval */
+#define NFS_JUKEBOX_RETRY_TIMEO (10 * HZ)
+
 /*
  * These are the default flags for swap requests
  */
@@ -258,6 +261,7 @@
 extern int nfs_readpage(struct file *, struct page *);
 extern int nfs_pagein_inode(struct inode *, unsigned long, unsigned int);
 extern int nfs_pagein_timeout(struct inode *);
+extern int nfs_readpage_sync(struct file *, struct inode *, struct page *);
 
 /*
  * linux/fs/mount_clnt.c
diff -r -u --exclude='*.orig' SAVE/linux-2.4.13-0.3/include/linux/nfs_fs_i.h linux-2.4.13-0.3+jukebox/include/linux/nfs_fs_i.h
--- SAVE/linux-2.4.13-0.3/include/linux/nfs_fs_i.h Tue Oct 30 10:03:46 2001
+++ linux-2.4.13-0.3+jukebox/include/linux/nfs_fs_i.h Sun Nov 4 14:09:25 2001
@@ -75,6 +75,11 @@
 
         /* Credentials for shared mmap */
         struct rpc_cred *mm_cred;
+
+#ifdef CONFIG_NFS_V3_JUKEBOX_RETRY
+ struct file *file;
+ int async_error;
+#endif
 };
 
 /*

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Wed Nov 07 2001 - 21:00:28 EST