Re: [PATCH 0/12] FS-Cache: Generic filesystem caching facility

From: David Howells
Date: Thu Nov 17 2005 - 14:28:00 EST

Andrew Morton <akpm@xxxxxxxx> wrote:

> > That's the third time you've asked:-)
> Maybe on the fourth or fifth time it'll occur to you to put it into the
> changelog.

But that's not what's changed.

So if/when I produce a CacheFiles patch as well, you'll expect a critique of
why that's better than everything else in the changelog for that?

> None of that appears to be relevant.

It rules out the use of i_mapping...

> A blockdev is just a big, fixed-sized file. Why cannot it be backed by a
> big, fixed-sized file?
> <looks>
> OK, it's doing submit_bio() directly.

Using a big fixed-sized file also means that you've got two layout managers
and two transaction managers and two metadata managers on top of each other.

> > This facility isn't well advanced yet, and will initially only be
> > available on EXT2/3. It will also require a userspace component to clean
> > up dead nodes.
> I'd have thought that a decent intermediate step would be
> cache-on-single-file using a_ops.direct_IO, as you're implying above.

That's really the worst of both worlds. If you can access files, then you're
best off doing so on a one cache-file per netfs-file basis, *if* you can get
notification of completion on an asynchronous operation.

If you try to do this, the caching backend will try to lay the blocks out in
a manner that will then be undone because the underlying filesystem will then
put the blocks or parts thereof where *it* wishes.

Furthermore, it would seem that whilst undertaking direct I/O on an inode,
that inode is locked against other direct I/O operations. This could end up
serialising all I/O operations on the cache (see dio_complete() in

> Then all the direct-to-blockdev code can go away. It'll take some tweaking
> of the core direct-io code, but nothing terribly serious.

The direct-to-blockdev code should get you better performance than going
through a single file on a filesystem: with your suggestion, you end up adding
the latency of the cache-to-single-file to that of the underlying filesystem.

There are five main problems that need solving for cachefiles that I can see:

(1) Reading of holes must return ENODATA or a short read. I have a patch to
do this for O_DIRECT (attached).

(2) It must be possible to do O_DIRECT reads/writes directly to/from kernel
pages. This may be possible without modification, but I'm not certain of
that; looking at dio_refill_pages() it may not be - that accesses the
current->mm to get more pages.

(3) It must be possible to do these reads and writes asynchronously and to
get notification of their completion. I'm not sure how easy this is, but
it looks like it should be possible, perhaps using a kiocb. The routines
in fs/direct-io.c don't seem to be able to do asynchronicity, except
through AIO.

(4) It must be possible to maintain structural integrity in the cache. This
should be possible simply by relying on the underlying filesystem.

(5) It must be possible to maintain a certain level of data integrity in the
cache. We really don't want to have to blow the entire cache away if the
power goes out or the cache isn't laid to rest correctly.

It may end up being necessary to have a parallel to fs/direct-io.c for doing
I/O asynchronously to/from kernel pages.

Also, fs/direct-io.c seems to assume the filesystem on which it's running uses
buffer_heads - but not all of them do.
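
To make the semantics in (1) concrete, here is a minimal userspace model of
what a read that refuses to cross holes would return. It is only a sketch:
read_stop_at_hole() and the mapped[] block map are hypothetical names made up
for illustration, not kernel interfaces, and the model counts blocks where the
patch merely keeps a 'sent' flag.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of item (1): walk the requested blocks in order and
 * stop at the first hole.
 *
 * mapped[i] is true if block i has backing storage, false if it is a hole.
 * Returns the number of blocks read before the first hole (a short read),
 * or -ENODATA if the very first block requested is a hole.
 */
static long read_stop_at_hole(const bool *mapped, size_t nblocks,
                              size_t start, size_t count)
{
    /* Plays the role of the 'sent' flag in the patch, but counts blocks. */
    long sent = 0;

    for (size_t i = start; i < start + count && i < nblocks; i++) {
        if (!mapped[i])
            return sent ? sent : -ENODATA;
        sent++;
    }
    return sent;
}
```

With a block map of { data, data, hole, data }, a four-block read starting at
block 0 returns 2 (a short read), whilst a read starting at block 2 returns
-ENODATA because nothing was transferred before the hole was hit.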


diff -uNr linux-2.6.12-rc2-mm1/fs/direct-io.c linux-2.6.12-rc2-mm1-cachefs/fs/direct-io.c
--- linux-2.6.12-rc2-mm1/fs/direct-io.c 2005-04-06 13:48:23.000000000 +0100
+++ linux-2.6.12-rc2-mm1-cachefs/fs/direct-io.c 2005-04-08 10:34:36.778872220 +0100
@@ -790,7 +790,7 @@
struct page *page;
unsigned block_in_page;
struct buffer_head *map_bh = &dio->map_bh;
- int ret = 0;
+ int ret = 0, sent = 0;

/* The I/O can start at any block offset within the first page */
block_in_page = dio->first_block_in_page;
@@ -861,6 +861,14 @@
return -ENOTBLK;
+ else if (dio->iocb->ki_filp->f_flags &
+ ) {
+ page_cache_release(page);
+ if (sent)
+ return 0;
+ return -ENODATA;
+ }

if (dio->block_in_file >=
i_size_read(dio->inode)>>blkbits) {
@@ -907,6 +915,7 @@
goto out;
+ sent = 1;
dio->next_block_for_io += this_chunk_blocks;

dio->block_in_file += this_chunk_blocks;
diff -uNr linux-2.6.12-rc2-mm1/include/asm-i386/fcntl.h linux-2.6.12-rc2-mm1-cachefs/include/asm-i386/fcntl.h
--- linux-2.6.12-rc2-mm1/include/asm-i386/fcntl.h 2004-09-16 12:06:17.000000000 +0100
+++ linux-2.6.12-rc2-mm1-cachefs/include/asm-i386/fcntl.h 2005-04-07 15:46:30.000000000 +0100
@@ -21,6 +21,7 @@
#define O_DIRECTORY 0200000 /* must be a directory */
#define O_NOFOLLOW 0400000 /* don't follow links */
#define O_NOATIME 01000000
+#define O_NOREADHOLE 02000000 /* give short read or ENODATA on a hole */

#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */