Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems

From: Nikolai Joukov
Date: Sat Dec 16 2006 - 12:40:25 EST

Next message: Randy Dunlap: "Re: [-mm patch] drivers/video/{s3fb,svgalib}.c: possible cleanups"
Previous message: Jeff Garzik: "[git patches] libata fixes"
Next in thread: Nikolai Joukov: "Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> I am looking at filling the net-pipe, and it only reaches 40-75% max, with
> some short 100% bursts, and a slow 10% start. It seems that caching
> somewhat delays the writes, which then batch up and sync at various speeds.
> So you have the cache really hiding slow sync speeds. To tune this, it may
> be helpful to turn off caching, which in turn could surface the actual
> bottlenecks.

Well, RAIF is an external kernel module and as such cannot change the
caching behavior much. The notorious problem of all Linux stackable file
systems is double-caching of data. Every stackable file system caches the
data at its own level and copies it from/to the lower file system's cached
pages when necessary. This has some advantages for file systems that
change the data (e.g., encrypt or compress). However, this effectively
reduces the system's cache memory size by two or more times.

Among all the existing stackable file systems, Tracefs and RAIF have the
highest requirements for low overheads. Here are the solutions to the
double-caching problem that we tried:

1. Redirect read and write requests to the lower file systems directly.
This allows to avoid caching of data at the RAIF level. However, this
optimization must be turned off as soon as a file is mmap'ed to avoid
cache inconsistencies. Also, even if the file is not mmap'ed, RAIF1
still keeps the data copies for all the lower branches. This
optimization is implemented in RAIF but is not turned on by default.
(We strip this and many other #ifdef'ed code fragments from the
code releases automatically.)

2. We cache the data at the RAIF level. When we write to the lower file
systems we do allocate the lower pages and do copy the data but we
also mark the lower pages with PG_reclaim flag before calling the
lower writepage operation. This releases all the lower pages right
after the write completes. This works fine for mmap'ed files and this
is the default RAIF behavior now. This solves the problem for most
workloads that mix reads and writes. For example, it improved
Postmark's performance several times. Unfortunately, this optimization
does not improve performance for big sequential writes - the workload
that you tried. So essentially, you had a quarter of your original page
cache while running your workload.

3. A known ideal solution for this problem is sharing of the cached pages
between file systems. We attempted to do it for Tracefs but the
resulting code is not beautiful and is potentially racy:
<http://marc.theaimsgroup.com/?l=linux-fsdevel&m=113193082115222&w=2>
Unfortunately, for fan-out file systems this solution requires even
more support from the OS. However, this is what most OSs do
(including BSD and Windows) but unfortunately not Linux :-(

> little overhead. So for RAIF to be viable, it needs to have low overhead,
> which doesn't seem impossible to implement, given RAIF's simple but
> beautiful approach.

Thanks a lot!

Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Randy Dunlap: "Re: [-mm patch] drivers/video/{s3fb,svgalib}.c: possible cleanups"
Previous message: Jeff Garzik: "[git patches] libata fixes"
Next in thread: Nikolai Joukov: "Re: [ANNOUNCE] RAIF: Redundant Array of Independent Filesystems"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]