Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FSis under heavy write load (massive starvation)
From: Mikulas Patocka
Date: Sat Apr 28 2007 - 18:38:29 EST
On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
phase tree filesystems (TUX2); it writes inside normal used structures,
but it marks each structure with generation tags --- when it updates
global table of tags, it atomically makes several structures valid. I
don't know about this idea being used elsewhere.
So how is this generation structure organized ? paper ?
Paper is in CITSA 2006 proceedings (but you likely don't have them and I
signed some statement that I can't post it elsewhere :-( )
Basicly the idea is this:
* you have array containing 65536 32-bit numbers --- crash count table ---
that array is on disk and in memory (see struct __spadfs->cct in my sources)
* you have 16-bit value --- crash count, that value is on disk and in memory
too (see struct __spadfs->cc)
* On mount, you load crash count table and crash count from disk to
memory. You increment carsh count on disk (but leave old in memory). You
increment one entry in crash count table - cct[cc] in memory, but leave
old on disk.
* On sync you write all metadata buffers, do write barrier, write one
sector of crash count table from memory to disk and do write
barrier again.
* On unmount, you sync and decrement crash count on disk.
--- so crash count counts crashes --- it is increased each time you mount
and don't unmount.
Consistency of structures:
* Each directory entry has two tags --- 32-bit transaction count (txc)
and 16-bit crash count(cc).
* You create directory entry with entry->txc = fs->txc[fs->cc] and
entry->cc = fs->cc
* Directory entry is considered valid if fs->txc[entry->cc] >= entry->txc
(see macro CC_VALID)
* If the directory entry is not valid, it is skipped during directory
scan, as if it wasn't there
--- so you create a directory entry and its valid. If the system crashes,
it will load crash count table from disk and there's one-less value than
entry->txc, so the entry will be invalid. It will also run with increased
cc, so it will never touch txc at an old index, so the entry will be valid
forever.
--- if you sync, you write crash count table to disk and directory entry
will be atomically made valid forever (because values in crash count table
never decrease)
In my implementation, the top bit of entry->txc is used to mark whether
the entry is scheduled for adding or delete, so that you can atomically
add one directory entry and delete other.
Space allocation bitmaps or lists are managed in such a way that there are
two copies and cc/txc pair determining which one is valid.
Files are extended in such a way that each file has two "size" entries and
cc/txc pair denoting which one is valid, so that you can atomically
extend/truncate file and mark its space allocated/freed in bitmaps or
lists (BTW. this cc/txc pair is the same one that denotes if the directory
entry is valid and another bit determines one of these two functions ---
to save space).
Mikulas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/