David Gould wrote:
>
> On Wed, Apr 19, 2000 at 01:19:06AM -0700, Hans Reiser wrote:
> >
> > In our case, we are choosing between having Alexei work on knfsd interaction
> > with reiserfs, and implementing seals. Seals reduce the granularity of locks in
> > reiserfs to per buffer granularity and avoid holding any locks for the length of
> > a disk I/O. I can describe seals if you or someone else is interested, but
> > first I must get some sleep.:-)
> >
> Please do take care to sleep. But, if you find time I am very interested to
> know more about seals. Perhaps there is some other material you could point
> me to if you don't have time to explain.
>
> Thanks
>
> -dg
>
> --
> David Gould davidg@dnai.com 510.536.1443
> - If simplicity worked, the world would be overrun with insects. -
Ok, I needed to write this up to send it to Alexei anyway, so here it is.
Hans
/* Reiserfs gathers all of the resources it will need prior to
performing balancing. These resources include such things as
parent and neighbor nodes of the node being balanced. Gathering
these resources can sometimes take considerable time, as these
nodes may or may not be in memory. fix_nodes (fix as in fixate
nodes) is the name of the routine that gathers the resources needed
prior to balancing. If the nodes change state (contents) while
fix_nodes is in progress (or waiting for a node) then fix_nodes
must repeat its computation of what nodes will need to be fixated
before balancing. It is undesirable to place a read lock on the
nodes as fix_nodes is gathering them because that would prevent
writes to the read-locked nodes for the potentially long period of
time required to get all of the resources needed for a balancing
operation into memory.
Currently we put a generation counter on the entire filesystem, and
if it is changed by the end of the gathering of nodes we repeat
fix_nodes from the beginning. We also have code filled with giant
locks, particularly one lock around all of the balancing code.
This is inefficient.
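Roughly speaking, and with names invented for the sketch rather than
taken from the actual reiserfs sources, the current scheme amounts to
this:

    // sketch of the current scheme; names are illustrative only
    static int fs_generation;       // one counter for the whole filesystem,
                                    // bumped on every tree modification

    static void fix_nodes_today(void)
    {
            int start_gen;
    repeat:
            start_gen = fs_generation;      // remember where we started
            // ... gather parents, neighbors, possibly waiting on I/O ...
            if (fs_generation != start_gen)
                    goto repeat;            // anything changed anywhere: redo
    }

A write to any node at all, even one fix_nodes does not care about,
forces the recomputation.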
We need per buffer generation counters plus per buffer locks.
(Generation counters are the currently favored implementation of
seals; a seal, as a concept, means noticing whether a buffer has
been modified since one last looked at it, and there are several
possible implementations.) The seals are
incremented every time the buffer is modified. After the balancing
calculation and resource gathering (meaning, reading into memory
every node which might be modified) are completed, we place a
lock on each buffer, and check the seal. If the seal is larger at
the end of the gathering of resources than at the time the seal was
"placed" (which really means recorded in this implementation) then
the calculation is invalidated and repeated. If the calculation is
repeated more than MAX_BALANCE_RECALC times then locks are used
instead of seals during fix_nodes.
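To make that concrete, here is a minimal sketch of what generation
counter seals could look like; every name below is made up for
illustration and is not an actual reiserfs declaration:

    // illustrative sketch only
    struct sealed_node {
            int generation;            // incremented on every modification
            // ... per buffer lock, pointer to the buffer itself, etc.
    };

    struct seal {
            struct sealed_node *node;  // the buffer being watched
            int generation_seen;       // value recorded when the seal was "placed"
    };

    static void seal_place(struct seal *s, struct sealed_node *n)
    {
            s->node = n;
            s->generation_seen = n->generation;
    }

    // nonzero if the node is unchanged since the seal was placed
    static int seal_check(const struct seal *s)
    {
            return s->node->generation == s->generation_seen;
    }

In fix_nodes the idea is: place seals while gathering, then lock each
buffer and seal_check() it; if any check fails, drop the locks, count
the failure, and recompute, falling back to locks once the count
passes MAX_BALANCE_RECALC.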
znode digression: We plan on putting these counters and locks not
on the page struct or the buffer_head struct but on our own struct
called znodes which we will maintain internal to reiserfs. We need
a per buffer struct that we are free to modify without disturbing
others. Our code is uglified by our not being free to record the
parent in struct buffer_head, so we are also going to put that into
the znode struct. Our "reallocate block numbers on commit" code,
currently under development, needs to be able to find the parent of
an unformatted node. These znodes are not present in 3.6.4; we just
got them to work and are now starting to use them in an unstable
reallocate-on-commit code branch.
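For illustration only, a znode might carry fields something like
these (a guess at the shape, not the declaration we actually use):

    struct buffer_head;                // the kernel's struct, left untouched

    struct znode {
            struct buffer_head *bh;    // the buffer this znode shadows
            struct znode *parent;      // parent; buffer_head has no room for this
            int generation;            // the seal counter
            // per buffer read/write lock state would also live here
    };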
There are both read locks and write locks. I am not sure prior to
implementation if we need to convert read locks to write locks
(probably not in the balancing code but maybe elsewhere), but if we
do then let the following be true. When converting a read lock to
a write lock, first convert the read lock to a seal; then, after
waiting for any other processes' read locks to be released, convert
the seal to a write lock. This prevents deadlocks for the
case in which two processes have read locks on the same node, and
both convert the read locks into write locks.
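If we do end up needing it, the conversion might look roughly like
this, reusing the hypothetical znode fields sketched above; the
locking helpers named here are placeholders, not existing routines:

    void znode_read_unlock(struct znode *z);   // placeholder
    void znode_write_lock(struct znode *z);    // placeholder

    // nonzero means the node is unchanged and the upgrade succeeded
    static int upgrade_read_to_write(struct znode *z)
    {
            int gen = z->generation;   // convert the read lock to a seal
            znode_read_unlock(z);      // give up the read lock first, so two
                                       // upgraders cannot wait on each other
            znode_write_lock(z);       // waits for any remaining readers
            return z->generation == gen;  // seal broken: caller recomputes
    }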
Acquisition of locks but not seals is performed in the order of the
{device,block number} pair of the nodes locked. THIS IS IMPORTANT
to avoiding deadlocks. It is not important that we use that
particular ordering; it is important that there be some ordering and
that it be used consistently. It seems that Linux might be better
off if, as IRIX is said to have, it had a master ordered list of
locks that all programmers employ and update, and whose order they
are careful never to violate when acquiring locks.
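Concretely, the rule could be expressed as a comparison on the
{device, block number} pair, something like this (field names are
invented for the sketch):

    // nodes to be locked are sorted with this comparison and then
    // locked in ascending order; seals do not need the ordering
    struct node_id {
            unsigned int dev;          // device the node lives on
            unsigned long blocknr;     // block number within that device
    };

    static int node_lock_order(const struct node_id *a, const struct node_id *b)
    {
            if (a->dev != b->dev)
                    return a->dev < b->dev ? -1 : 1;
            if (a->blocknr != b->blocknr)
                    return a->blocknr < b->blocknr ? -1 : 1;
            return 0;                  // same node
    }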
When we start fix_nodes we search for the leaf containing the item
(direct or indirect or directory) which we will affect in the
intended operation. We seal that leaf, and then we find its parent
and seal it. Then we perform more analysis, and we seal the
parents of the left and right neighbors. Then, if we think we will
alter the neighbors we seal them. We also potentially seal
ancestors of the leaf node and neighbors, and neighbors of
ancestors, depending on the results of our analysis. If our
balancing analysis indicates that we should not write to a neighbor
we place a read lock on the parent of the neighbor, otherwise we
place a write lock on the neighbor and its parent. (Parents of
nodes currently record the free space of their children.
We depend on this value remaining unchanged during balancing, and
so it is necessary to read lock the parents of the neighbor nodes.)
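Put as a skeleton, with every routine named below being a
placeholder rather than existing code, the sequence is roughly:

    void seal_leaf_and_its_parent(void);
    void seal_parents_of_neighbors(void);
    int  analysis_says_neighbors_will_change(void);
    void seal_neighbors_and_maybe_ancestors(void);
    int  convert_seals_to_locks(void); // read lock parents of untouched
                                       // neighbors, write lock the rest

    void gather_for_balance(void)
    {
    repeat:
            seal_leaf_and_its_parent();
            seal_parents_of_neighbors();
            if (analysis_says_neighbors_will_change())
                    seal_neighbors_and_maybe_ancestors();
            if (!convert_seals_to_locks())  // seals checked as locks are
                    goto repeat;            // taken, in {device,block} order
    }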
Locking is performed in fix_nodes, and remains in place until
do_balance completes. Sealing is performed in fix_nodes, and seals
remain until fix_nodes replaces the seals with locks.
If a node changes in space consumed, perhaps its neighbors must be
read locked until balancing completes. This point is arguable;
having a less than perfectly balanced tree is not the end of the
world.
*/