architecture of (to be implemented) seals in reiserfs

From: Hans Reiser (hans@reiser.to)
Date: Wed Apr 19 2000 - 19:16:21 EST


David Gould wrote:
>
> On Wed, Apr 19, 2000 at 01:19:06AM -0700, Hans Reiser wrote:
> >
> > In our case, we are choosing between having Alexei work on knfsd interaction
> > with reiserfs, and implementing seals. Seals reduce the granularity of locks in
> > reiserfs to per buffer granularity and avoid holding any locks for the length of
> > a disk I/O. I can describe seals if you or someone else is interested, but
> > first I must get some sleep.:-)
> >
> Please do take care to sleep. But, if you find time I am very interested to
> know more about seals. Perhaps there is some other material you could point
> me to if you don't have time to explain.
>
> Thanks
>
> -dg
>
> --
> David Gould davidg@dnai.com 510.536.1443
> - If simplicity worked, the world would be overrun with insects. -

Ok, I needed to write this up to send it to Alexei anyway, so here it is.

Hans

/* Reiserfs gathers all of the resources it will need prior to
   performing balancing. These resources include such things as
   parent and neighbor nodes of the node being balanced. Gathering
   these resources can sometimes take considerable time, as these
   nodes may or may not be in memory. fix_nodes (fix as in fixate
   nodes) is the name of the routine that gathers the resources needed
   prior to balancing. If the nodes change state (contents) while
   fix_nodes is in progress (or waiting for a node) then fix_nodes
   must repeat its computation of what nodes will need to be fixated
   before balancing. It is undesirable to place a read lock on the
   nodes as fix_nodes is gathering them because that would prevent
   writes to the read-locked nodes for the potentially long period of
   time required to get all resources needed for a balancing
   operation into memory.

   Currently we put a generation counter on the entire filesystem, and
   if it is changed by the end of the gathering of nodes we repeat
   fix_nodes from the beginning. We also have code filled with giant
   locks, particularly one lock around all of the balancing code.
   This is inefficient.
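   The current filesystem-wide scheme could be sketched like this (a
   toy, single-threaded illustration; fs_generation, gather_nodes and
   fix_nodes_global are invented names, and the simulated writer
   stands in for unrelated activity elsewhere on the fs):

```c
/* One counter for the whole filesystem; any write anywhere bumps it. */
static unsigned long fs_generation;
static int gathers;

/* Stand-in for reading all needed nodes into memory; the first
   attempt is made to race with a writer bumping the global counter. */
static void gather_nodes(void)
{
    gathers++;
    if (gathers == 1)
        fs_generation++;         /* simulated unrelated modification */
}

/* Repeat the whole gathering pass until the counter holds still. */
static int fix_nodes_global(void)
{
    int attempts = 0;
    for (;;) {
        unsigned long start = fs_generation;
        attempts++;
        gather_nodes();          /* may sleep on disk I/O in reality */
        if (fs_generation == start)
            return attempts;     /* counter unchanged: result valid */
        /* even an unrelated write invalidates us: start over */
    }
}
```

   The inefficiency is visible here: a write to a completely
   unrelated part of the tree still forces the retry.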

   We need per-buffer generation counters plus per-buffer locks. (A
   seal, as a concept, means noticing whether a buffer has been
   modified since one last looked at it; there are several possible
   implementations, and generation counters are the current
   favorite.) A seal is
   incremented every time the buffer is modified. After the balancing
   calculation and resource gathering (meaning, reading into memory of
   every node which might be modified) is completed, we then place a
   lock on each buffer, and check the seal. If the seal is larger at
   the end of the gathering of resources than at the time the seal was
   "placed" (which really means recorded in this implementation) then
   the calculation is invalidated and repeated. If the calculation is
   repeated more than MAX_BALANCE_RECALC times then locks are used
   instead of seals during fix_nodes.
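   A minimal sketch of seals as per-buffer generation counters (the
   struct layout and function names here are illustrative, not the
   actual reiserfs code):

```c
struct znode {
    unsigned long generation;  /* bumped on every modification */
};

/* Placing a seal just records the buffer's current generation. */
static unsigned long seal_place(const struct znode *z)
{
    return z->generation;
}

/* Every write path to the buffer increments its counter. */
static void znode_modify(struct znode *z)
{
    z->generation++;
}

/* The seal is broken if the counter grew since the seal was placed;
   a broken seal means the balancing calculation must be repeated
   (or, after MAX_BALANCE_RECALC retries, redone under locks). */
static int seal_broken(const struct znode *z, unsigned long sealed)
{
    return z->generation != sealed;
}
```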

   znode digression: We plan on putting these counters and locks not
   on the page struct or the buffer_head struct but on our own struct
   called znodes which we will maintain internal to reiserfs. We need
   a per buffer struct that we are free to modify without disturbing
   others. Our code is uglified by our not being free to record the
   parent in struct buffer_head, so we are also going to put that into
   the znode struct. Our currently under development "reallocate
   block numbers on commit" code needs to be able to find the parent
   of an unformatted node. These znodes are not present in 3.6.4; we
   just got them to work and are now starting to use them in an
   unstable reallocate-on-commit code branch.
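   A guess at the shape of such a znode (the field names are
   assumptions; only the motivations for the fields, i.e. the seal
   counter, per-buffer lock, and recorded parent, come from the text
   above):

```c
struct buffer_head;            /* the generic kernel buffer, opaque here */

struct znode {
    struct buffer_head *bh;    /* the buffer this znode shadows */
    struct znode *parent;      /* parent node; buffer_head cannot carry
                                  this, and reallocate-on-commit needs it
                                  to find the parent of an unformatted
                                  node */
    unsigned long generation;  /* the seal counter */
    int lock;                  /* per-buffer lock word (a read/write
                                  lock in practice) */
};
```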

   There are both read locks and write locks. I am not sure prior to
   implementation if we need to convert read locks to write locks
   (probably not in the balancing code but maybe elsewhere), but if we
   do, then let the following hold. When converting a read lock to
   a write lock, first convert the read lock to a seal; then, after
   waiting for any other processes' read locks to be released,
   convert the seal to a write lock. This prevents deadlocks for the
   case in which two processes have read locks on the same node, and
   both convert the read locks into write locks.

   Acquisition of locks but not seals is performed in the order of the
   {device,block number} pair of the nodes locked. THIS IS IMPORTANT
   to avoiding deadlocks. The particular ordering chosen is not
   important; what is important is that some ordering be used
   consistently. It seems that Linux might be better off if, as IRIX
   is said to have, it had a master ordered list of locks that all
   programmers employ, update, and never acquire locks in an order
   different from.
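   The ordering rule could be sketched as a comparator plus a
   sort-then-lock helper (znode_cmp and lock_in_order are invented
   names, and the actual lock acquisition is elided):

```c
#include <stdlib.h>

struct znode {
    unsigned int dev;          /* device the node lives on */
    unsigned long block;       /* block number within that device */
};

/* Total order on nodes: device first, then block number. Any
   consistent total order would do equally well. */
static int znode_cmp(const void *a, const void *b)
{
    const struct znode *x = *(const struct znode * const *)a;
    const struct znode *y = *(const struct znode * const *)b;
    if (x->dev != y->dev)
        return x->dev < y->dev ? -1 : 1;
    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    return 0;
}

/* Sort the nodes to be locked, then take the locks in that order;
   two processes locking overlapping sets can then never wait on
   each other in a cycle. */
static void lock_in_order(struct znode **nodes, size_t n)
{
    qsort(nodes, n, sizeof(*nodes), znode_cmp);
    /* for each i: write_lock(nodes[i]); -- lock step omitted here */
}
```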

   When we start fix_nodes we search for the leaf containing the item
   (direct or indirect or directory) which we will affect in the
   intended operation. We seal that leaf, and then we find its parent
   and seal it. Then we perform more analysis, and we seal the
   parents of the left and right neighbors. Then, if we think we will
   alter the neighbors we seal them. We also potentially seal
   ancestors of the leaf node and neighbors, and neighbors of
   ancestors, depending on the results of our analysis. If our
   balancing analysis indicates that we should not write to a neighbor
   we place a read lock on the parent of the neighbor, otherwise we
   place a write lock on the neighbor and its parent. (Parent nodes
   currently record the free space of their children. We depend on
   this value remaining unchanged during balancing, and so it is
   necessary to read lock the parents of the neighbor nodes.)
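   The neighbor decision at the end of that pass could be sketched
   as follows (toy flags instead of real locks; lock_neighbor and
   the field names are invented):

```c
struct znode {
    struct znode *parent;
    int read_locked;           /* toy stand-ins for real per-buffer */
    int write_locked;          /* read/write locks                  */
};

/* Per the analysis above: if the neighbor will not be written, read
   lock only its parent (the parent caches the child's free space,
   which must not change under us); if the neighbor will be written,
   write lock both the neighbor and its parent. */
static void lock_neighbor(struct znode *nb, int will_modify)
{
    if (will_modify) {
        nb->write_locked = 1;
        nb->parent->write_locked = 1;
    } else {
        nb->parent->read_locked = 1;
    }
}
```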

 
   Locking is performed in fix_nodes, and remains in place until
   do_balance completes. Sealing is performed in fix_nodes, and seals
   remain until fix_nodes replaces the seals with locks.

   If a node's space consumption changes, its neighbors may need to
   remain read locked until balancing completes. This point is
   arguable; a less than perfectly balanced tree is not the end of
   the world.

*/
