Re: filesystem transactions API

From: Jamie Lokier
Date: Tue Apr 26 2005 - 10:53:47 EST


Trond Myklebust wrote:
> > Jamie> No. A transaction means that _all_ processes will see the
> > Jamie> whole transaction or not.
> >
> > This is really hard. How do you handle the case where process X
> > starts a transaction modifies files a, b & c, but process Y has file b
> > open for writing, and never lets it go? Or the file gets unlinked?
>
> That is why implementing it as a form of lock makes sense.

The problem with making them exclusive locks is that you halt the
system for the duration of the transaction. If it's a big transaction
such as updating 1000 files for a package update, that blocks a lot of
programs for a long time, and it's not necessary.

And, because that's a potential denial of service, you have to limit
the size of transactions and their duration, especially for ordinary
users. That makes transactions a lot less useful than they can be.

I would implement them as a combination of time-limited lock, and
abortable transaction with file & directory reads establishing
prerequisites.

While the transaction lock is held, everything read (i.e. read byte
ranges, lock byte ranges, directory lookups, and stat results) cause
the corresponding range or inode to be exclusively locked for this
transaction, and also cause them to be recorded in the prerequisite
set for this transaction. Everything written (i.e. byte ranges or any
other filesystem modifying operation) is queued.

If the transaction lock timeout is reached before the transaction is
closed, all the exlusive locks for this transaction are released, and
the transaction lock itself is released, and the prerequisite set
continues to be recorded.

If at any time, another process tries to modify any of the information
in the transaction's prerequisite set, then firstly: if the
transaction lock is held, the other process is blocked until that lock
is released. Secondly: if the other process successfully modifies
information in the transaction's prerequisite set, the transaction is
aborted. All further operations in this transaction will fail,
including reads, writes, and the final close which commits writes.

Finally, when the transaction is closed, either it fails because
prerequisites were modified, or it commits all the pending filesystem
modifications of this transaction.

Why two phases?

The second phase, with no exclusive locking, is to allow ordinary
users to use transactions without blocking other processes or hogging
excessive system resources. It allows other processes to progress
while a big transaction is in progress. In other words, it prevents
some kinds of denial-of-service, allows arbitrarily large transactions
as long as there's enough space in the filesystem, and is generally
better.

The first phase, with exlusive locking, uses a randomised timeout for
the lock. This is to prevent starvation of transacting processes by
other processes. It's analogous to the problem of readers starving
writers in some kinds of read-write locks. The randomised timeout is
to prevent mutual starvation between two or more transacting
processes, which might otherwise get into synchronised livelock.

Enjoy :)
-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/