Re: filesystem transactions API

From: Jamie Lokier
Date: Wed Apr 27 2005 - 10:19:33 EST


Ville Herva wrote:
> > How do we specify which calls belong to a transaction? By some kind of
> > extra file handle?
> >
> > I'd think having global per-process transaction is not the best way.
> > So I think we should have some kind of transaction handle (probably in
> > the file handle space) and a way to say that a syscall is done within
> > a transaction. To avoid duplicating all syscalls, we could have
> > set_active_transaction() operation.
>
> That's more or less what NTFS does. See the example at
> http://blogs.msdn.com/because_we_can/

That's the obvious choice but it limits the usefulness quite a lot.

If we have transactions, then I'd like to be able to do this from a shell:

transaction_open t

tar xvpSfz blahblah.tar.gz
cd blahblah
patch -p1 -E < foo.patch
# etc.

transaction_close $t

I'd also like to write inside a single C program:

transaction * t = transaction_open ();

/* Ordinary complicated filesystem operations here... */
link (a, b);
rename (c, d);
/* read, write, stat etc. */
conf = open ("/etc/blahblah.conf", O_RDONLY);
read (conf, ...);
close (conf);
/* If /etc/blahblah.conf is changed by another program during
the transaction, the transaction is invalidated, because the
dbm update below is dependent on what was read... */
dbm_open (...);
do_dbm_stuff (...);
dbm_close (...);
/* Whatever this command does, I'd like to include in the transaction. */
system ("perl -pi -e 's/old_value/new_value/g' /etc/another.conf");

transaction_close (t);

Fundamentally, if transactions are supported in the kernel then these
two usages are easy to offer:

1. Ordinary file system calls as part of a transaction.

This allows libraries which are not transaction-aware to be
used, such as the dbm example above, and other things like XML
parsers/writers.

2. Subprocesses inherit a transaction, so a program can execute
complex transactions by using other programs.

It's useful, and there is no good reason to disallow that.

Nonetheless, there's a need for some kind of transaction handles. A
file descriptor representing a transaction seems like a natural fit.
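
One reason a descriptor fits: descriptors are already inherited across
fork() and, unless FD_CLOEXEC is set, across exec(), which is the kind of
inheritance the subprocess case needs. A minimal sketch of that property
only, assuming a pipe read end as a stand-in for a transaction handle.
mock_transaction_open() and child_inherits_fd() are made-up names, and
nothing here is actually transactional:

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for transaction_open(): returns an ordinary
   descriptor (a pipe read end). No real transaction exists behind it. */
int mock_transaction_open(void)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;
    close(p[1]);            /* only the read end is kept as the "handle" */
    return p[0];
}

/* Returns 1 if a forked child still sees `fd` as open, 0 if it does not,
   and -1 if fork() or waitpid() fails. */
int child_inherits_fd(int fd)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: F_GETFD fails (EBADF) when fd is not open here. */
        _exit(fcntl(fd, F_GETFD) == -1 ? 1 : 0);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

With t = mock_transaction_open(), child_inherits_fd(t) reports that the
child sees the handle; after close(t) it reports that it does not. A real
transaction fd would presumably follow the same rules, but that is an
assumption about a kernel interface that does not exist yet.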

Complex programs will want to have multiple transactions at the same
time: for example, any program structured using event-driven logic or
async I/O may have multiple independent state machines per thread,
each wanting its own transaction.

This suggests a few things:

- Transactions have a file descriptor to represent them.

- Each thread has a "current transaction" that applies to all filesystem
operations made by that thread.

- Concurrent threads will need their own current transactions, even
while keeping "current directory" global to the whole process for
POSIX reasons. A process-wide "current transaction" is too coarse.

- Transactions should be automatically nestable: a program or
library which uses transactions should itself be callable from a
program or library which is using a transaction.

- Transactions should record when they cannot provide transactional
semantics for some operation that is attempted (e.g. writing to
a file on a remote filesystem), and abort the transaction.

- When a transaction aborts due to the actions of _another_ process
(or thread) which is outside the transaction, that abort is an
event which should be detectable synchronously (by polling the
transaction fd) or asynchronously (by a signal - the SIGIO
mechanism is fine for this).

- An exclusive locking period should be optional, requested by a
flag when opening the transaction. Most usages will want the
locking period with its default parameters.

- Ideally, programs or mechanisms which provide alternative views of
part of a filesystem, such as search results (Beagle), tarfs, or
mailfs, should be able to update synchronously with transactions
that affect whatever the view is watching, so that the view
changes are effectively part of the transaction. This does _not_
mean that a transaction must wait for watchers to calculate
anything. It does mean a transaction must synchronously and
simultaneously invalidate caches held by watchers during the
atomic commit.
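
To make a few of the points above concrete (a descriptor per transaction,
a per-thread current transaction, automatic nesting, and detecting an
abort by polling), here is a user-space mock. It is only a sketch: every
name (struct txn, txn_open, txn_abort, txn_aborted, txn_close, txn_depth)
is hypothetical, a pipe stands in for the transaction fd, and the rule
that aborting an outer transaction also invalidates the ones nested in it
is my assumption, not something a kernel provides today:

```c
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical user-space mock; none of these names is a real kernel API. */
struct txn {
    int rfd;            /* the "transaction fd": readable once aborted  */
    int wfd;            /* mock only: write end used to raise the abort */
    struct txn *parent; /* enclosing transaction, NULL at top level     */
};

/* Per-thread current transaction, unlike the process-wide cwd. */
static _Thread_local struct txn *current_txn;

/* Open a transaction nested inside the calling thread's current one. */
struct txn *txn_open(void)
{
    int p[2];
    struct txn *t = malloc(sizeof *t);

    if (t == NULL)
        return NULL;
    if (pipe(p) < 0) {
        free(t);
        return NULL;
    }
    t->rfd = p[0];
    t->wfd = p[1];
    t->parent = current_txn;
    current_txn = t;
    return t;
}

/* Nesting depth of the calling thread's current transaction. */
int txn_depth(void)
{
    int n = 0;
    for (struct txn *t = current_txn; t != NULL; t = t->parent)
        n++;
    return n;
}

/* Mock of a conflict caused from outside: make t's fd readable. */
int txn_abort(struct txn *t)
{
    return write(t->wfd, "x", 1) == 1 ? 0 : -1;
}

/* Synchronous check: poll t and each enclosing transaction, no blocking.
   Assumption: an aborted ancestor invalidates everything nested in it. */
int txn_aborted(const struct txn *t)
{
    for (; t != NULL; t = t->parent) {
        struct pollfd pfd = { .fd = t->rfd, .events = POLLIN };
        if (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN))
            return 1;
    }
    return 0;
}

/* Close the innermost transaction: 0 if it would commit, -1 if it was
   aborted or if t is not the calling thread's current transaction. */
int txn_close(struct txn *t)
{
    int rc;

    if (t == NULL || t != current_txn)
        return -1;
    rc = txn_aborted(t) ? -1 : 0;
    current_txn = t->parent;
    close(t->rfd);
    close(t->wfd);
    free(t);
    return rc;
}
```

A library that calls txn_open()/txn_close() internally nests under
whatever transaction its caller already has, without either side knowing
about the other. The asynchronous (SIGIO) notification, the locking flag
and the watcher invalidation are left out of the mock.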

-- Jamie