[ANNOUNCE] Reiser5: Data Tiering. Burst Buffers. Speedup synchronous modifications

From: Edward Shishkin
Date: Mon May 25 2020 - 21:08:48 EST

Reiser5: Data Tiering. Burst Buffers
Speedup synchronous modifications

Dumping peaks of IO load to a proxy device

Now you can add a small high-performance block device to your large
logical volume composed of relatively slow commodity disks, and the
whole volume will appear to have throughput as high as that of the
"proxy" device!

This is based on the simple observation that in real life IO load
comes in peaks, and the idea is to dump those peaks to a
high-performance "proxy" device. Usually you have enough time between
peaks to flush the proxy device, that is, to migrate the "hot data"
from the proxy device to slow media in background mode, so that your
proxy device is always ready to accept a new portion of "peaks".
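
The observation above can be illustrated with a toy model. This is
only a sketch, not Reiser5 code: the function name, the notion of a
"tick" and the numbers are assumptions made for illustration. Each
tick the proxy absorbs the incoming writes, then a background flush
migrates up to flush_rate blocks to slow media.

```python
def simulate_proxy(load, proxy_capacity, flush_rate):
    """Return the proxy fill level (in blocks) at the end of each tick.

    load           -- blocks written by applications in each tick
    proxy_capacity -- size of the proxy device, in blocks
    flush_rate     -- blocks per tick migrated to slow media in background
    """
    fill = 0
    history = []
    for incoming in load:
        fill += incoming                   # the peak is absorbed by the proxy
        if fill > proxy_capacity:
            raise RuntimeError("proxy device is full")
        fill = max(0, fill - flush_rate)   # background flush to slow media
        history.append(fill)
    return history


if __name__ == "__main__":
    # Two peaks of 8 blocks, each followed by three idle ticks.
    print(simulate_proxy([8, 0, 0, 0, 8, 0, 0, 0],
                         proxy_capacity=10, flush_rate=2))
```

In this model the proxy is empty again before the second peak
arrives. If the peaks come too close together (e.g. load [8, 8] with
the same parameters), the proxy overflows, which is why enough time
between peaks matters.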

This technique, also known as "Burst Buffers", initially appeared in
the area of HPC. Nevertheless, it is also important for ordinary
applications. In particular, it allows you to speed up applications
that perform so-called "atomic updates".

Speedup "atomic updates" in user-space

There is a whole class of applications with high requirements for data
integrity. Such applications (typically databases) want to be sure
that any data modification either completes or does not happen at all,
and never appears as partially applied. Some applications have weaker
requirements: with some restrictions they also accept partially
applied modifications.

Atomic updates in user space are performed via a sequence of 3 steps.
Suppose you need to modify data of some file "foo" in an atomic way.
For this you need to:

1. write a new temporary file "foo.tmp" with modified data
2. issue fsync(2) against "foo.tmp"
3. rename "foo.tmp" to "foo".

At step 1 the file system populates the page cache with the new data.
At step 2 the file system allocates disk addresses for all logical
blocks of the file foo.tmp and writes that file to disk. At step 3 all
blocks containing old data get released.
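
For reference, the three steps above can be sketched in user space as
follows. This is a minimal illustration, assuming a hypothetical
helper named atomic_update() and the ".tmp" suffix from the example;
it is not taken from any particular application.

```python
import os
import tempfile


def atomic_update(path, data):
    """Replace the contents of `path` with `data` via write-fsync-rename."""
    tmp_path = path + ".tmp"   # mirrors "foo.tmp" from the text

    # Step 1: write the new temporary file (data lands in the page cache).
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()            # push Python's own buffer down to the kernel
        # Step 2: fsync(2) forces the file's blocks to stable storage.
        os.fsync(tmp.fileno())

    # Step 3: rename over the old name; readers see either the old
    # or the new contents, never a mix of both.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    foo = os.path.join(workdir, "foo")
    atomic_update(foo, b"version 1\n")
    atomic_update(foo, b"version 2\n")
    with open(foo, "rb") as f:
        print(f.read())
```

The fsync(2) call in step 2 is the synchronous part that an
application has to wait for.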

Note that steps 2 and 3 cause a substantial performance drop on slow
media. The situation improves when all dirty data are written to a
dedicated high-performance proxy disk, which is exactly what happens
in a file system with Burst Buffers support.

Speedup all synchronous modifications (TODO)
Burst Buffers and transaction manager

Not only dirty data pages, but also dirty meta-data pages can be
dumped to the proxy-device, so that step (3) above also won't
contribute to the performance drop.

Moreover, not only new logical data blocks can be dumped to the proxy
disk. All dirty data pages, including those that already have a
location on the main (slow) storage, can also be relocated to the proxy
disk, thus speeding up synchronous modification of files in _all_
cases (not only in atomic updates via the write-fsync-rename sequence
described above).

Indeed, recall that any modified page is always written to disk in
the context of committing some transaction. Depending on the commit
strategy (there are 2 of them: "relocate" and "overwrite"), for each
such modified dirty page there are only 2 possibilities:

a) to be written right away to a new location,
b) to be written first to a temporary location (journal), then to be
written back to permanent location.

With Burst Buffers support, in case (a) the file system writes the
dirty page right away to the proxy device. Then the user should take
care to migrate it back to the permanent storage (see section
"Flushing proxy device" below). In case (b) the modified copy will be
written to the proxy device (wandering logs), then at checkpoint time
(playing a transaction) the reiser4 transaction manager will write it
to the permanent location (on commodity disks). In this case the user
doesn't need to worry about flushing the proxy device; however, the
procedure of
commit takes more time, as user should also wait for "checkpoint

So from the standpoint of performance, the "write-anywhere"
transaction model (reiser4 mount option "txmod=wa") is preferable to
the journalling model (txmod=journal), or even the hybrid model
(txmod=hybrid).
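
The difference between cases (a) and (b) can be summarized with a
small model of the writes issued for one dirty page when a proxy
device is present. This is only a sketch: the names Commit and
commit_plan() are hypothetical and do not correspond to reiser4
internals.

```python
from enum import Enum


class Commit(Enum):
    RELOCATE = "relocate"     # case (a): write right away to a new location
    OVERWRITE = "overwrite"   # case (b): journal first, then write back


def commit_plan(strategy):
    """Return the ordered writes for one dirty page as (device, purpose)."""
    if strategy is Commit.RELOCATE:
        # One write, to the proxy. Migration back to the main storage
        # is left to a later flush of the proxy device.
        return [("proxy", "new location")]
    # Two writes: the journal copy goes to the proxy, then at
    # checkpoint time the page goes to its permanent location.
    return [("proxy", "journal copy"), ("main", "permanent location")]


if __name__ == "__main__":
    for strategy in Commit:
        print(strategy.value, commit_plan(strategy))
```

In this model case (b) always includes a write to the main (slow)
storage, while case (a) touches only the proxy device at commit time.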

Predictable and non-predictable migration
Meta-data migration

As we already mentioned, not only dirty data pages, but also dirty
meta-data pages can be dumped to the proxy device. Note, however, that
non-predictable meta-data migration is not possible because of a
chicken-and-egg problem. Indeed, non-predictable migration means that
nobody knows on which device of your logical volume a stripe of data
will be relocated in the future. Such migration requires recording the
locations of data stripes. Now note that such records are always a
part of meta-data. Hence, you are not able to migrate meta-data in a
non-predictable way.

However, it is perfectly possible to distribute/migrate meta-data in a
predictable way (it will be supported in so-called "symmetric" logical
volumes - currently not implemented). A classic example of predictable
migration is RAID arrays (once you add a device to, or remove a device
from, the array, all data blocks migrate in a predictable way during
rebalancing). If relocation is predictable, then there is no need to
record the locations of data stripes - they can always be calculated.

Thus, non-predictable migration is applicable to data only.
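
To make the distinction concrete, here is a small sketch. The modulo
placement is an assumption chosen for simplicity (real RAID layouts
and Reiser5 distribution are more involved), and all names are
hypothetical.

```python
def predictable_device(stripe_no, n_devices):
    """Predictable placement: the device is computed, nothing is recorded."""
    return stripe_no % n_devices


class NonPredictablePlacement:
    """Placement decided at run time, so every location must be recorded.

    The `location` table plays the role of meta-data here.
    """

    def __init__(self):
        self.location = {}   # stripe number -> device id

    def relocate(self, stripe_no, device):
        self.location[stripe_no] = device

    def lookup(self, stripe_no):
        return self.location[stripe_no]


if __name__ == "__main__":
    # Predictable: adding a 4th device moves stripe 5, but its new
    # location can still be calculated.
    print(predictable_device(5, 3), predictable_device(5, 4))

    # Non-predictable: without the table the stripe cannot be found.
    placement = NonPredictablePlacement()
    placement.relocate(5, device=2)
    print(placement.lookup(5))
```

The table in the second variant is exactly the kind of record that
belongs to meta-data, which is why the table itself cannot be placed
in a non-predictable way.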

Definition of data tiering.
Using proxy device to store hot data (TODO)

Now we can precisely define tiering as (meta-)data relocation in
accordance with some strategy (automatic, or user-defined), so that
every relocated unit always gets a location on another
device-component of the logical volume.

During such relocation, block number B1 on device D1 gets released,
the first address component is changed to D2, the second component is
changed to 0 (which indicates an unallocated block number), then the
file system allocates block number B2 on device D2:

(D1, B1) -> (D2, 0) -> (D2, B2)
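
The transition above can be traced with a toy allocator. This is a
sketch under simplifying assumptions: the Device class and its
bump-pointer allocation are hypothetical, and only the convention
that block number 0 means "not allocated" is taken from the text.

```python
from dataclasses import dataclass, field

UNALLOCATED = 0   # block number 0 indicates "not allocated"


@dataclass
class Device:
    dev_id: int
    next_free: int = 1                      # block 0 is never handed out
    allocated: set = field(default_factory=set)

    def allocate(self):
        block = self.next_free
        self.next_free += 1
        self.allocated.add(block)
        return block

    def release(self, block):
        self.allocated.discard(block)


def relocate(addr, devices, target):
    """Move a unit to device `target`; return the sequence of addresses."""
    d1, b1 = addr
    if target == d1:
        raise ValueError("tiering always moves a unit to another device")
    states = [addr]
    devices[d1].release(b1)                       # B1 on D1 gets released
    states.append((target, UNALLOCATED))          # (D2, 0)
    states.append((target, devices[target].allocate()))   # (D2, B2)
    return states


if __name__ == "__main__":
    devices = {1: Device(1), 2: Device(2)}
    b1 = devices[1].allocate()
    print(relocate((1, b1), devices, target=2))
```

The check on `target` reflects the definition above: relocation to
the same device is not tiering.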

Note that tiering is not defined for simple volumes (i.e. volumes
consisting of only one device). Block relocation within one device
is always in the competence of a file system (to be precise, of block

Burst Buffers is just one of the strategies, in accordance with which
all new logical blocks (optionally, all dirty pages) always get a
location on a dedicated proxy device. As we have figured out, Burst
Buffers is useful for HPC applications, as well as for ordinary
applications executing fsync(2) frequently.

There are other data tiering strategies, which can be useful for other
classes of applications. All of them can be easily implemented in

For example, you can use the proxy device to store hot data only.
With such a strategy new logical blocks (which are always "cold") will
always go to the main storage (in contrast with Burst Buffers, where
new logical blocks first get written to the proxy disk). Once in a
while you need to scan your volume in order to push colder data out
of, and pull hotter data into, the proxy disk. Reiser5 contains a
common interface for this. It is possible to maintain a per-file, or
even per-blocks-extent, "temperature" of data (e.g. as a generation
counter), but we still don't have more or less satisfactory algorithms
to determine the "critical temperature" for pushing data in/out of the
proxy disk.
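
A periodic scan of this kind might look roughly like the sketch below.
It is not the Reiser5 interface: the function name and the data
layout are hypothetical, and the critical temperature is simply
passed in as a parameter, since choosing it well is the open problem
mentioned above.

```python
def plan_migration(temperature, on_proxy, critical):
    """Decide what to pull into and push out of the proxy disk.

    temperature -- dict: extent id -> temperature (e.g. a counter)
    on_proxy    -- set of extent ids currently stored on the proxy disk
    critical    -- assumed threshold; extents at or above it are "hot"
    """
    pull_in = sorted(e for e, t in temperature.items()
                     if t >= critical and e not in on_proxy)
    push_out = sorted(e for e in on_proxy
                      if temperature.get(e, 0) < critical)
    return pull_in, push_out


if __name__ == "__main__":
    temperature = {"a": 5, "b": 1, "c": 9}
    on_proxy = {"b", "c"}
    pull_in, push_out = plan_migration(temperature, on_proxy, critical=4)
    print("pull in:", pull_in)     # hot extents not yet on the proxy
    print("push out:", push_out)   # cold extents occupying the proxy
```

Extents with no recorded temperature are treated as cold here, which
matches the remark that new logical blocks are always "cold".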

Getting started with proxy disk over logical volume

Just follow the administration guide:

WARNING: THE STUFF IS NOT STABLE! Don't store important data on
Reiser5 logical volumes till beta-stability announcement.