[RFC 00/32] State of MARS Geo-Redundancy Module

From: Thomas Schoebel-Theuer
Date: Fri Dec 30 2016 - 18:01:35 EST

Hi all,

here is my traditional annual status report on the development of MARS [1].

In the meantime, the out-of-tree MARS has replaced DRBD as the backbone
of the 1&1 geo-redundancy feature as publicly advertised for 1&1
Shared Hosting Linux (ShaHoLin). MARS is also running on several other
1&1 clusters. Some other people around the world also seem to have
started using it.

At 1&1, MARS is now running on more than 2000 servers and on more
than 2 * 8 petabytes of data, and it has accumulated more than 20 million
operating hours.

The slides [1] explain why the sharding architecture supported
by MARS has no problems scaling out to such numbers, while some other
non-MARS and non-blocklevel 1&1 clusters (architecturally called
"Big Clusters" in the slides, although they have less than 1 PB of data)
have seemingly reached their _practical_ scaling limits at their practical
dimensioning and practical workload. This is despite their originally
being advertised as scaling almost "unlimited" in theory, which some
people seemed to _believe_ in the past. Some of the reasons for the
massive differences in scalability are explained in the slides, and
some more explanations will hopefully follow in 2017.

During 2016, I published several bugfix releases for the stable branch,
and some portability improvements to some newer kernels for the
out-of-tree (OOT) version of MARS at the github repo [2].

There is also a prototype of a prepatch-less WIP-compatibility branch
which is not yet merged with master.

Some minor developments have also started: there is a lab prototype for
md5 checksumming of 4k blocks on the underlying disk devices. This was
motivated by the observation that most operational incidents are due to
hardware defects, and we want to catch them as early as possible.
Conversely, this also implies that MARS is considered more stable than
the hardware, but of course this is _expected_ from a HA solution ;)
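
To illustrate the idea of per-block checksumming, here is a minimal
userspace sketch in Python. It is not the lab prototype (which lives in
the kernel); the function names and the in-memory layout of the checksum
list are assumptions made only for this example.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # 4k blocks, as mentioned above


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Return one MD5 digest per block; the last block may be short."""
    return [
        hashlib.md5(data[off:off + block_size]).digest()
        for off in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data: bytes, expected: List[bytes],
                        block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the indices of blocks whose MD5 differs from the stored one.

    Hypothetical helper: 'expected' stands for checksums recorded earlier,
    when the data was known to be good.
    """
    actual = block_checksums(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

Flipping a single bit anywhere inside a block changes that block's digest,
so a later scan reports exactly the affected block index.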

In January 2016, I took a few days of holiday to improve the upstream
version of MARS a little bit, mainly fixing some checkpatch issues. After
that, I had to do much other work at 1&1, so unfortunately the development
of this part got stuck at that point. Sorry. The attached code is more or
less just for your information, to show that this part is not dead and
that development will continue.

This autumn, I got a new boss and some new objectives for 2017.

One of the new objectives will involve MARS.

The current ShaHoLin sharding architecture has been directly migrated
from the former DRBD hardware setup to MARS: it just consists of about
1000 _pairs_ of hard-iron iSCSI storage servers and some hard-iron
standalone servers with local hardware RAIDs. They are hosting about
500 MARS resources (originally DRBD resources) just for the web servers;
there are even more resources at (already virtualized) database servers.

The former should be virtualized during 2017 for reducing the server iron,
likely using LXC and/or KVM.

The resulting future system should increase flexibility by MARS resource
data migration among the _whole_ pool. This means that MARS will give up
the traditional DRBD-like pairing in favour of a new feature: treating all
of the existing storage like one big "virtual LVM-like storage pool".

Notice that MARS's internal _architecture_ can already do this: it allows
for k > 2 replicas, and already has dynamic join-cluster and join-resource
and leave-resource operations which can be easily used for runtime data
migration during operation (while resources are loaded), and even for
very big resources. After adding a new operation "merge-cluster" which
checks that all the resource names are disjoint, this _would_ even work
with the current version of MARS, at least in theory.
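
The disjointness check for such a "merge-cluster" operation could look
roughly like the following Python sketch. This is only an illustration of
the precondition, not MARS code: the dictionary representation (resource
name mapped to the set of hosts holding a replica) and the function name
are assumptions.

```python
from typing import Dict, Set

# Hypothetical model: resource name -> set of hosts holding a replica.
Cluster = Dict[str, Set[str]]


def merge_cluster(a: Cluster, b: Cluster) -> Cluster:
    """Merge two clusters, refusing if any resource name exists in both."""
    clashes = a.keys() & b.keys()
    if clashes:
        raise ValueError(
            "resource names are not disjoint: " + ", ".join(sorted(clashes))
        )
    merged = dict(a)
    merged.update(b)
    return merged
```

Refusing the merge on any name clash keeps every resource name unique in
the resulting cluster, so the existing join-resource and leave-resource
operations remain unambiguous afterwards.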

However, the current limitation is at the internal _metadata_ updates
(not at the IO data paths): currently all cluster members are exchanging
all _metadata_ with all other nodes, leading to O(n^2) _metadata_
communications. A future version of MARS will reduce this to the
necessary scale (only among the nodes involved in some resources),
and it will communicate the other metadata less frequently and no
longer full-mesh.
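
To get a feeling for the difference, the following Python sketch counts
the node pairs that need to exchange metadata in both models. The
resource layout (disjoint pairs) and the numbers are illustrative
assumptions, and the function names are made up for this example.

```python
from itertools import combinations
from typing import Dict, Set


def full_mesh_pairs(n: int) -> int:
    """Node pairs exchanging metadata when every node talks to every other."""
    return n * (n - 1) // 2


def resource_scoped_pairs(resources: Dict[str, Set[str]]) -> int:
    """Node pairs that share at least one resource.

    'resources' maps a resource name to the set of hosts holding a replica
    (hypothetical representation).
    """
    pairs = set()
    for members in resources.values():
        # sorted() gives each pair one canonical orientation
        pairs.update(combinations(sorted(members), 2))
    return len(pairs)
```

For example, full_mesh_pairs(2000) yields 1999000 pairs, while 1000
resources that each live on their own pair of hosts need only 1000 pairs.
Migration among the whole pool will add pairs over time, but the count
grows with the number of replicas per resource rather than with the
square of the cluster size.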

As a side note: hopefully I will also get the necessary time for replacing
the current symlink tree by metadata files, which should also improve the
metadata scaling properties.

The goal will be a very low number of MARS clusters (one for US, one
for EU, and both probably split for web hosting versus databases)
consisting of several thousands of nodes. The realtime-critical data
IO paths will remain at the sharding principle, leading to excellent
scalability. Only the _metadata_ updates will follow the "big cluster"
architectural approach, and only as far as necessary. They are not
time-critical anyway.

As always, for the opensource community part of my work: it would be
nice if some other kernel hackers started joining the MARS
development in 2017, at least to help me get it upstream.

I would be excited if I were invited to the next kernel summit
or a similar meeting.

A happy new year from your devoted


[1] https://github.com/schoebel/mars/blob/master/docu/MARS_GUUG2016.pdf

[2] https://github.com/schoebel/mars

