[RFC] COLO Proxy Module

From: Li Zhijian
Date: Thu Jun 18 2015 - 02:38:20 EST


Hi, all

We are planning to implement a kernel module called COLO Proxy to buffer and
compare packets. This module is one of the important components of the COLO
project. It is still at an early stage, so any comments and feedback are warmly
welcomed. Thanks in advance.

=====
# RFC: COLO-Proxy Module

## Rationale

The COLO FT/HA (COarse-grain LOck-stepping Virtual Machines for Non-stop
Service) project is a high availability solution. The Primary VM (PVM) and the
Secondary VM (SVM) run in parallel. They receive the same requests from the
client and generate responses in parallel too. If the response packets from the
PVM and the SVM are identical, they are released immediately. Otherwise, an
on-demand VM checkpoint is conducted.
Paper:
http://www.socc2013.org/home/program/a3-dong.pdf?attredirects=0
COLO on Xen:
http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping
COLO on Qemu/KVM:
http://wiki.qemu.org/Features/COLO

To capture the response packets from the PVM and the SVM and find out
whether they are identical, we introduce a new kernel module called
colo-proxy.

This document describes the design of the colo-proxy module.

## Glossary

* PVM - Primary VM, which provides services to clients.
* SVM - Secondary VM, a hot standby and replica of the PVM.
* PN - Primary Node, the host on which the PVM runs.
* SN - Secondary Node, the host on which the SVM runs.

## Network topology

================================= Normal =====================================
+--------+
| client |
+----+---+
-------------------------+ | + -------------------------+
PN | + | SN|
+-------+ +----[eth0]-----[switch]-----[eth0]---------+ |
|PVM | +---+-+ | | +---+-+ |
| [tap0]--+ br0 | | | | br0 | |
| | +-----+ [eth1]-----[forward]----[eth1]--+ +-----+ |
+-------+ | | | +-------+|
| | | +-----+ | SVM||
[eth2]---[checkpoint]---[eth2] +--+ br1 |-[tap0] ||
| | +-----+ | ||
| | +-------+|
-------------------------+ +--------------------------+
e.g.
PN:
br0: 192.168.0.33
eth1: 192.168.1.33
eth2: 192.168.2.33

SN:
br0: 192.168.0.88
br1: no ip address
eth1: 192.168.1.88
eth2: 192.168.2.88


============================== After failover ================================
+--------+
| client |
+----+---+
-------------------------+ | ---------------------------+
PN (dead) | + | SN (alive)|
+-------+ +----[eth0]--X--[switch]-----[eth0]-------+ |
|PVM | +---+-+ | | +---+-+ |
| [tap0]--+ br0 | | | | br0 +--+ |
| | +-----+ [eth1]--X--[forward]----[eth1] +-----+ | |
+-------+ | | | +-------+|
| | +-----+ | | SVM||
[eth2]-X-[checkpoint]---[eth2] | br1 | +[tap0] ||
| | +-----+ | ||
| | +-------+|
-------------------------+ +--------------------------+

## Network flow

### Receive packets from client (Input)

+------+
|Client|
+---+--+
+-----------------------+ | +------------------------+
|PN | v | SN|
| +---[eth0]<---[switch] | +--------+ |
| +-------+ v | | | SVM | |
| | PVM | +-+-+ | | [tap0] | |
| | [tap0]<-+br0| | | ^ | | |
| | | | +---+ | | | +--------+ |
| +-------+ | | | +-+-------------+ |
| +-------->[eth1]------------->[eth1]--->colo-proxy | |
| copy&forward| | |*Adjust | |
| | | | Client's ack | |
+-----------------------+ +-----+---------------+--+

* colo-proxy on SN:
** Capture the first ack from the client to find out the initial seq number of
the tcp connection on the PVM (for seq number adjustment).
** Adjust the ack/sack from the client until the next checkpoint, so that the
tcp connection on the SVM won't break.

### Response packets (Output)

+------+
|Client|
+---^--+
+----------------------------+ | +------------------------+
|PN + + | SN|
| +----+ checkpoint +-->[eth0]+-->[switch] | +---------+ |
| |PVM | ^ | | + | + SVM | |
| +-+--+ | v +-+-+ | | [tap0] | |
| | |[tap0]->br0| | | + + | |
+---v--+ | ^ +---+ + + | +---------+ |
||Vhost| | | ++[eth1]<------------+[eth1]<---+v-------------+ |
+---+--+ | | | + + |colo-proxy | |
| | No | |Yes | | | |*Adjust SVM's | |
+---|--------|--|--------|---+ | | Seq number | |
| | identical? | | +------+--------------+--+
| +-v-----+ ^ +-----v-+ |
| |enqueue+---+ |enqueue| |
| +-------+compare +-------+ |
| |
| colo-proxy |
+----------------------------+



* colo-proxy on SN:
** Track the initial seq number of the tcp connection on the SVM
(for seq number adjustment).
** Adjust the seq number of packets from the SVM until the next checkpoint.
* colo-proxy on PN:
** Enqueue the packets from the SVM.
** Enqueue the packets from the PVM.
** Compare the tcp payload data of these two queues.
** If the data is identical, release the PVM queue and drop the SVM queue.
** If the data is not identical, notify the upper layer (userspace tools:
QEMU, or libxl on Xen) that a checkpoint is needed.
** Release the PVM queue and drop the SVM queue at checkpoint.
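
The comparison step on the PN can be sketched in plain userspace C as follows.
This is only an illustration, not the module's code: `struct colo_pkt`,
`enum colo_verdict`, `colo_compare()` and `colo_compare_queues()` are
hypothetical names, and the sketch assumes both VMs split the TCP stream into
the same packets, which a real implementation cannot rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one queued TCP payload. */
struct colo_pkt {
    const unsigned char *payload;
    size_t len;
};

enum colo_verdict {
    COLO_RELEASE,    /* identical: release the PVM packet, drop the SVM one */
    COLO_CHECKPOINT, /* different: userspace must be asked for a checkpoint */
};

/* Compare the payloads of one PVM packet and one SVM packet. */
static enum colo_verdict colo_compare(const struct colo_pkt *pvm,
                                      const struct colo_pkt *svm)
{
    assert(pvm != NULL && svm != NULL);

    if (pvm->len != svm->len)
        return COLO_CHECKPOINT;
    if (pvm->len != 0 && memcmp(pvm->payload, svm->payload, pvm->len) != 0)
        return COLO_CHECKPOINT;
    return COLO_RELEASE;
}

/*
 * Walk both queues pairwise. Returns the number of leading pairs that were
 * identical; *verdict says whether a mismatch stopped the walk.
 */
static size_t colo_compare_queues(const struct colo_pkt *pvm_q, size_t pvm_n,
                                  const struct colo_pkt *svm_q, size_t svm_n,
                                  enum colo_verdict *verdict)
{
    size_t n = pvm_n < svm_n ? pvm_n : svm_n;
    size_t i;

    *verdict = COLO_RELEASE;
    for (i = 0; i < n; i++) {
        if (colo_compare(&pvm_q[i], &svm_q[i]) == COLO_CHECKPOINT) {
            *verdict = COLO_CHECKPOINT;
            break;
        }
    }
    return i;
}
```

The returned count tells the caller how many PVM packets could be released
before a mismatch; packets left over in the longer queue simply stay queued
until their counterparts arrive or a checkpoint happens.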

### After failover

At this point, the PN is dead and the SVM is serving the clients.

#### Receive packets from client (Input)

+------+
|Client|
+---+--+
|
+---v--+
|Switch|
+---+--+
v
+-------------[eth0]--------------+
| |-------+ SN |
| +------v---------+ |
| |colo-proxy | |
| |*Adjust client's| |
| | ack number | |
| +------+---------+ |
| | |
| | +-----------+ |
| | | SVM | |
| +--->[tap0] | |
| | | |
| +-----------+ |
+---------------------------------+

* colo-proxy on SN:
** Adjust the ack/sack number from the client. This only applies to existing
tcp connections.

#### Response packets (Output)

+------+
|Client|
+---^--+
|
+---+--+
|Switch|
+---^--+
+
+-------------[eth0]--------------+
| |-------^ SN |
| +----------------+ |
| |colo-proxy | |
| |*Adjust SVM's | |
| | seq number | |
| +------^---------+ |
| | |
| | +-----------+ |
| | | SVM | |
| +---+[tap0] | |
| | | |
| +-----------+ |
+---------------------------------+

* colo-proxy on SN:
** Adjust the seq number of the packets returned by the SVM. This only applies
to existing tcp connections.

NOTE:
We track the initial seq number of the tcp connection on both PVM/SVM so that
we can calculate the offset when we do the seq adjustment after failover.
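
The offset arithmetic described in the note can be illustrated with a small
userspace sketch. The structure and function names below are hypothetical and
are not taken from the module; the sketch only assumes that the proxy has
recorded the initial seq number of the connection on each VM. Unsigned 32-bit
arithmetic is used so that wrap-around of the seq space is handled naturally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state; field names are illustrative only. */
struct colo_seq_state {
    uint32_t pvm_isn; /* initial seq number chosen by the PVM */
    uint32_t svm_isn; /* initial seq number chosen by the SVM */
};

/* Offset that maps the SVM's seq space onto the one the client has seen. */
static uint32_t colo_seq_offset(const struct colo_seq_state *st)
{
    assert(st != NULL);
    return (uint32_t)(st->pvm_isn - st->svm_isn); /* wraps modulo 2^32 */
}

/* Output path: rewrite a seq number sent by the SVM for the client. */
static uint32_t colo_adjust_seq(const struct colo_seq_state *st,
                                uint32_t svm_seq)
{
    return (uint32_t)(svm_seq + colo_seq_offset(st));
}

/* Input path: rewrite an ack number sent by the client for the SVM. */
static uint32_t colo_adjust_ack(const struct colo_seq_state *st,
                                uint32_t client_ack)
{
    return (uint32_t)(client_ack - colo_seq_offset(st));
}
```

The two adjustments are inverses of each other, so a seq number rewritten on
the way out is mapped back to the SVM's own value when the client acks it.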

## Implementation

We achieve our goal by extending the nf_conntrack mechanism.

There are 4 kernel modules in colo-proxy:

### nf_conntrack_colo

In this module we add an nf_conntrack extension named 'colo':
<pre>
static struct nf_ct_ext_type nf_ct_colo_extend __read_mostly = {
    .len     = sizeof(struct nf_conn_colo),
    .move    = nf_ct_colo_extend_move,
    .destroy = nf_ct_colo_extend_destroy,
    .align   = __alignof__(struct nf_conn_colo),
    .id      = NF_CT_EXT_COLO,
};
</pre>
This extension holds the essential state needed by colo-proxy, e.g. the
node status and the tcp connection status.

### xt_PMYCOLO

This module is for the PN. It does the following:

* Register an xt_target (which cooperates with iptables) to initialize the PN
node status and run a kernel thread to compare packets.
<pre>
static struct xt_target colo_primary_tg_regs[] __read_mostly = {
    {
        .name       = "PMYCOLO",
        .family     = NFPROTO_UNSPEC,
        .target     = colo_primary_tg,
        .checkentry = colo_primary_tg_check,
        .destroy    = colo_primary_tg_destroy,
        .targetsize = sizeof(struct xt_colo_primary_info),
        .table      = "mangle",
        .hooks      = (1 << NF_INET_PRE_ROUTING),
        .me         = THIS_MODULE,
    },
};

static int colo_primary_tg_check(const struct xt_tgchk_param *par)
{
    /*
     * Setup forward device, init primary node status, create kthread for
     * packets comparison.
     */
}
</pre>

* Register an nf_queue_handler to enqueue packets sent by the PVM.
<pre>
static const struct nf_queue_handler coloqh = {
    .outfn = &colo_enqueue_packet,
};
</pre>

* Register some nf hooks to enqueue packets sent by the SVM.
<pre>
static struct nf_hook_ops colo_primary_ops[] __read_mostly = {
    {
        .hook     = colo_slaver_queue_hook,
        .owner    = THIS_MODULE,
        .pf       = NFPROTO_IPV4,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_RAW + 1,
    },
    {
        .hook     = colo_slaver_queue_hook,
        .owner    = THIS_MODULE,
        .pf       = NFPROTO_IPV6,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_RAW + 1,
    },
    {
        .hook     = colo_slaver_arp_hook,
        .owner    = THIS_MODULE,
        .pf       = NFPROTO_ARP,
        .hooknum  = NF_ARP_IN,
        .priority = NF_IP_PRI_FILTER + 1,
    },
};
</pre>

### xt_SECCOLO

This module is for the SN. It does the following:

* Register an xt_target (which cooperates with iptables) to initialize the SN
node status.
<pre>
static struct xt_target colo_secondary_tg_regs[] __read_mostly = {
    {
        .name       = "SECCOLO",
        .family     = NFPROTO_UNSPEC,
        .target     = colo_secondary_tg,
        .checkentry = colo_secondary_tg_check,
        .destroy    = colo_secondary_tg_destroy,
        .targetsize = sizeof(struct xt_colo_secondary_info),
        .table      = "mangle",
        .hooks      = (1 << NF_INET_PRE_ROUTING),
        .me         = THIS_MODULE,
    },
};
</pre>

* Register some nf hooks to track the initial seq number of the tcp
connections on both the PVM and the SVM, and do the seq adjustment for the
SVM (by using the existing nf_conntrack_seqadj module).
<pre>
static struct nf_hook_ops colo_secondary_ops[] __read_mostly = {
    {
        .hook     = colo_secondary_hook,
        .owner    = THIS_MODULE,
        .pf       = NFPROTO_IPV4,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_MANGLE + 1,
    },
    {
        .hook     = colo_secondary_hook,
        .owner    = THIS_MODULE,
        .pf       = NFPROTO_IPV6,
        .hooknum  = NF_INET_PRE_ROUTING,
        .priority = NF_IP_PRI_MANGLE + 1,
    },
};
</pre>

### nfnetlink_colo

This module is for communication with userspace tools such as QEMU or libxl.

In this module, we add a colo protocol to the existing nfnetlink mechanism.
<pre>
static const struct nfnetlink_subsystem nfulnl_subsys = {
    .name      = "colo",
    .subsys_id = NFNL_SUBSYS_COLO,
    .cb_count  = NFCOLO_MSG_MAX,
    .cb        = nfnl_colo_cb,
};

static const struct nfnl_callback nfnl_colo_cb[NFCOLO_MSG_MAX] = {
    [NFCOLO_KERNEL_NOTIFY] = { .call       = NULL,
                               .policy     = NULL,
                               .attr_count = 0, },
    [NFCOLO_DO_CHECKPOINT] = { .call       = colo_do_checkpoint,
                               .policy     = nfnl_colo_policy,
                               .attr_count = NFNL_COLO_MAX, },
    [NFCOLO_DO_FAILOVER]   = { .call       = colo_do_failover,
                               .policy     = nfnl_colo_policy,
                               .attr_count = NFNL_COLO_MAX, },
    [NFCOLO_PROXY_INIT]    = { .call       = colo_init_proxy,
                               .policy     = nfnl_colo_policy,
                               .attr_count = NFNL_COLO_MAX, },
    [NFCOLO_PROXY_RESET]   = { .call       = colo_reset_proxy,
                               .policy     = nfnl_colo_policy,
                               .attr_count = NFNL_COLO_MAX, },
};
</pre>
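
For readers less familiar with nfnetlink: the netlink message type carries the
subsystem id in its high byte and the per-subsystem message number in its low
byte, which is how a request from userspace would reach the callback table
above. The sketch below only illustrates that encoding in userspace C. The
numeric value used for the colo subsystem and the message numbers are
placeholders, since this RFC does not define them, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder value; the real NFNL_SUBSYS_COLO id is not defined here. */
#define COLO_EXAMPLE_SUBSYS_ID 12u

/* Placeholder numbering; only the names' order follows the callback table. */
enum colo_example_msg {
    COLO_EXAMPLE_KERNEL_NOTIFY,
    COLO_EXAMPLE_DO_CHECKPOINT,
    COLO_EXAMPLE_DO_FAILOVER,
    COLO_EXAMPLE_PROXY_INIT,
    COLO_EXAMPLE_PROXY_RESET,
    COLO_EXAMPLE_MSG_MAX
};

/* Build an nfnetlink message type: subsystem in bits 15..8, message in 7..0. */
static uint16_t colo_nlmsg_type(uint8_t subsys_id, uint8_t msg)
{
    assert(msg < COLO_EXAMPLE_MSG_MAX);
    return (uint16_t)(((uint16_t)subsys_id << 8) | msg);
}

/* Extract the subsystem id from a message type. */
static uint8_t colo_nlmsg_subsys(uint16_t type)
{
    return (uint8_t)(type >> 8);
}

/* Extract the per-subsystem message number from a message type. */
static uint8_t colo_nlmsg_msg(uint16_t type)
{
    return (uint8_t)(type & 0xff);
}
```

With this encoding the kernel side can route a message first by subsystem and
then by message number, which matches the shape of the callback array indexed
by message number.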
