Re: live kernel upgrades (was: live kernel patching design)

From: Vojtech Pavlik
Date: Mon Feb 23 2015 - 01:36:33 EST

Next message: Joe Perches: "Re: [PATCH] cxl: Remove useless precision specifiers"
Previous message: Matteo Semenzato: "Re: [PATCH] Staging: fbtft: fix whitespace errors"
In reply to: Pavel Machek: "Re: live kernel upgrades (was: live kernel patching design)"
Next in thread: Josh Poimboeuf: "Re: live kernel upgrades (was: live kernel patching design)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sun, Feb 22, 2015 at 03:01:48PM -0800, Andrew Morton wrote:

> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina <jkosina@xxxxxxx> wrote:
>
> > But if you ask the folks who are hungry for live bug patching, they
> > wouldn't care.
> >
> > You mentioned "10 seconds", that's more or less equal to infinity to them.
>
> 10 seconds outage is unacceptable, but we're running our service on a
> single machine with no failover. Who is doing this??

This is the most common argument that's raised when live patching is
discussed. "Why do we need live patching when we have redundancy?"

People who are asking for live patching typically do have failover in
place, but prefer not to have to use it when they don't have to.

In many cases, the failover just can't be made transparent to the
outside world and there is a short outage. Examples would be legacy
applications which can't run in an active-active cluster and need to be
restarted on failover. Or trading systems, where the calculations must
be strictly serialized and response times are counted in tens of
microseconds.

Another use case is large HPC clusters, where all nodes have to run
carefully synchronized. Once one gets behind in a calculation cycle,
others have to wait for the results and the efficiency of the whole
cluster goes down. There are people who run realtime on them for
that reason. Dumping all data and restarting the HPC cluster takes a lot
of time and many nodes (out of tens of thousands) may not come back up,
making the restore from media difficult. Doing a rolling upgrade causes
the nodes to stall one by one for 10+ seconds each, which times 10k is a
long time, too.
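
To put a rough number on that last point, here is a back-of-the-envelope
sketch. It uses the figures above (10 seconds per node, 10k nodes) and
assumes the stalls are strictly sequential and do not overlap; the
function name is made up for illustration.

```python
# Rough estimate of cumulative stall time in a rolling upgrade.
# Assumption: nodes are upgraded strictly one by one, so stalls add up.

def rolling_upgrade_stall(nodes: int, stall_seconds: float) -> float:
    """Total stall time in seconds for a strictly sequential upgrade."""
    return nodes * stall_seconds

total = rolling_upgrade_stall(nodes=10_000, stall_seconds=10)
print(f"{total:.0f} s = {total / 3600:.1f} h")  # prints "100000 s = 27.8 h"
```

So under these assumptions the cluster spends more than a day with at
least one node behind, and 10 seconds is only the lower bound per node.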

And even the case where you have a perfect setup with everything
redundant and with instant failover does benefit from live patching.
Since you have to plan for failure, you have to plan for failure while
patching, too. With live patching you need 2 servers minimum (or N+1),
without it you need 3 (or N+2), as one will be offline during the
upgrade process.

10 seconds of outage may be acceptable in a disaster scenario. Not
necessarily for a regular update scenario.

The value of live patching is in near zero disruption.

--
Vojtech Pavlik
Director SUSE Labs
