Re: live kernel upgrades (was: live kernel patching design)

From: Arjan van de Ven
Date: Sun Feb 22 2015 - 19:44:59 EST


There's failover, there's running the core services in VMs (which can
migrate)...
I think Ingo is exaggerating a bit with 10 seconds, since you can boot a
full system in a lot less time than that, and even less if you know more
about the system (e.g. you don't need to spin down and then discover and
spin up disks). If you're talking about inside a VM it's even more
extreme than that.


Now, live patching sounds great as an ideal, but it may end up being
(mostly) similar to hardware hotplug: everyone wants it, but nobody
wants to use it (and just waits for a maintenance window instead). In
the hotplug case, while people say they want it, they're also aware that
hardware hotplug is fundamentally messy, and then nobody wants to do it
on that mission-critical piece of hardware outside the maintenance
window.
(Hotswap drives seem to have been the exception to this; that seems to
have been worked out well enough, but that's replace-with-the-same.)
I would be very afraid that hot kernel patching ends up in the same
space: the super-mission-critical folks are who it's aimed at, while
those are the exact same folks that would rather wait for the
maintenance window.

There are a lot of logistical issues (can you patch a patched system...
if live patching is a first-class citizen you end up with dozens and
dozens of live patches applied, some out of sequence, etc.). There's
the "which patches do I have, and if the first patch for a security
hole was not complete, how do I cope by applying number two" problem.
There's the "which of my 50,000 servers have which patch applied"
logistics.
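
To make the bookkeeping concrete, here is a minimal sketch of what
tracking this could look like. Everything in it is an assumption for
illustration: struct live_patch, patch_set, patch_apply() and
fleet_missing() are made-up names, not an existing kernel or tooling
interface, and real patch dependencies would be messier than a single
"needs" field.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one live patch (illustration only). */
struct live_patch {
    unsigned id;   /* bit position in the applied set, 0..63 */
    int needs;     /* id of a patch that must already be applied, or -1 */
};

/* Patches applied on one server, one bit per patch id. */
typedef uint64_t patch_set;

static bool patch_applied(patch_set s, unsigned id)
{
    return id < 64 && ((s >> id) & 1u) != 0;
}

/*
 * Apply patch p on top of the set *s. Refuses a patch that is already
 * applied, and refuses out-of-sequence application (prerequisite
 * missing). Returns true and updates *s on success.
 */
static bool patch_apply(patch_set *s, const struct live_patch *p)
{
    if (p->id >= 64 || patch_applied(*s, p->id))
        return false;
    if (p->needs >= 0 && !patch_applied(*s, (unsigned)p->needs))
        return false;
    *s |= (patch_set)1 << p->id;
    return true;
}

/* Count the servers in a fleet that do not have patch id applied. */
static size_t fleet_missing(const patch_set *fleet, size_t n, unsigned id)
{
    size_t missing = 0;

    for (size_t i = 0; i < n; i++)
        if (!patch_applied(fleet[i], id))
            missing++;
    return missing;
}
```

Even this toy version has to decide what "number two on top of an
incomplete number one" means, which is rather the point.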

And Ingo is absolutely right: the scope is very fuzzy. Today's bugfix
is tomorrow's "oh oops, it turns out to be exploitable".

I will throw a different hat in the ring: maybe we don't want full
kernel update as step one, maybe we want this on a kernel module
level:
Hot-swap of kernel modules, where a kernel module makes itself go
quiet and serializes its state ("suspend" pretty much), then gets
swapped out (hot) by its replacement, which then unserializes the
state and continues.
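
As a rough userspace sketch of that sequence (quiesce, serialize, swap,
unserialize), assuming a made-up interface: struct swap_ops, hot_swap()
and the toy counter "module" below are hypothetical names for
illustration, not anything that exists in the kernel today.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hooks a hot-swappable module would provide. */
struct swap_ops {
    void   (*quiesce)(void *inst);                            /* go quiet */
    size_t (*save)(const void *inst, void *buf, size_t len);  /* serialize */
    int    (*restore)(void *inst, const void *buf, size_t len);
};

/* Versioned on-the-wire state handed from the old module to the new one. */
struct counter_state { unsigned version; unsigned long packets; };

/* Toy "modules": v2 grew a field that v1 never had. */
struct counter_v1 { unsigned long packets; int quiet; };
struct counter_v2 { unsigned long packets; unsigned long drops; int quiet; };

static void v1_quiesce(void *inst)
{
    struct counter_v1 *c = inst;

    c->quiet = 1;
}

/* Returns the number of bytes written, or 0 if not quiet or buf too small. */
static size_t v1_save(const void *inst, void *buf, size_t len)
{
    const struct counter_v1 *c = inst;
    struct counter_state st = { 1, c->packets };

    if (!c->quiet || len < sizeof(st))
        return 0;
    memcpy(buf, &st, sizeof(st));
    return sizeof(st);
}

/* Accepts only version 1 state; fields new in v2 start from zero. */
static int v2_restore(void *inst, const void *buf, size_t len)
{
    struct counter_v2 *c = inst;
    struct counter_state st;

    if (len != sizeof(st))
        return -1;
    memcpy(&st, buf, sizeof(st));
    if (st.version != 1)
        return -1;
    c->packets = st.packets;
    c->drops = 0;
    c->quiet = 0;   /* replacement continues running */
    return 0;
}

/* In this sketch the old side needs quiesce/save, the new side restore. */
static const struct swap_ops v1_ops = { v1_quiesce, v1_save, NULL };
static const struct swap_ops v2_ops = { NULL, NULL, v2_restore };

/* Quiesce the old instance, move its state across, start the new one. */
static int hot_swap(const struct swap_ops *old_ops, void *old_inst,
                    const struct swap_ops *new_ops, void *new_inst)
{
    unsigned char buf[64];
    size_t n;

    old_ops->quiesce(old_inst);
    n = old_ops->save(old_inst, buf, sizeof(buf));
    if (n == 0)
        return -1;
    return new_ops->restore(new_inst, buf, n);
}
```

The versioned state struct is the design choice that matters here: the
replacement has to understand what its predecessor wrote, and say no
when it doesn't.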

If we can do this on a module level, then the next step is treating
more components of the kernel as modules, which is a fundamental
modularity thing.



On Sun, Feb 22, 2015 at 4:18 PM, Dave Airlie <airlied@xxxxxxxxx> wrote:
> On 23 February 2015 at 09:01, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>> On Sun, 22 Feb 2015 20:13:28 +0100 (CET) Jiri Kosina <jkosina@xxxxxxx> wrote:
>>
>>> But if you ask the folks who are hungry for live bug patching, they
>>> wouldn't care.
>>>
>>> You mentioned "10 seconds", that's more or less equal to infinity to them.
>>
>> 10 seconds outage is unacceptable, but we're running our service on a
>> single machine with no failover. Who is doing this??
>
> if I had to guess, telcos generally, you've only got one wire between a phone
> and the exchange and if the switch on the end needs patching it better be fast.
>
> Dave.