Re: [PATCH v6 02/20] liveupdate: luo_core: integrate with KHO
From: Pasha Tatashin
Date: Tue Nov 18 2025 - 10:26:22 EST
On Tue, Nov 18, 2025 at 10:06 AM Mike Rapoport <rppt@xxxxxxxxxx> wrote:
>
> On Tue, Nov 18, 2025 at 10:03:00AM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 18, 2025 at 01:21:34PM +0200, Mike Rapoport wrote:
> > > On Mon, Nov 17, 2025 at 11:22:54PM -0500, Pasha Tatashin wrote:
> > > > > You can avoid that complexity if you register the device with a different
> > > > > > fops, but that's a technicality.
> > > > >
> > > > > Your point about treating the incoming FDT as an underlying resource that
> > > > > failed to initialize makes sense, but nevertheless userspace needs a
> > > > > reliable way to detect it and parsing dmesg is not something we should rely
> > > > > on.
> > > >
> > > > I see two solutions:
> > > >
> > > > 1. LUO fails to retrieve the preserved data, the user gets informed by
> > > > not finding /dev/liveupdate, and studying the dmesg for what has
> > > > happened (in reality in fleets version mismatches should not be
> > > > happening, those should be detected in quals).
> > > > 2. Create a zombie device to return some errno on open, and still
> > > > study dmesg to understand what really happened.
> > >
> > > User should not study dmesg. We need another solution.
> > > What's wrong with e.g. ioctl()?
> >
> > It seems very dangerous to even boot at all if the next kernel doesn't
> > understand the serialization information.
> >
> > IMHO I think we should not even be thinking about this, it is up to
> > the predecessor environment to prevent it from happening. The ideas to
> > use ELF metadata/etc to allow a pre-flight validation are the right
> > solution.

100% agreed, this is the goal.

> > If we get into the next kernel and it receives information it cannot
> > process it should just BUG_ON and die, or some broad equivalent.

I initially had a panic() that would kill the kernel, but after
further consideration, I realized that we can still boot into
"maintenance" mode and allow the user to decide when and how to reboot
the machine back to a normal state.

Crashing during early boot has its own disadvantages: the crash kernel
is not available. Also, because live-update has to be very fast, the
console is likely to be disabled. Therefore, getting to userspace and
allowing the user to investigate what happened (e.g., automatically
retrieving dmesg or a core dump and filing a bug) before rebooting
seems like the most sensible approach.

This won't leak data, as /dev/liveupdate is completely disabled, so
nothing preserved in memory will be recoverable.
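
To illustrate the userspace side of this, here is a rough sketch of how
a boot-time agent might tell "maintenance" mode apart from a normal
live-update boot. It assumes that a disabled /dev/liveupdate shows up as
a missing device node (as in option 1 quoted above); the enum, the
function name lu_probe(), and the choice of errno values are
hypothetical and not part of this series.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical result codes, for illustration only. */
enum lu_state {
	LU_AVAILABLE,	/* device node opened successfully */
	LU_ABSENT,	/* no usable device: assume maintenance mode */
	LU_ERROR,	/* any other failure, e.g. EACCES or EBUSY */
};

/*
 * Probe the device node at 'path' (normally "/dev/liveupdate").
 * Assumption: a kernel that could not process the incoming data
 * does not register the device, so open() fails with ENOENT
 * (or ENODEV/ENXIO if a stale node exists without a driver).
 */
static enum lu_state lu_probe(const char *path)
{
	int fd = open(path, O_RDWR);

	if (fd >= 0) {
		close(fd);
		return LU_AVAILABLE;
	}
	if (errno == ENOENT || errno == ENODEV || errno == ENXIO)
		return LU_ABSENT;
	return LU_ERROR;
}
```

On LU_ABSENT the agent could save dmesg, file a bug, and schedule a
regular reboot, without anyone having to parse dmesg just to detect
that something went wrong.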

Pasha