Re: Bug report: kernel paniced when system hibernates

From: Alexandre Ghiti
Date: Thu May 18 2023 - 07:58:25 EST


On Thu, May 18, 2023 at 12:35 PM Conor Dooley
<conor.dooley@xxxxxxxxxxxxx> wrote:
>
> On Thu, May 18, 2023 at 10:41:19AM +0200, Alexandre Ghiti wrote:
> > On Thu, May 18, 2023 at 10:00 AM Conor Dooley
> > <conor.dooley@xxxxxxxxxxxxx> wrote:
> > > On Thu, May 18, 2023 at 12:23:59PM +0530, Anup Patel wrote:
> > > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
> > > > > On Wed, May 17, 2023 at 1:28 PM Conor Dooley <conor.dooley@xxxxxxxxxxxxx> wrote:
> > > > > > On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > > > > > > On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > > > On Tue, May 16, 2023 at 11:24 AM Song Shuai <suagrfillet@xxxxxxxxx> wrote:
> > > > > > > > I actually removed this flag a few years ago, and I have to admit that
> > > > > > > > I need to check if that's necessary: the goal of commit 3335068f8721
> > > > > > > > ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > > > > > > > the "right" start of DRAM so that we can align virtual and physical
> > > > > > > > addresses on a 1GB boundary.
> > > > > > > >
> > > > > > > > So I have to check if a nomap region is actually added as a
> > > > > > > > memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > > > > > > > the nomap attributes to the PMP regions, otherwise, I don't think that
> > > > > > > > is a good solution.
> > > > > > >
> > > > > > > So here is the current linear mapping without nomap in openSBI:
> > > > > > >
> > > > > > > ---[ Linear mapping ]---
> > > > > > > 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > > > > > > PMD D A G . . W R V
> > > > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > > > PMD D A G . . . R V
> > > > > > >
> > > > > > > And below the linear mapping with nomap in openSBI:
> > > > > > >
> > > > > > > ---[ Linear mapping ]---
> > > > > > > 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > > > > > > PTE D A G . . W R V
> > > > > > > 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > > > > > > PMD D A G . . . R V
> > > > > > >
> > > > > > > So adding nomap does not misalign virtual and physical addresses, it
> > > > > > > prevents the usage of 1GB page for this area though, so that's a
> > > > > > > solution, we just lose this 1GB page here.
> > > > > > >
> > > > > > > But even though that may be the fix, I think we also need to fix that
> > > > > > > in the kernel as it would break compatibility with certain versions of
> > > > > > > openSBI *if* we fix openSBI...So here are a few solutions:
> > > > > > >
> > > > > > > 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > > > > > > before the linear mapping is established (IIUC, those nodes are added
> > > > > > > by openSBI to advertise PMP regions)
> > > > > > > -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > > > > >
> > > > > > AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > > > > > not an option, right?
> > > > >
> > > > > Not sure this is a real regression, I'd rather avoid it, but as
> > > > > mentioned in my first answer, Mike Rapoport showed that it was making
> > > > > no difference performance-wise...
> > >
> > > My point was that if someone has hugepages enabled & we handle this in a
> > > way that causes the first hugepage to be unusable, is that not a
> > > regression? Whether hugepages provide a performance benefit is not
> > > really related to that question AFAICT.
> >
> > Not being able to map certain regions of the linear mapping with a 1GB
> > hugepage will happen, for example the kernel mapping is protected in
> > the linear mapping so that it can't be written: so we can only map
> > this region with 2MB hugepages. A firmware could mark a region as
> > "no-map" and there again we would not be able to use a 1GB hugepage. I
> > don't see that as a regression as the intention is not to *always* use
> > 1GB hugepages, but rather to use them when possible. Does that make
> > sense?
>
> Ah, so it was as I expected - I don't/didn't properly understand
> hugepages. Thanks.
>
> > > Were you suggesting reverting hugepage support entirely in your original
> > > message? If we entirely remove hugepage support, then I guess the first
> > > hugepage cannot be lost..
> >
> > Given Mike Rapoport's recent talk, I think that's an option, yes.
> >
> > >
> > > > > > > 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > > > > > > to PMP regions
> > > > > > > -> We don't lose the 1GB hugepage \o/
> > > > > > > 3. we can use register_nosave_region() to not save the "mmode_resv"
> > > > > > > regions (x86 does that
> > > > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > > > > > > -> We don't lose the 1GB hugepage \o/
> > > > > > > 4. Given JeeHeng pointer to
> > > > > > > https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > > > > > > we can mark those pages as non-readable and make the hibernation
> > > > > > > process not save those pages
> > > > > > > -> Very late-in-the-day idea, not sure what it's worth, we also
> > > > > > > lose the 1GB hugepage...
> > > > > >
> > > > > > Ditto here re: introducing another regression.
> > > > > >
> > > > > > > To me, the best solution is 3 as it would prepare for other similar
> > > > > > > issues later, it is similar to x86 and it allows us to keep 1GB
> > > > > > > hugepages.
> > > > > > >
> > > > > > > I have been thinking, and to me nomap does not provide anything since
> > > > > > > the kernel should not address this memory range, so if it does, we
> > > > > > > must fix the kernel.
> > > > > > >
> > > > > > > Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > > > > >
> > > > > > #3 would probably get my vote too. It seems like you could use it
> > > > > > dynamically if there was to be a future other provider of "mmode_resv"
> > > > > > regions, rather than doing something location-specific.
> > > > > >
> > > > > > We should probably document these opensbi reserved memory nodes though
> > > > > > in a dt-binding or w/e if we are going to be relying on them to not
> > > > > > crash!
> > > >
> > > > Depending on a particular node name is fragile. If we really need
> > > > information from DT then I suggest adding "no-save-restore" DT
> > > > property in reserved memory nodes.
> > >
> > > We can add whatever properties we like, but where does that leave us for
> > > the systems in the wild where their reserved memory nodes do not contain
> > > a "no-save-restore" property or "no-map"?
> > >
> > > Ideally, yes, we do not depend on the node name and instead use explicit
> > > properties - but I think we may be "forced" to fall back to checking the
> > > node-name to cover the opensbi versions that do not contain one.
> > > LMK if I have missed something there!
> >
> > Yes I agree with you, we can implement Anup's solution #1, but we need
> > to fix the kernel anyway since if we don't, that would make the kernel
> > hibernation support depend on a certain version of openSBI.
>
> It's not unreasonable to have things depend on versions of the SBI
> implementation, but it is if they're not things that can be probed using
> the standard interfaces!
>
> > > > > Yes, you're right, let's see what Atish and Anup think!
> > > >
> > > > I think we have two possible approaches:
> > > >
> > > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > > reserved regions. We were doing this previously but removed
> > > > it later for performance reasons mentioned by Alex. It is also
> > > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > > DT property for firmware reserved regions.
> > > >
> > > > 2) Add a new "no-save-restore" DT property in the reserved
> > > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > > will use this DT property to exclude memory regions from
> > > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > > should do the following:
> > > > 1) Treat all memory having "no-map" DT property as EFI
> > > > reserved memory
> > > > 2) Treat all memory not having "no-map" DT property and
> > > > not having "no-save-restore" DT property as EfiBootServicesData
> > > > 3) Treat all memory not having "no-map" DT property and
> > > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > > (Refer,
> > > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > > >
> > > > Personally, I am leaning towards approach#1 since approach#2
> > > > will require changing DeviceTree specification as well.
> > >
> > > #1 is by far the simpler option of the two, if the consensus is that the
> > > loss of the first hugepage is not a problem (or if it is a problem that
> > > realistically is unavoidable).
> >
> > The "no-map" property does not provide much security anyway: the
> > kernel should not touch a page that is reserved (this is where I may
> > be wrong), so the real fix to this issue is to make the hibernation
> > process not save those pages.
>
> Right, the kernel clearly needs to be able to handle the regions. I, at
> least, was commenting on re-using no-map versus creating new properties
> for this situation.
> I was advocating for re-using the property & changing the kernel so as
> not to touch the regions during hibernation.
>
> > > There's something else I think I might be missing here, given the
> > > scattered nature of the reporting. This is not a problem for a system
> > > that does not implement hibernation, which was only added in v6.4-rc1?
> > >
> > > That would make it not a regression after all. I think I misunderstood
> > > the report on sw-dev to mean that this was a problem generally after
> > > v6.4-rc1, which would have been one. Could someone please confirm that?
> >
> > The problem is only present since v6.4-rc1, that's not a regression,
> > it's just that both patches landed at the same time and gave rise to
> > this issue.
>
> Sick. Glad to be wrong here!
>
> #regzbot resolve: not a regression, feature introduced this cycle
>
> > > If it only affects hibernation, and is not a regression, should we make
> > > ARCH_HIBERNATION_POSSIBLE def_bool n in Kconfig until we have agreed on
> > > a solution for the problem?
>
> Any thoughts on this one?

Ahah, I don't know, I tried to dodge the question :) But if we don't
fix this issue, hibernation won't work so is it enough?