Re: Bug report: kernel paniced when system hibernates
From: Anup Patel
Date: Thu May 18 2023 - 10:04:35 EST
On Thu, May 18, 2023 at 5:39 PM Alexandre Ghiti <alex@xxxxxxxx> wrote:
>
> On 5/18/23 08:53, Anup Patel wrote:
> > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
> >> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <conor.dooley@xxxxxxxxxxxxx> wrote:
> >>> Hey Alex,
> >>>
> >>> On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> >>>> On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
> >>>>> On Tue, May 16, 2023 at 11:24 AM Song Shuai <suagrfillet@xxxxxxxxx> wrote:
> >>>>> I actually removed this flag a few years ago, and I have to admit that
> >>>>> I need to check if that's necessary: the goal of commit 3335068f8721
> >>>>> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> >>>>> the "right" start of DRAM so that we can align virtual and physical
> >>>>> addresses on a 1GB boundary.
> >>>>>
> >>>>> So I have to check if a nomap region is actually added as a
> >>>>> memblock.memory.regions[] or not: if yes, that's perfect, let's add
> >>>>> the nomap attributes to the PMP regions, otherwise, I don't think that
> >>>>> is a good solution.
> >>>> So here is the current linear mapping without nomap in openSBI:
> >>>>
> >>>> ---[ Linear mapping ]---
> >>>> 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> >>>> PMD D A G . . W R V
> >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> >>>> PMD D A G . . . R V
> >>>>
> >>>> And below the linear mapping with nomap in openSBI:
> >>>>
> >>>> ---[ Linear mapping ]---
> >>>> 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> >>>> PTE D A G . . W R V
> >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> >>>> PMD D A G . . . R V
> >>>>
> >>>> So adding nomap does not misalign virtual and physical addresses, it
> >>>> prevents the usage of 1GB page for this area though, so that's a
> >>>> solution, we just lose this 1GB page here.
> >>>>
> >>>> But even though that may be the fix, I think we also need to fix that
> >>>> in the kernel as it would break compatibility with certain versions of
> >>>> openSBI *if* we fix openSBI...So here are a few solutions:
> >>>>
> >>>> 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> >>>> before the linear mapping is established (IIUC, those nodes are added
> >>>> by openSBI to advertise PMP regions)
> >>>> -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> >>> AFAIU, losing the 1 GB hugepage is a regression, which would make this
> >>> not an option, right?
> >> Not sure this is a real regression, I'd rather avoid it, but as
> >> mentioned in my first answer, Mike Rapoport showed that it was making
> >> no difference performance-wise...
> >>
> >>>> 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> >>>> to PMP regions
> >>>> -> We don't lose the 1GB hugepage \o/
> >>>> 3. we can use register_nosave_region() to not save the "mmode_resv"
> >>>> regions (x86 does that
> >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> >>>> -> We don't lose the 1GB hugepage \o/
> >>>> 4. Given JeeHeng pointer to
> >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> >>>> we can mark those pages as non-readable and make the hibernation
> >>>> process not save those pages
> >>>> -> Very late-in-the-day idea, not sure what it's worth, we also
> >>>> lose the 1GB hugepage...
> >>> Ditto here re: introducing another regression.
> >>>
> >>>> To me, the best solution is 3 as it would prepare for other similar
> >>>> issues later, it is similar to x86 and it allows us to keep 1GB
> >>>> hugepages.
> >>>>
> >>>> I have been thinking, and to me nomap does not provide anything since
> >>>> the kernel should not address this memory range, so if it does, we
> >>>> must fix the kernel.
> >>>>
> >>>> Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> >>> #3 would probably get my vote too. It seems like you could use it
> >>> dynamically if there was to be a future other provider of "mmode_resv"
> >>> regions, rather than doing something location-specific.
> >>>
> >>> We should probably document these opensbi reserved memory nodes though
> >>> in a dt-binding or w/e if we are going to be relying on them to not
> >>> crash!
> > Depending on a particular node name is fragile. If we really need
> > information from DT then I suggest adding "no-save-restore" DT
> > property in reserved memory nodes.
>
>
> I understand your point, the node name is the only thing I found that
> would work with current opensbi: any other idea what we could use instead?
>
>
> >> Yes, you're right, let's see what Atish and Anup think!
> > I think we have two possible approaches:
> >
> > 1) Update OpenSBI to set "no-map" DT property for firmware
> > reserved regions. We were doing this previously but removed
> > it later for performance reasons mentioned by Alex. It is also
> > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > DT property for firmware reserved regions.
> >
> > 2) Add a new "no-save-restore" DT property in the reserved
> > memory DT bindings. The hibernate support of Linux arch/riscv
> > will use this DT property to exclude memory regions from
> > save-restore. The EFI implementation of EDK2 and U-Boot
> > should do the following:
> > 1) Treat all memory having "no-map" DT property as EFI
> > reserved memory
> > 2) Treat all memory not having "no-map" DT property and
> > not having "no-save-restore" DT property as EfiBootServicesData
> > 3) Treat all memory not having "no-map" DT property and
> > having "no-save-restore" DT property as EfiRuntimeServiceData
> > (Refer,
> > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> >
> > Personally, I am leaning towards approach#1 since approach#2
> > will require changing DeviceTree specification as well.
>
>
> If needed, indeed #1 is the simplest, but I insist, to me it is not
> needed (and we don't have it in the current opensbi), if you have
> another opinion, I'm open to discuss it!
I agree with you, backward compatibility with older firmwares
is important.
Let's go with your proposed change to treat reserved DT nodes
with "mmode_resv*" name as M-mode firmware memory (it could
be any M-mode firmware). We will certainly need to document it
somewhere as an expectation of Linux RISC-V kernel.
@Sunil How about treating "mmode_resv*" as
EfiRuntimeServiceData in EDK2 ? Other reserved memory
nodes can follow the device tree specification.
Regards,
Anup
>
> Thanks for your quick answer Anup,
>
> Alex
>
>
> >
> > Regards,
> > Anup
> >
> >> Thanks for your quick answers Conor and Song, really appreciated!
> >>
> >> Alex
> >>
> >>> Thanks for working on this,
> >>> Conor.
> >>>
> > _______________________________________________
> > linux-riscv mailing list
> > linux-riscv@xxxxxxxxxxxxxxxxxxx
> > http://lists.infradead.org/mailman/listinfo/linux-riscv