Re: Bug report: kernel paniced when system hibernates

From: Atish Patra
Date: Wed May 24 2023 - 19:45:55 EST


On Thu, May 18, 2023 at 7:04 AM Anup Patel <apatel@xxxxxxxxxxxxxxxx> wrote:
>
> On Thu, May 18, 2023 at 5:39 PM Alexandre Ghiti <alex@xxxxxxxx> wrote:
> >
> > On 5/18/23 08:53, Anup Patel wrote:
> > > On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
> > >> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <conor.dooley@xxxxxxxxxxxxx> wrote:
> > >>> Hey Alex,
> > >>>
> > >>> On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
> > >>>> On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
> > >>>>> On Tue, May 16, 2023 at 11:24 AM Song Shuai <suagrfillet@xxxxxxxxx> wrote:
> > >>>>> I actually removed this flag a few years ago, and I have to admit that
> > >>>>> I need to check if that's necessary: the goal of commit 3335068f8721
> > >>>>> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
> > >>>>> the "right" start of DRAM so that we can align virtual and physical
> > >>>>> addresses on a 1GB boundary.
> > >>>>>
> > >>>>> So I have to check if a nomap region is actually added as a
> > >>>>> memblock.memory.regions[] or not: if yes, that's perfect, let's add
> > >>>>> the nomap attributes to the PMP regions, otherwise, I don't think that
> > >>>>> is a good solution.
> > >>>> So here is the current linear mapping without nomap in openSBI:
> > >>>>
> > >>>> ---[ Linear mapping ]---
> > >>>> 0xff60000000000000-0xff60000000200000 0x0000000080000000 2M
> > >>>> PMD D A G . . W R V
> > >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > >>>> PMD D A G . . . R V
> > >>>>
> > >>>> And below the linear mapping with nomap in openSBI:
> > >>>>
> > >>>> ---[ Linear mapping ]---
> > >>>> 0xff60000000080000-0xff60000000200000 0x0000000080080000 1536K
> > >>>> PTE D A G . . W R V
> > >>>> 0xff60000000200000-0xff60000000e00000 0x0000000080200000 12M
> > >>>> PMD D A G . . . R V
> > >>>>
> > >>>> So adding nomap does not misalign virtual and physical addresses, it
> > >>>> prevents the usage of 1GB page for this area though, so that's a
> > >>>> solution, we just lose this 1GB page here.
> > >>>>
> > >>>> But even though that may be the fix, I think we also need to fix that
> > >>>> in the kernel as it would break compatibility with certain versions of
> > >>>> openSBI *if* we fix openSBI...So here are a few solutions:
> > >>>>
> > >>>> 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
> > >>>> before the linear mapping is established (IIUC, those nodes are added
> > >>>> by openSBI to advertise PMP regions)
> > >>>> -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
> > >>> AFAIU, losing the 1 GB hugepage is a regression, which would make this
> > >>> not an option, right?
> > >> Not sure this is a real regression, I'd rather avoid it, but as
> > >> mentioned in my first answer, Mike Rapoport showed that it was making
> > >> no difference performance-wise...
> > >>
> > >>>> 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
> > >>>> to PMP regions
> > >>>> -> We don't lose the 1GB hugepage \o/
> > >>>> 3. we can use register_nosave_region() to not save the "mmode_resv"
> > >>>> regions (x86 does that
> > >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
> > >>>> -> We don't lose the 1GB hugepage \o/
> > >>>> 4. Given JeeHeng pointer to
> > >>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
> > >>>> we can mark those pages as non-readable and make the hibernation
> > >>>> process not save those pages
> > >>>> -> Very late-in-the-day idea, not sure what it's worth, we also
> > >>>> lose the 1GB hugepage...
> > >>> Ditto here re: introducing another regression.
> > >>>
> > >>>> To me, the best solution is 3 as it would prepare for other similar
> > >>>> issues later, it is similar to x86 and it allows us to keep 1GB
> > >>>> hugepages.
> > >>>>
> > >>>> I have been thinking, and to me nomap does not provide anything since
> > >>>> the kernel should not address this memory range, so if it does, we
> > >>>> must fix the kernel.
> > >>>>
> > >>>> Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
> > >>> #3 would probably get my vote too. It seems like you could use it
> > >>> dynamically if there was to be a future other provider of "mmode_resv"
> > >>> regions, rather than doing something location-specific.
> > >>>
> > >>> We should probably document these opensbi reserved memory nodes though
> > >>> in a dt-binding or w/e if we are going to be relying on them to not
> > >>> crash!
> > > Depending on a particular node name is fragile. If we really need
> > > information from DT then I suggest adding "no-save-restore" DT
> > > property in reserved memory nodes.
> >
> >
> > I understand your point, the node name is the only thing I found that
> > would work with current opensbi: any other idea what we could use instead?
> >
> >
> > >> Yes, you're right, let's see what Atish and Anup think!
> > > I think we have two possible approaches:
> > >
> > > 1) Update OpenSBI to set "no-map" DT property for firmware
> > > reserved regions. We were doing this previously but removed
> > > it later for performance reasons mentioned by Alex. It is also
> > > worth mentioning that ARM Trusted Firmware also sets "no-map"
> > > DT property for firmware reserved regions.
> > >
> > > 2) Add a new "no-save-restore" DT property in the reserved
> > > memory DT bindings. The hibernate support of Linux arch/riscv
> > > will use this DT property to exclude memory regions from
> > > save-restore. The EFI implementation of EDK2 and U-Boot
> > > should do the following:
> > > 1) Treat all memory having "no-map" DT property as EFI
> > > reserved memory
> > > 2) Treat all memory not having "no-map" DT property and
> > > not having "no-save-restore" DT property as EfiBootServicesData
> > > 3) Treat all memory not having "no-map" DT property and
> > > having "no-save-restore" DT property as EfiRuntimeServiceData
> > > (Refer,
> > > https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
> > >
> > > Personally, I am leaning towards approach#1 since approach#2
> > > will require changing DeviceTree specification as well.
> >
> >
> > If needed, indeed #1 is the simplest, but I insist, to me it is not
> > needed (and we don't have it in the current opensbi), if you have
> > another opinion, I'm open to discuss it!
>

The problem with relying on the "mmode_resv" name is that there will be
other use cases where a portion of the memory must be reserved and accessing
that from the kernel will result in fault. CoVE is such a use case where
TSM will probably run from a memory region with confidential memory
which the kernel
must not access.

We have to name it "mmode_resv" as well or mark it as "no-map" which will
present a hole in mappings. We will end up in same 1GB hugepage issue
if we choose
"no-map".

Another option is to use compatible string or label property to indicate
that this memory region is not to be saved/restored during hibernation.
This can be documented in RISC-V DT bindings as well as the booting guide
doc that alex was talking about.


> I agree with you, backward compatibility with older firmwares
> is important.
>
This does break the compatibility with the older firmware w.r.to hibernation
feature. However

It is specific to hibernation only. So hibernation will fail to work
if an user is running kernel > 6.4 but 0.8 < OpenSBI < 1.2

The same problem lies if users use other firmware that don't have
no-map property today. IMO, this can be documented as a known problem.


> Let's go with your proposed change to treat reserved DT nodes
> with "mmode_resv*" name as M-mode firmware memory (it could
> be any M-mode firmware). We will certainly need to document it
> somewhere as an expectation of Linux RISC-V kernel.
>
> @Sunil How about treating "mmode_resv*" as
> EfiRuntimeServiceData in EDK2 ? Other reserved memory
> nodes can follow the device tree specification.
>

Either way, we also need to fix U-Boot to match the behavior. Currently,
it treats any reserved memory without no-map property as EFI_BOOT_SERVICES_DATA.

> Regards,
> Anup
>
> >
> > Thanks for your quick answer Anup,
> >
> > Alex
> >
> >
> > >
> > > Regards,
> > > Anup
> > >
> > >> Thanks for your quick answers Conor and Song, really appreciated!
> > >>
> > >> Alex
> > >>
> > >>> Thanks for working on this,
> > >>> Conor.
> > >>>
> > > _______________________________________________
> > > linux-riscv mailing list
> > > linux-riscv@xxxxxxxxxxxxxxxxxxx
> > > http://lists.infradead.org/mailman/listinfo/linux-riscv
>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv@xxxxxxxxxxxxxxxxxxx
> http://lists.infradead.org/mailman/listinfo/linux-riscv



--
Regards,
Atish