Re: [PATCH][RFC] mm: Introduce kernelcore=reliable option

From: Kamezawa Hiroyuki
Date: Tue Oct 13 2015 - 05:52:00 EST

Next message: Paul Cercueil: "[PATCHv2 1/2] Documentation: ad5592r: Added devicetree bindings documentation"
Previous message: yalin wang: "Re: [RFC] arm: add __initbss section attribute"
In reply to: Luck, Tony: "RE: [PATCH][RFC] mm: Introduce kernelcore=reliable option"
Next in thread: Luck, Tony: "RE: [PATCH][RFC] mm: Introduce kernelcore=reliable option"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2015/10/09 19:36, Xishi Qiu wrote:

On 2015/10/9 17:24, Kamezawa Hiroyuki wrote:

On 2015/10/09 15:46, Xishi Qiu wrote:

On 2015/10/9 22:56, Taku Izumi wrote:

Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS compliant with UEFI spec 2.5 can notify which
ranges are reliable (mirrored) via the EFI memory map.
The Linux kernel now utilizes this information and allocates
boot time memory from the reliable region.

My requirement is:
- allocate kernel memory from reliable region
- allocate user memory from non-reliable region

ZONE_MOVABLE is useful for meeting this requirement.
By arranging the non-reliable ranges into ZONE_MOVABLE,
reliable memory is used only for kernel allocations.

Hi Taku,

You mean putting non-mirrored memory in the movable zone and
mirrored memory in the normal zone, right? So kernel allocations
will use mirrored memory in the normal zone, and user allocations
will use non-mirrored memory in the movable zone.

My question is:
1) do we need to change the fallback function?

For *our* requirement, it's not required. But if someone wants to prevent
user memory from being allocated from ZONE_NORMAL, we need some change in
zonelist walking.

Hi Kame,

So we assume the kernel will only use the normal zone (mirrored), and users will
use the movable zone (non-mirrored) first and, if that memory is not enough, the
normal zone too.

Yes.
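
The fallback order agreed on above can be sketched in userspace C. This is a minimal illustration and not kernel code: the zone enum mirrors the kernel's ordering, but struct zone, free_pages and pick_zone() are simplified, hypothetical stand-ins for the real zonelist walk.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified zone indices, ordered from lowest to highest. */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, MAX_ZONES };

/* Hypothetical, heavily reduced zone: only a free-page counter. */
struct zone {
    unsigned long free_pages;
};

/*
 * Pick a zone for one page. Movable (user) requests start at ZONE_MOVABLE;
 * kernel requests start at ZONE_NORMAL. Both fall back to lower zones only,
 * so a kernel request never lands in ZONE_MOVABLE (the non-mirrored memory).
 * Returns the zone index used, or -1 if every eligible zone is empty.
 */
int pick_zone(struct zone zones[MAX_ZONES], int movable)
{
    int highest = movable ? ZONE_MOVABLE : ZONE_NORMAL;

    assert(zones != NULL);
    for (int z = highest; z >= ZONE_DMA; z--) {
        if (zones[z].free_pages > 0) {
            zones[z].free_pages--;
            return z;
        }
    }
    return -1;
}
```

With one free page each in ZONE_NORMAL and ZONE_MOVABLE, two user requests consume the movable page and then the normal page, which is the "users use normal zone too" case above.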

2) the mirrored region should be located at the start of the normal
zone, right?

Precisely, "not-reliable" ranges of memory are handled by ZONE_MOVABLE.
This patch does only that.

I mean the mirrored region cannot be in the middle or at the end of the zone;
the BIOS should report the memory like this,

e.g.
BIOS
node0: 0-4G mirrored, 4-8G mirrored, 8-16G non-mirrored
node1: 16-24G mirrored, 24-32G non-mirrored

OS
node0: DMA DMA32 are both mirrored, NORMAL(4-8G), MOVABLE(8-16G)
node1: NORMAL(16-24G), MOVABLE(24-32G)
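
The BIOS-to-OS mapping in this example can be expressed as a small userspace C sketch. It is an illustration of the layout only, not how the patch computes zone boundaries; struct mem_range and movable_start_gb() are hypothetical names, and the sketch assumes each node's ranges are sorted by address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One range from the firmware memory map, in GB for readability. */
struct mem_range {
    unsigned long long start_gb;  /* inclusive */
    unsigned long long end_gb;    /* exclusive */
    bool mirrored;
};

/*
 * Given one node's ranges sorted by address, return where ZONE_MOVABLE
 * would start: the beginning of the trailing run of non-mirrored ranges.
 * If the last range is mirrored, the node's end address is returned,
 * meaning the node gets no movable zone.
 */
unsigned long long movable_start_gb(const struct mem_range *r, size_t n)
{
    assert(r != NULL && n > 0);

    unsigned long long start = r[n - 1].end_gb;
    for (size_t i = n; i-- > 0; ) {
        if (r[i].mirrored)
            break;
        start = r[i].start_gb;
    }
    return start;
}
```

For the layout above this gives 8G for node0 and 24G for node1, matching MOVABLE(8-16G) and MOVABLE(24-32G).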

I think zones can be overlapped even while they are aligned to MAX_ORDER.

I remember Kame has already suggested this idea. In my opinion,
I still think it's better to add a new migratetype or a new zone,
so both user and kernel could use mirrored memory.

Hi, Xishi.

Izumi-san and I discussed the implementation at length and found that using a
"zone" is the better approach.

The biggest reason is that a zone is the unit of vmscan, of all statistics, and
of handling a range of memory for a given purpose. We can reuse all the vmscan
and statistics code by making use of zones. Introducing another structure would
be messy.

Yes, adding a new zone is better, but it would change a lot of code, so reusing
ZONE_MOVABLE is simpler and easier, right?

I think so. If someone has difficulty with ZONE_MOVABLE, adding a zone will be a separate job.
(*) Taku-san's boot option specifies that kernelcore is placed into reliable memory and
doesn't specify anything about users.

His patch is very simple.

The following plan sounds good to me. Shall we rename the zone when it is
used for mirrored memory? "movable" is a little confusing.

Maybe. I think that should be a separate discussion. With this patch and his
fake-reliable-memory patch, everyone can give it a try.

For your requirements, Izumi-san and I are discussing the following plan.

- Add a flag to show whether the zone is reliable or not, then mark ZONE_MOVABLE as not-reliable.
- Add __GFP_RELIABLE. This will allow alloc_pages() to skip not-reliable zones.
- Add madvise() MADV_RELIABLE and modify the page fault code's gfp flags according to that flag.

like this?
user: madvise()/mmap()/or others -> add vma_reliable flag -> add gfp_reliable flag -> alloc_pages
kernel: use __GFP_RELIABLE flag in buddy allocation/slab/vmalloc...

yes.
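
The flow confirmed above (VMA flag -> gfp flag -> zone skipping in the allocator) can be sketched in userspace C. Everything here is an assumption made for illustration: __GFP_RELIABLE and MADV_RELIABLE are only proposals in this thread, so the SKETCH_* constants, the structs, and both functions are hypothetical and do not correspond to existing kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_GFP_RELIABLE 0x1u  /* hypothetical stand-in for __GFP_RELIABLE */
#define SKETCH_VM_RELIABLE  0x1u  /* hypothetical VMA flag set via madvise() */

struct sketch_zone {
    const char *name;
    bool reliable;              /* the proposed per-zone flag */
    unsigned long free_pages;
};

struct sketch_vma {
    unsigned int flags;
};

/* Page-fault side: translate the VMA flag into an allocation flag. */
unsigned int fault_gfp(const struct sketch_vma *vma)
{
    assert(vma != NULL);
    return (vma->flags & SKETCH_VM_RELIABLE) ? SKETCH_GFP_RELIABLE : 0u;
}

/*
 * Allocator side: walk the zonelist in preference order and skip zones
 * that are not marked reliable when the caller asked for reliable memory.
 * Returns the chosen zone, or NULL if no eligible zone has a free page.
 */
struct sketch_zone *alloc_zone(struct sketch_zone *zonelist, size_t n,
                               unsigned int gfp)
{
    for (size_t i = 0; i < n; i++) {
        struct sketch_zone *z = &zonelist[i];

        if ((gfp & SKETCH_GFP_RELIABLE) && !z->reliable)
            continue;
        if (z->free_pages > 0) {
            z->free_pages--;
            return z;
        }
    }
    return NULL;
}
```

In this sketch a reliable request fails once the reliable zones are empty instead of falling back to non-reliable memory; whether such a fallback should exist is not settled in this thread.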

Also we can introduce some interfaces in procfs or sysfs, right?

It's based on your use case. I think madvise() will be the 1st choice.

Thanks,
-kame
