Re: [PATCH RFC v2 11/27] arm64: mte: Reserve tag storage memory
From: Alexandru Elisei
Date: Wed Dec 13 2023 - 08:05:14 EST
Hi Rob,
On Tue, Dec 12, 2023 at 12:44:06PM -0600, Rob Herring wrote:
> On Tue, Dec 12, 2023 at 10:38 AM Alexandru Elisei
> <alexandru.elisei@xxxxxxx> wrote:
> >
> > Hi Rob,
> >
> > Thank you so much for the feedback, I'm not very familiar with device tree,
> > and any comments are very useful.
> >
> > On Mon, Dec 11, 2023 at 11:29:40AM -0600, Rob Herring wrote:
> > > On Sun, Nov 19, 2023 at 10:59 AM Alexandru Elisei
> > > <alexandru.elisei@xxxxxxx> wrote:
> > > >
> > > > Allow the kernel to get the size and location of the MTE tag storage
> > > > regions from the DTB. This memory is marked as reserved for now.
> > > >
> > > > The DTB node for the tag storage region is defined as:
> > > >
> > > > tags0: tag-storage@8f8000000 {
> > > >         compatible = "arm,mte-tag-storage";
> > > >         reg = <0x08 0xf8000000 0x00 0x4000000>;
> > > >         block-size = <0x1000>;
> > > >         memory = <&memory0>; // Associated tagged memory node
> > > > };
> > >
> > > I skimmed thru the discussion some. If this memory range is within
> > > main RAM, then it definitely belongs in /reserved-memory.
> >
> > Ok, will do that.
> >
> > If you don't mind, why do you say that it definitely belongs in
> > reserved-memory? I'm not trying to argue otherwise, I'm curious about the
> > motivation.
>
> Simply so that /memory nodes describe all possible memory and
> /reserved-memory is just adding restrictions. It's also because
> /reserved-memory is what gets handled early, and we don't need
> multiple things to handle early.
>
> > Tag storage is not DMA and can live anywhere in memory.
>
> Then why put it in DT at all? The only reason CMA is there is to set
> the size. It's not even clear to me we need CMA in DT either. The
> reasoning long ago was the kernel didn't do a good job of moving and
> reclaiming contiguous space, but that's supposed to be better now (and
> most h/w figured out they need IOMMUs).
>
> But for tag storage you know the size as it is a function of the
> memory size, right? After all, you are validating the size is correct.
> I guess there is still the aspect of whether you want to enable MTE or
> not which could be done in a variety of ways.
Oh, sorry, my bad, I should have been clearer about this. I don't want to
put it in the DT as a "linux,cma" node, but I do want it to be managed by
CMA.
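To make that concrete, what I have in mind is roughly the sketch below (the
names and the exact call are my assumption, not something this series does
yet): the region stays a plain /reserved-memory node with the
"arm,mte-tag-storage" compatible, and the kernel hands it to CMA itself
instead of describing it as a shared-dma-pool:

#include <linux/cma.h>
#include <linux/of_reserved_mem.h>

static struct cma *tag_storage_cma;

/* Sketch: pick the region up from /reserved-memory and give it to CMA,
 * without a "linux,cma" (shared-dma-pool) node in the DT. */
static int __init tag_storage_rmem_init(struct reserved_mem *rmem)
{
        /* order_per_bit = 0: manage the area at page granularity. */
        return cma_init_reserved_mem(rmem->base, rmem->size, 0,
                                     "mte-tag-storage", &tag_storage_cma);
}
RESERVEDMEM_OF_DECLARE(mte_tag_storage, "arm,mte-tag-storage",
                       tag_storage_rmem_init);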
>
> > In
> > arm64_memblock_init(), the kernel first removes the memory that it cannot
> > address from memblock. For example, because it has been compiled with
> > CONFIG_ARM64_VA_BITS_39=y. And then calls
> > early_init_fdt_scan_reserved_mem().
> >
> > What happens if reserved memory is above what the kernel can address?
>
> I would hope the kernel handles it. That's the kernel's problem unless
> there's some h/w limitation to access some region. The DT can't have
> things dependent on the kernel config.
I would hope so too; that's why I was surprised when I put reserved memory
at 1TB in a 39-bit VA kernel and got a panic.
>
> > From my testing, when the kernel is compiled with 39-bit VA, if I use
> > reserved memory to discover tag storage that lives above the virtual address
> > limit and then I try to use CMA to manage the tag storage memory, I get a
> > kernel panic:
>
> Looks like we should handle that better...
I guess we don't need to tackle that problem right now. I don't know of
many systems in the wild that have memory above 1TB.
>
> > [ 0.000000] Reserved memory: created CMA memory pool at 0x0000010000000000, size 64 MiB
> > [ 0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
> > [ 0.000000] OF: reserved mem: 0x0000010000000000..0x0000010003ffffff (65536 KiB) map reusable linux,cma
> > [..]
> > [ 0.806945] Unable to handle kernel paging request at virtual address 00000001fe000000
> > [ 0.807277] Mem abort info:
> > [ 0.807277] ESR = 0x0000000096000005
> > [ 0.807693] EC = 0x25: DABT (current EL), IL = 32 bits
> > [ 0.808110] SET = 0, FnV = 0
> > [ 0.808443] EA = 0, S1PTW = 0
> > [ 0.808526] FSC = 0x05: level 1 translation fault
> > [ 0.808943] Data abort info:
> > [ 0.808943] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> > [ 0.809360] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> > [ 0.809776] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> > [ 0.810221] [00000001fe000000] user address but active_mm is swapper
> > [..]
> > [ 0.820887] Call trace:
> > [ 0.821027] cma_init_reserved_areas+0xc4/0x378
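If we do end up handling it, I imagine a check along these lines before the
region is handed to CMA would be enough (just a sketch; it assumes the range
is kept in PFNs, as in this patch, and that memblock has already been trimmed
by arm64_memblock_init()):

#include <linux/memblock.h>
#include <linux/pfn.h>
#include <linux/range.h>

/* Sketch: skip tag storage that ends above what the kernel can address,
 * instead of letting cma_init_reserved_areas() fault later, as in the
 * trace above. */
static bool __init tag_range_addressable(const struct range *tag_range)
{
        return PFN_PHYS(tag_range->end + 1) <= memblock_end_of_DRAM();
}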
> >
> > >
> > > You need a binding for this too.
> >
> > By binding you mean having a YAML file in dt-schema [1] describing the tag
> > storage node, right?
>
> Yes, but in the kernel tree is fine.
Cool, thanks.
>
> [...]
>
> > > > +static int __init tag_storage_of_flat_get_range(unsigned long node, const __be32 *reg,
> > > > +                                                int reg_len, struct range *range)
> > > > +{
> > > > +        int addr_cells = dt_root_addr_cells;
> > > > +        int size_cells = dt_root_size_cells;
> > > > +        u64 size;
> > > > +
> > > > +        if (reg_len / 4 > addr_cells + size_cells)
> > > > +                return -EINVAL;
> > > > +
> > > > +        range->start = PHYS_PFN(of_read_number(reg, addr_cells));
> > > > +        size = PHYS_PFN(of_read_number(reg + addr_cells, size_cells));
> > > > +        if (size == 0) {
> > > > +                pr_err("Invalid node");
> > > > +                return -EINVAL;
> > > > +        }
> > > > +        range->end = range->start + size - 1;
> > >
> > > We have a function to read (and translate which you forgot) addresses.
> > > Add what's missing rather than open code your own.
> >
> > I must have missed that there's already a function to read addresses. Would
> > you mind pointing me in the right direction?
>
> drivers/of/fdt_address.c
>
> Though it doesn't provide getting the size, so that will have to be added.
Ok, will do!
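For my own reference, the address half is already covered by
of_flat_dt_translate_address(), so the end result would look something like
the sketch below; of_flat_dt_get_reg_size() is a made-up name for the size
variant that would need to be added:

#include <linux/of.h>
#include <linux/of_fdt.h>
#include <linux/pfn.h>
#include <linux/range.h>

static int __init tag_storage_get_tag_range(unsigned long node,
                                            struct range *tag_range)
{
        u64 addr, size;

        /* Existing helper in drivers/of/fdt_address.c: reads and translates
         * the first "reg" entry of the node. */
        addr = of_flat_dt_translate_address(node);
        if (addr == OF_BAD_ADDR)
                return -EINVAL;

        /* Hypothetical: the size counterpart that is still missing. */
        size = of_flat_dt_get_reg_size(node);
        if (size == 0)
                return -EINVAL;

        tag_range->start = PHYS_PFN(addr);
        tag_range->end = PHYS_PFN(addr + size) - 1;

        return 0;
}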
>
>
> > > > +
> > > > +        return 0;
> > > > +}
> > > > +
> > > > +static int __init tag_storage_of_flat_get_tag_range(unsigned long node,
> > > > +                                                    struct range *tag_range)
> > > > +{
> > > > +        const __be32 *reg;
> > > > +        int reg_len;
> > > > +
> > > > +        reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > > > +        if (reg == NULL) {
> > > > +                pr_err("Invalid metadata node");
> > > > +                return -EINVAL;
> > > > +        }
> > > > +
> > > > +        return tag_storage_of_flat_get_range(node, reg, reg_len, tag_range);
> > > > +}
> > > > +
> > > > +static int __init tag_storage_of_flat_get_memory_range(unsigned long node, struct range *mem)
> > > > +{
> > > > +        const __be32 *reg;
> > > > +        int reg_len;
> > > > +
> > > > +        reg = of_get_flat_dt_prop(node, "linux,usable-memory", &reg_len);
> > > > +        if (reg == NULL)
> > > > +                reg = of_get_flat_dt_prop(node, "reg", &reg_len);
> > > > +
> > > > +        if (reg == NULL) {
> > > > +                pr_err("Invalid memory node");
> > > > +                return -EINVAL;
> > > > +        }
> > > > +
> > > > +        return tag_storage_of_flat_get_range(node, reg, reg_len, mem);
> > > > +}
> > > > +
> > > > +struct find_memory_node_arg {
> > > > +        unsigned long node;
> > > > +        u32 phandle;
> > > > +};
> > > > +
> > > > +static int __init fdt_find_memory_node(unsigned long node, const char *uname,
> > > > +                                       int depth, void *data)
> > > > +{
> > > > +        const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> > > > +        struct find_memory_node_arg *arg = data;
> > > > +
> > > > +        if (depth != 1 || !type || strcmp(type, "memory") != 0)
> > > > +                return 0;
> > > > +
> > > > +        if (of_get_flat_dt_phandle(node) == arg->phandle) {
> > > > +                arg->node = node;
> > > > +                return 1;
> > > > +        }
> > > > +
> > > > +        return 0;
> > > > +}
> > > > +
> > > > +static int __init tag_storage_get_memory_node(unsigned long tag_node, unsigned long *mem_node)
> > > > +{
> > > > +        struct find_memory_node_arg arg = { 0 };
> > > > +        const __be32 *memory_prop;
> > > > +        u32 mem_phandle;
> > > > +        int ret, reg_len;
> > > > +
> > > > +        memory_prop = of_get_flat_dt_prop(tag_node, "memory", &reg_len);
> > > > +        if (!memory_prop) {
> > > > +                pr_err("Missing 'memory' property in the tag storage node");
> > > > +                return -EINVAL;
> > > > +        }
> > > > +
> > > > +        mem_phandle = be32_to_cpup(memory_prop);
> > > > +        arg.phandle = mem_phandle;
> > > > +
> > > > +        ret = of_scan_flat_dt(fdt_find_memory_node, &arg);
> > >
> > > Do not use of_scan_flat_dt. It is a relic predating libfdt which can
> > > get a node by phandle directly.
> >
> > I used that because that's what drivers/of/fdt.c uses. With reserved memory
> > I shouldn't need it, because struct reserved_mem already includes a
> > phandle.
>
> Check again. Only some arch/ code (mostly powerpc) uses it. I've
> killed off most of it.
You're right, I think I grepped for a different function name in
drivers/of/fdt.c. Either way, the message is clear: no of_scan_flat_dt().
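For the flat tree I assume you mean going through libfdt's
fdt_node_offset_by_phandle(), which would shrink the lookup to roughly this
(a sketch, using initial_boot_params directly):

#include <linux/libfdt.h>
#include <linux/of_fdt.h>

/* Sketch: resolve the associated memory node by its phandle with libfdt,
 * instead of walking every node with of_scan_flat_dt(). Returns the node
 * offset, or a negative libfdt error. */
static int __init tag_storage_find_memory_node(u32 mem_phandle)
{
        return fdt_node_offset_by_phandle(initial_boot_params, mem_phandle);
}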
>
>
> > > > +        if (ret != 1) {
> > > > +                pr_err("Associated memory node not found");
> > > > +                return -EINVAL;
> > > > +        }
> > > > +
> > > > +        *mem_node = arg.node;
> > > > +
> > > > +        return 0;
> > > > +}
> > > > +
> > > > +static int __init tag_storage_of_flat_read_u32(unsigned long node, const char *propname,
> > > > +                                               u32 *retval)
> > >
> > > If you are going to make a generic function, make it for everyone.
> >
> > Sure. If I still need it, should I put the function in
> > include/linux/of_fdt.h?
>
> Yes.
Noted.
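Something along these lines, then, unless you'd prefer a different name
(just a sketch of what I would move there):

/* Sketch: a generic u32 reader for the flat DT, as a candidate for
 * include/linux/of_fdt.h (the name is my guess). */
static inline int of_flat_dt_read_u32(unsigned long node, const char *propname,
                                      u32 *out_value)
{
        const __be32 *prop = of_get_flat_dt_prop(node, propname, NULL);

        if (!prop)
                return -EINVAL;

        *out_value = be32_to_cpup(prop);
        return 0;
}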
>
> > > > +{
> > > > +        const __be32 *reg;
> > > > +
> > > > +        reg = of_get_flat_dt_prop(node, propname, NULL);
> > > > +        if (!reg)
> > > > +                return -EINVAL;
> > > > +
> > > > +        *retval = be32_to_cpup(reg);
> > > > +        return 0;
> > > > +}
> > > > +
> > > > +static u32 __init get_block_size_pages(u32 block_size_bytes)
> > > > +{
> > > > +        u32 a = PAGE_SIZE;
> > > > +        u32 b = block_size_bytes;
> > > > +        u32 r;
> > > > +
> > > > +        /* Find greatest common divisor using the Euclidean algorithm. */
> > > > +        do {
> > > > +                r = a % b;
> > > > +                a = b;
> > > > +                b = r;
> > > > +        } while (b != 0);
> > > > +
> > > > +        return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
> > > > +}
> > > > +
> > > > +static int __init fdt_init_tag_storage(unsigned long node, const char *uname,
> > > > +                                       int depth, void *data)
> > > > +{
> > > > +        struct tag_region *region;
> > > > +        unsigned long mem_node;
> > > > +        struct range *mem_range;
> > > > +        struct range *tag_range;
> > > > +        u32 block_size_bytes;
> > > > +        u32 nid = 0;
> > > > +        int ret;
> > > > +
> > > > +        if (depth != 1 || !strstr(uname, "tag-storage"))
> > > > +                return 0;
> > > > +
> > > > +        if (!of_flat_dt_is_compatible(node, "arm,mte-tag-storage"))
> > > > +                return 0;
> > > > +
> > > > +        if (num_tag_regions == MAX_TAG_REGIONS) {
> > > > +                pr_err("Maximum number of tag storage regions exceeded");
> > > > +                return -EINVAL;
> > > > +        }
> > > > +
> > > > +        region = &tag_regions[num_tag_regions];
> > > > +        mem_range = &region->mem_range;
> > > > +        tag_range = &region->tag_range;
> > > > +
> > > > +        ret = tag_storage_of_flat_get_tag_range(node, tag_range);
> > > > +        if (ret) {
> > > > +                pr_err("Invalid tag storage node");
> > > > +                return ret;
> > > > +        }
> > > > +
> > > > +        ret = tag_storage_get_memory_node(node, &mem_node);
> > > > +        if (ret)
> > > > +                return ret;
> > > > +
> > > > +        ret = tag_storage_of_flat_get_memory_range(mem_node, mem_range);
> > > > +        if (ret) {
> > > > +                pr_err("Invalid address for associated data memory node");
> > > > +                return ret;
> > > > +        }
> > > > +
> > > > +        /* The tag region must exactly match the corresponding memory. */
> > > > +        if (range_len(tag_range) * 32 != range_len(mem_range)) {
> > > > +                pr_err("Tag storage region 0x%llx-0x%llx does not cover the memory region 0x%llx-0x%llx",
> > > > +                       PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end),
> > > > +                       PFN_PHYS(mem_range->start), PFN_PHYS(mem_range->end));
> > > > +                return -EINVAL;
> > > > +        }
> > > > +
> > > > +        ret = tag_storage_of_flat_read_u32(node, "block-size", &block_size_bytes);
> > > > +        if (ret || block_size_bytes == 0) {
> > > > +                pr_err("Invalid or missing 'block-size' property");
> > > > +                return -EINVAL;
> > > > +        }
> > > > +        region->block_size = get_block_size_pages(block_size_bytes);
> > > > +        if (range_len(tag_range) % region->block_size != 0) {
> > > > +                pr_err("Tag storage region size 0x%llx is not a multiple of block size %u",
> > > > +                       PFN_PHYS(range_len(tag_range)), region->block_size);
> > > > +                return -EINVAL;
> > > > +        }
> > > > +
> > > > +        ret = tag_storage_of_flat_read_u32(mem_node, "numa-node-id", &nid);
> > >
> > > I was going to say we already have a way to associate memory nodes with
> > > other nodes using "numa-node-id", so the "memory" phandle property is
> > > somewhat redundant. Maybe the tag node should have a numa-node-id.
> > > With that, it looks like you don't even need to access the /memory
> > > node. Avoiding that would be good for 2 reasons. It avoids parsing
> > > memory nodes twice and it's not the kernel's job to validate the DT.
> > > Really, if you want memory info, you should use memblock to get it
> > > because all the special cases of memory layout are handled. For
> > > example you can have memory nodes with multiple 'reg' entries or
> > > multiple memory nodes or both, and then some of those could be
> > > contiguous.
> >
> > I need to have a memory node associated with the tag storage node because
> > there is a static relationship between a page from "normal" memory and its
> > associated tag storage. If the code doesn't know that the memory region
> > A..B has the corresponding tag storage in the region X..Y, then it doesn't
> > know which tag storage to reserve when a page is allocated as tagged.
> >
> > In the example above, assuming that page P is allocated as tagged, the
> > corresponding tag storage page that needs to be reserved is:
> >
> > tag_storage_pfn = (page_to_pfn(P) - PHYS_PFN(A)) / 32 + PHYS_PFN(X)
> >
> > numa-node-id is not enough for this, because as far as I know you can have
> > multiple memory regions within the same numa node.
>
> Okay.
Great, glad we are on the same page.
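Just to spell the mapping out, the lookup I have in mind is roughly the
sketch below (written against the struct tag_region fields in this patch;
the helper name is made up):

/* Sketch: given a page allocated as tagged and the tag_region that covers
 * it, return the PFN of its tag storage. Both ranges are kept as PFNs, and
 * 32 data pages share one tag storage page. */
static unsigned long page_to_tag_storage_pfn(struct page *page,
                                             struct tag_region *region)
{
        unsigned long pfn = page_to_pfn(page);

        return (pfn - region->mem_range.start) / 32 + region->tag_range.start;
}

That static relationship is why the association between the two regions
needs to be explicit in the DT.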
Thanks,
Alex