Re: [PATCH] arm64/numa: Add more vetting in numa_set_distance()

From: Anshuman Khandual
Date: Thu Nov 08 2018 - 09:20:41 EST




On 11/01/2018 04:57 PM, Will Deacon wrote:
> [ Nit: Please wrap your lines when replying -- I've done it for you here ]
>
> On Tue, Oct 30, 2018 at 08:16:21AM +0530, Anshuman Khandual wrote:
>> On 10/29/2018 08:14 PM, John Garry wrote:
>>>>>>> ÂI think we should either factor out the sanity check
>>>>>>>> into a core helper or make the core code robust to these funny configurations.
>>>>>>>
>>>>>>> OK, so to me it would make sense to factor out a sanity check into a core
>>>>>>> helper.
>>>>>>
>>>>>> That, or have the OF code perform the same validation that slit_valid() is
>>>>>> doing for ACPI. I'm just trying to avoid other architectures running into
>>>>>> this problem down the line.
>>>>>>
>>>>>
>>>>> Right, OF code should do this validation job if ACPI is doing it
>>>>> (especially since the DT bindings actually specify the distance
>>>>> rules), and not rely on the arch NUMA code to accept/reject
>>>>> numa_set_distance() combinations.
>>>>
>>>> I would say this particular condition checking still falls under arch NUMA init
>>>> code sanity check like other basic tests what numa_set_distance() currently does
>>>> already but it should not be a necessity for the OF driver to check these.
>>>
>>> The checks in the arch NUMA code mean that invalid inter-node distance
>>> combinations are ignored.
>>
>> Right and should not this new test (from != to && distance == LOCAL_DISTANCE) be
>> one of them as well ? numa_set_distance() updates the table or just throws some
>> warnings while skipping entries it deems invalid. It would be okay to have this
>> new check there in addition to others like this patch suggests.

I missed this response due to some problem in my mail client and re-iterated
some of the same points again on the V2 (https://lkml.org/lkml/2018/11/8/734).
My apologies for the same.

>
> If we're changing numa_set_distance(), I'd actually be inclined to do the
> opposite of what you're suggesting and move as much of this boilerplate
> checking into the core code. Perhaps we could add something like
> check_fw_numa_topology() and have both ACPI and OF call that so that the
> arch doesn't need to worry about malformed tables at all.

Right although I doubt that we could ever have a common check_fw_numa_topology()
check as the semantics for the table and individual entries there in for DT and
ACPI might be different. But I agree what you are saying.

>
>>> However, if any entries in the table are invalid, then the whole table
>>> can be discarded as none of it can be believed, i.e. it's better to
>>> validate the table.
>>>
>>
>> Agreed. slit_valid() on the ACPI parsing is currently enforcing that before
>> acpi_numa_slit_init() which would call into numa_set_distance(). Hence arch
>> NUMA code numa_set_distance() never had the opportunity to do the sanity
>> checks as ACPI slit_valid() has completely invalidated the table.
>>
>> Unlike ACPI path, of_numa_parse_distance_map_v1() does not do any sanity
>> checks on the distance values parse from the "distance-matrix" property
>> and all the checks directly falls on numa_set_distance(). This needs to
>> be fixed in line with ACPI
>>
>> * If (to == from) ---> distance = LOCAL_DISTANCE
>> * If (to != from) ---> distance > LOCAL_DISTANCE
>>
>> At the same time its okay to just enhance numa_set_distance() test coverage
>> to include this new test. If we would have trusted firmware parsing all the
>> way, existing basic checks about node range, distance stuff should not have
>> been there in numa_set_distance(). Hence IMHO even if we fix the OF driver
>> part, we should include this new check there as well.
>
> I don't see what we gain by duplicating the check. In fact, it has a few
> downsides:
>
> (1) It confuses the semantics of the API, because it is no longer clear
> who "owns" the check
>
> (2) It duplicates code in each architecture
>
> (3) Some clever-cloggs will remove at least some of the duplication in
> future

Agreed, it has down sides.

>
> I'm not willing to accept the check in the arm64 code if we update the
> OF code.

Okay, we should remove them instead.

>
> I think the way forward here is for John to fix the crash he reported by
> adding the check to the OF code. If somebody wants to follow up with
> subsequent patches to move more of the checking out of the arch code, then
> we can review that as a separate series.

Okay. Anyways I had some other comments related to semantics checking at
the OF driver level but I can probably send them out as a separate patch
later. Once again it was my bad to miss this response in the first place.