Re: Regression: QCA6390 fails with "mm/page_alloc: place pages to tail in __free_pages_core()"
From: David Hildenbrand
Date: Thu Nov 05 2020 - 06:13:53 EST
> On 05.11.2020 at 11:42, Vlastimil Babka <vbabka@xxxxxxx> wrote:
>
> On 11/5/20 10:04 AM, Kalle Valo wrote:
>> (changing the subject, adding more lists and people)
>> Pavel Procopiuc <pavel.procopiuc@xxxxxxxxx> writes:
>>> On 04.11.2020 at 10:12, Kalle Valo wrote:
>>>> Yeah, it is unfortunately time consuming but it is the best way to get
>>>> to the bottom of this.
>>>
>>> I have found the commit that breaks things for me, it's
>>> 7fef431be9c9ac255838a9578331567b9dba4477 mm/page_alloc: place pages to
>>> tail in __free_pages_core()
>>>
>>> I've reverted it on top of the 5.10-rc2 and ath11k driver loads fine
>>> and I have wifi working.
>> Oh, very interesting. Thanks a lot for the bisection, otherwise we would
>> have never found out what's causing this.
>> David & mm folks: Pavel noticed that his QCA6390 Wi-Fi 6 device (driver
>> ath11k) failed on v5.10-rc1. After bisecting he found that the commit
>> below causes the regression. I have not been able to reproduce this and
>> for me QCA6390 works fine. I don't know if this needs a specific kernel
>> configuration or what's the difference between our setups.
>> Any ideas what might cause this and how to fix it?
>> Full discussion: http://lists.infradead.org/pipermail/ath11k/2020-November/000501.html
>> commit 7fef431be9c9ac255838a9578331567b9dba4477
>> Author: David Hildenbrand <david@xxxxxxxxxx>
>> AuthorDate: Thu Oct 15 20:09:35 2020 -0700
>> Commit: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>> CommitDate: Fri Oct 16 11:11:18 2020 -0700
>> mm/page_alloc: place pages to tail in __free_pages_core()
>
> Let me paste from the ath11k discussion:
>
>> * Relevant errors from the log:
>> # journalctl -b | grep -iP '05:00|ath11k'
>> Nov 02 10:41:26 razor kernel: pci 0000:05:00.0: [17cb:1101] type 00 class 0x028000
>> Nov 02 10:41:26 razor kernel: pci 0000:05:00.0: reg 0x10: [mem 0xd2100000-0xd21fffff 64bit]
>> Nov 02 10:41:26 razor kernel: pci 0000:05:00.0: PME# supported from D0 D3hot D3cold
>> Nov 02 10:41:26 razor kernel: pci 0000:05:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0000:00:1c.1 (capable of 7.876 Gb/s with 8.0 GT/s PCIe x1 link)
>> Nov 02 10:41:26 razor kernel: pci 0000:05:00.0: Adding to iommu group 21
>> Nov 02 10:41:27 razor kernel: ath11k_pci 0000:05:00.0: WARNING: ath11k PCI support is experimental!
>> Nov 02 10:41:27 razor kernel: ath11k_pci 0000:05:00.0: BAR 0: assigned [mem 0xd2100000-0xd21fffff 64bit]
>> Nov 02 10:41:27 razor kernel: ath11k_pci 0000:05:00.0: enabling device (0000 -> 0002)
>> Nov 02 10:41:27 razor kernel: mhi 0000:05:00.0: Requested to power ON
>> Nov 02 10:41:27 razor kernel: mhi 0000:05:00.0: Power on setup success
>> Nov 02 10:41:27 razor kernel: ath11k_pci 0000:05:00.0: Respond mem req failed, result: 1, err: 0
>
> This seems to be ath11k_qmi_respond_fw_mem_request(). Why does it fail with error 0? No idea.
>
> What would happen if all the GFP_KERNEL in the file were changed to GFP_DMA32?
>
> I'm thinking the hardware perhaps doesn't like too high physical addresses or something. But if I think correctly, freeing to tail should actually move them towards head. So it's weird.
It depends on the order in which memory is exposed to MM, which in some configurations might depend on other factors.
This smells like the commit exposes an existing bug. Can you also reproduce it with zone shuffling enabled?
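
To illustrate the ordering point, here is a minimal userspace sketch, not kernel code. It assumes a toy free list where allocation always takes from the head; struct free_list, fl_add_head(), fl_add_tail(), fl_alloc() and first_allocated() are hypothetical names made up for this example. It only shows that whether head or tail placement hands out low or high page frame numbers first depends on the order in which the pages were exposed.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 8   /* toy capacity; callers must expose at most NPAGES pages */

/* Toy free list: an array-backed deque of page frame numbers (hypothetical). */
struct free_list {
    unsigned long pfn[2 * NPAGES];
    size_t head;    /* index of the first entry */
    size_t count;   /* number of entries */
};

static void fl_init(struct free_list *fl)
{
    fl->head = NPAGES;  /* start in the middle so both ends can grow */
    fl->count = 0;
}

/* Place a freed page at the head: it is the next one to be allocated. */
static void fl_add_head(struct free_list *fl, unsigned long pfn)
{
    fl->pfn[--fl->head] = pfn;
    fl->count++;
}

/* Place a freed page at the tail: it is allocated after everything else. */
static void fl_add_tail(struct free_list *fl, unsigned long pfn)
{
    fl->pfn[fl->head + fl->count] = pfn;
    fl->count++;
}

/* Allocation in this toy model always takes the head entry. */
static unsigned long fl_alloc(struct free_list *fl)
{
    assert(fl->count > 0);
    fl->count--;
    return fl->pfn[fl->head++];
}

/*
 * Expose n pages (n <= NPAGES) in the given order, placing each one at the
 * tail (to_tail != 0) or at the head (to_tail == 0), then return the first
 * page an allocation would get.
 */
unsigned long first_allocated(const unsigned long *order, size_t n, int to_tail)
{
    struct free_list fl;
    size_t i;

    assert(n > 0 && n <= NPAGES);
    fl_init(&fl);
    for (i = 0; i < n; i++) {
        if (to_tail)
            fl_add_tail(&fl, order[i]);
        else
            fl_add_head(&fl, order[i]);
    }
    return fl_alloc(&fl);
}
```

With pages exposed in ascending order, head placement hands out the highest pfn first and tail placement the lowest. With pages exposed in descending order the result flips. Zone shuffling (CONFIG_SHUFFLE_PAGE_ALLOCATOR, enabled with page_alloc.shuffle=1 on the kernel command line) randomizes the free lists, so it should tell us whether the failure depends on which physical pages the driver happens to get.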