Re: [PATCH 1/1] mm: protect xa split stuff under lruvec->lru_lock during migration

From: Marcin Wanat
Date: Tue May 21 2024 - 11:47:53 EST


On 21.05.2024 03:00, Zhaoyang Huang wrote:
On Tue, May 21, 2024 at 8:58 AM Zhaoyang Huang <huangzhaoyang@xxxxxxxxx> wrote:

On Tue, May 21, 2024 at 3:42 AM Marcin Wanat <private@xxxxxxxxxxxxxx> wrote:

On 15.04.2024 03:50, Zhaoyang Huang wrote:
I have around 50 hosts handling high I/O (each with 20Gbps+ uplinks
and multiple NVMe drives), running RockyLinux 8/9. The stock RHEL 8/9
kernels are NOT affected, and the long-term kernel 5.15.X is NOT affected.
However, with long-term kernels 6.1.XX and 6.6.XX
(at least 10 different versions tested), this lockup always appears
after 2-30 days, similar to the report in the original thread.
The more load (for example, copying a lot of local files while
serving 20Gbps traffic), the higher the chance that the bug will appear.

I haven't been able to reproduce this during synthetic tests,
but it always occurs in production on 6.1.X and 6.6.X within 2-30 days.
If anyone can provide a patch, I can test it on multiple machines
over the next few days.
Could you please try this one, which can be applied on 6.6 directly? Thank you!
URL: https://lore.kernel.org/linux-mm/20240412064353.133497-1-zhaoyang.huang@xxxxxxxxxx/


Unfortunately, I am unable to cleanly apply this patch against the latest 6.6.31