Re: [PATCH] mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa

From: Mel Gorman
Date: Sun Feb 16 2020 - 18:32:53 EST


On Sun, Feb 16, 2020 at 02:18:00PM -0500, Rafael Aquini wrote:
> From: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> A user reported a bug against a distribution kernel while running
> a proprietary workload described as "memory intensive that is not
> swapping" that is expected to apply to mainline kernels. The workload
> is read/write/modifying ranges of memory and checking the contents. They
> reported that within a few hours that a bad PMD would be reported followed
> by a memory corruption where expected data was all zeros. A partial report
> of the bad PMD looked like
>
> [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> [ 5195.341184] ------------[ cut here ]------------
> [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> ....
> [ 5195.410033] Call Trace:
> [ 5195.410471] [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> [ 5195.410716] [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> [ 5195.410918] [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> [ 5195.411200] [<ffffffff81098322>] task_work_run+0x72/0x90
> [ 5195.411246] [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> [ 5195.411494] [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> [ 5195.411739] [<ffffffff815e56af>] retint_user+0x8/0x10
>
> Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> was a false detection. The bug does not trigger if automatic NUMA balancing
> or transparent huge pages is disabled.
>
> The bug is due a race in change_pmd_range between a pmd_trans_huge and
> pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
> check, a parallel protection update under lock can have cleared the PMD
> and filled it with a prot_numa entry between the transhuge check and the
> pmd_none_or_clear_bad check.
>
> While this could be fixed with heavy locking, it's only necessary to
> make a copy of the PMD on the stack during change_pmd_range and avoid
> races. A new helper is created for this as the check if quite subtle and the
> existing similar helpful is not suitable. This passed 154 hours of testing
> (usually triggers between 20 minutes and 24 hours) without detecting bad
> PMDs or corruption. A basic test of an autonuma-intensive workload showed
> no significant change in behaviour.
>
> Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922
> the race window aforementioned is still open, and we have reports of Linpack test reporting bad
> residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and
> non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD.
>
> host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
> ....
> host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
> host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
>
> The issue is observed on a v4.18-based distribution kernel, but the race window is
> expected to be applicable to mainline kernels, as well.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Rafael Aquini <aquini@xxxxxxxxxx>

It's curious that it took so long for this to be caught again.
Unfortunately I cannot find exactly what it's racing against but maybe
it's not worth chasing down and the patch is simply the safer option :(

--
Mel Gorman
SUSE Labs