Re: [PATCH] mm: migration :shared anonymous migration test is failing

From: Donet Tom
Date: Mon Dec 23 2024 - 07:03:13 EST

Next message: Jonathan Cameron: "Re: [PATCH v2 2/6] iio: adc: ad4030: add driver for ad4030-24"
Previous message: Konrad Dybcio: "Re: [PATCH v7 5/5] arm64: dts: qcom: Add CTCU and ETR nodes for SA8775p"
In reply to: Dev Jain: "Re: [PATCH] mm: migration :shared anonymous migration test is failing"
Next in thread: kernel test robot: "Re: [PATCH] mm: migration :shared anonymous migration test is failing"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 12/20/24 10:07, Dev Jain wrote:

On 20/12/24 9:02 am, Baolin Wang wrote:

On 2024/12/20 11:12, Donet Tom wrote:

On 12/20/24 08:01, Baolin Wang wrote:

On 2024/12/19 20:47, Donet Tom wrote:

The migration selftest is currently failing for shared anonymous
mappings due to a race condition.

During migration, the source folio's PTE is unmapped by nuking the
PTE, flushing the TLB,and then marking the page for migration
(by creating the swap entries). The issue arises when, immediately
after the PTE is nuked and the TLB is flushed, but before the page
is marked for migration, another thread accesses the page. This
triggers a page fault, and the page fault handler invokes
do_pte_missing() instead of do_swap_page(), as the page is not yet
marked for migration.

In the fault handling path, do_pte_missing() calls __do_fault()
->shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
This eventually calls folio_try_get(), incrementing the reference
count of the folio undergoing migration. The thread then blocks
on folio_lock(), as the migration path holds the lock. This
results in the migration failing in __migrate_folio(), which expects
the folio's reference count to be 2. However, the reference count is
incremented by the fault handler, leading to the failure.

The issue arises because, after nuking the PTE and before marking the
page for migration, the page is accessed. To address this, we have
updated the logic to first nuke the PTE, then mark the page for
migration, and only then flush the TLB. With this patch, If the page is
accessed immediately after nuking the PTE, the TLB entry is still
valid, so no fault occurs. After marking the page for migration,

IMO, I don't think this assumption is correct. At this point, the TLB entry might also be evicted, so a page fault could still occur. It's just a matter of probability.
In this patch, we mark the page for migration before flushing the TLB.
This ensures that if someone accesses the page after the TLB flush,
the page fault will occur and in the page fault handler will wait for the
migration to complete. So migration will not fail

Without this patch, if someone accesses the page after the TLB flush
but before it is marked for migration, the migration will fail.

Actually my concern is the same as David's (I did not see David's reply before sending my comments), which is that your patch does not "rules out all cases".

I like this solution but really the proper solution for this one was to atomically set the migration entry IMHO.

Thank you Dev. I will check and come back on this.

Additionally, IIUC, if another thread is accessing the shmem folio causing the migration to fail, I think this is expected, and migration failure is not a vital issue?

In my case, the shmem migration test is always failing,
even after retries. Would it be correct to consider this
as expected behavior?

IMHO I think your test case is too aggressive and unlikely to occur in real-world scenarios. Additionally, as I mentioned, migration failure is not a vital issue in the system, and some temporary refcnt can also lead to migration failure if you want to create such test cases. So personally, I don't think it is worthy doing.

Agreed, AFAIR the test case starts faulting exactly on those pages which we want to migrate, making this a very artificial scenario.

Next message: Jonathan Cameron: "Re: [PATCH v2 2/6] iio: adc: ad4030: add driver for ad4030-24"
Previous message: Konrad Dybcio: "Re: [PATCH v7 5/5] arm64: dts: qcom: Add CTCU and ETR nodes for SA8775p"
In reply to: Dev Jain: "Re: [PATCH] mm: migration :shared anonymous migration test is failing"
Next in thread: kernel test robot: "Re: [PATCH] mm: migration :shared anonymous migration test is failing"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]