Re: Endless calls to xas_split_alloc() due to corrupted xarray entry

From: Matthew Wilcox
Date: Wed Jun 19 2024 - 10:31:37 EST

Next message: Jason Gunthorpe: "Re: [PATCH v3 08/21] media: nvidia: tegra: Use iommu_paging_domain_alloc()"
Previous message: Krishna Chaitanya Chundru: "Re: [PATCH v14 3/4] PCI: Bring the PCIe speed to MBps logic to new pcie_link_speed_to_mbps()"
In reply to: David Hildenbrand: "Re: Endless calls to xas_split_alloc() due to corrupted xarray entry"
Next in thread: Linus Torvalds: "Re: Endless calls to xas_split_alloc() due to corrupted xarray entry"

On Wed, Jun 19, 2024 at 11:45:22AM +0200, David Hildenbrand wrote:
> I recall talking to Willy at some point about the problem of order-13 not
> being fully supported by the pagecache right now (IIRC primarily splitting,
> which should not happen for hugetlb, which is why there it is not a
> problem). And I think we discussed just blocking that for now.
>
> So we are trying to split an order-13 entry, because we ended up
> allocating+mapping an order-13 folio previously.
>
> That's where things got wrong, with the current limitations, maybe?
>
> #define MAX_PAGECACHE_ORDER HPAGE_PMD_ORDER
>
> Which would translate to MAX_PAGECACHE_ORDER=13 on aarch64 with 64k.
>
> Staring at xas_split_alloc:
>
> WARN_ON(xas->xa_shift + 2 * XA_CHUNK_SHIFT < order)
>
> I suspect we don't really support THP on systems with CONFIG_BASE_SMALL.
> So we can assume XA_CHUNK_SHIFT == 6.
>
> I guess that the maximum order we support for splitting is 12? I got confused
> trying to figure that out. ;)

Actually, it's 11. We can't split an order-12 folio because we'd have
to allocate two levels of radix tree, and I decided that was too much
work. Also, I didn't know that ARM used order-13 PMD size at the time.
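
To make the order-11 limit concrete, here is a rough sketch of the arithmetic (not the xarray code itself). It assumes XA_CHUNK_SHIFT == 6 as in the quoted text, and that an entry of a given order sits in a node whose shift is the order rounded down to a multiple of the chunk shift. The names sketch_node_shift() and sketch_split_levels() are made up for illustration.

```c
#include <assert.h>

/* Assumption: XA_CHUNK_SHIFT == 6 (64 slots per node), as in the quoted text. */
#define SKETCH_CHUNK_SHIFT 6

/* Hypothetical helper: shift of the node whose slots hold an entry of @order. */
static unsigned int sketch_node_shift(unsigned int order)
{
	return order - (order % SKETCH_CHUNK_SHIFT);
}

/*
 * Hypothetical helper: how many new levels of nodes must be created
 * underneath the existing slot to split an entry from @old_order down
 * to @new_order. Zero means the split fits in the existing node.
 */
static unsigned int sketch_split_levels(unsigned int old_order,
					unsigned int new_order)
{
	assert(new_order <= old_order);
	return (sketch_node_shift(old_order) - sketch_node_shift(new_order))
		/ SKETCH_CHUNK_SHIFT;
}
```

Under these assumptions, splitting order-11 to order-0 needs one new level, while order-12 and order-13 both need two, which is the case described above as too much work.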

I think this is the best fix (modulo s/12/11/). Zi Yan and I discussed
improving split_folio() so that it doesn't need to split the entire
folio to order-N. But that's for the future, and this is the right fix
for now.

For the interested, when we say "I need to split", usually, we mean "I
need to split _this_ part of the folio to order-N", and we're quite
happy to leave the rest of the folio as intact as possible. If we do
that, then splitting from order-13 to order-0 becomes quite a tractable
task, since we only need to allocate 2 radix tree nodes, not 65.

/**
 * folio_split - Split a smaller folio out of a larger folio.
 * @folio: The containing folio.
 * @page_nr: The page offset within the folio.
 * @order: The order of the folio to return.
 *
 * Splits a folio of order @order from the containing folio.
 * The returned folio will contain the page specified by @page_nr,
 * but that page may not be the first page in the returned folio.
 *
 * Context: Caller must hold a reference on @folio and have the folio
 * locked.  The returned folio will be locked and have an elevated
 * refcount; all other folios created by splitting the containing
 * folio will be unlocked and not have an elevated refcount.
 */
struct folio *folio_split(struct folio *folio, unsigned long page_nr,
		unsigned int order);

> I think this does not apply to hugetlb because we never end up splitting
> entries. But could this also apply to shmem + PMD THP?

Urgh, good point. We need to make that fail on arm64 with 64KB page
size. Fortunately, it almost always failed anyway; it's really hard to
allocate 512MB pages.
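
For anyone checking where order-13 and 512MB come from, a small sketch of the arithmetic follows. It assumes 8-byte page table entries and that each page table fills exactly one page, so the PMD order is the page shift minus 3. The helper names are hypothetical and do not exist in the kernel.

```c
#include <assert.h>

/*
 * Hypothetical helper: order of a PMD-sized folio for a given base page
 * shift. Assumes 8-byte entries and one page per table, so a table
 * holds 2^(page_shift - 3) entries.
 */
static unsigned int sketch_pmd_order(unsigned int page_shift)
{
	assert(page_shift > 3);
	return page_shift - 3;
}

/* Hypothetical helper: size in bytes of a folio of @order base pages. */
static unsigned long long sketch_folio_bytes(unsigned int page_shift,
					     unsigned int order)
{
	return 1ULL << (page_shift + order);
}
```

With a 64KB base page (page shift 16) this gives order 13 and a 512MB folio; with a 4KB base page (page shift 12) it gives order 9 and 2MB.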
