Re: [linus:master] [mm] efa7df3e3b: kernel_BUG_at_include/linux/page_ref.h

From: David Hildenbrand
Date: Mon Jun 03 2024 - 11:43:03 EST


On 03.06.24 17:08, Peter Xu wrote:
On Sat, Jun 01, 2024 at 08:22:21AM +0200, David Hildenbrand wrote:
On 01.06.24 02:59, Yang Shi wrote:
On Fri, May 31, 2024 at 5:01 PM Yang Shi <shy828301@xxxxxxxxx> wrote:

On Fri, May 31, 2024 at 4:25 PM Peter Xu <peterx@xxxxxxxxxx> wrote:

On Fri, May 31, 2024 at 07:46:41PM +0200, David Hildenbrand wrote:
try_grab_folio()->try_get_folio()->folio_ref_try_add_rcu()

Is called (mm-unstable) from:

(1) gup_fast function, here IRQs are disable
(2) gup_hugepte(), possibly problematic
(3) memfd_pin_folios(), possibly problematic
(4) __get_user_pages(), likely problematic

(1) should be fine.

(2) is possibly problematic on the !fast path. If so, due to commit
a12083d721d7 ("mm/gup: handle hugepd for follow_page()") ? CCing Peter.

(3) is possibly wrong. CCing Vivek.

(4) is what we hit here

I guess it was overlooked because try_grab_folio() didn't have any comment
or implication on RCU or IRQ internal helpers being used, hence a bit
confusing. E.g. it has different context requirement on try_grab_page(),
even though they look like sister functions. It might be helpful to have a
better name, something like try_grab_folio_rcu() in this case.

Btw, none of above cases (2-4) have real bug, but we're looking at some way
to avoid triggering the sanity check, am I right? I hope besides the host
splash I didn't overlook any other side effect this issue would cause, and
the splash IIUC should so far be benign, as either gup slow (2,4) or the
newly added memfd_pin_folios() (3) look like to have the refcount stablized
anyway.

Yang's patch in the other email looks sane to me, just that then we'll add
quite some code just to avoid this sanity check in paths 2-4 which seems
like an slight overkill.

One thing I'm thinking is whether folio_ref_try_add_rcu() can get rid of
its RCU limitation. It boils down to whether we can use atomic_add_unless()
on TINY_RCU / UP setup too? I mean, we do plenty of similar things
(get_page_unless_zero, etc.) in generic code and I don't understand why
here we need to treat folio_ref_try_add_rcu() specially.

IOW, the assertions here we added:

VM_BUG_ON(!in_atomic() && !irqs_disabled());

Is because we need atomicity of below sequences:

VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
folio_ref_add(folio, count);

But atomic ops avoids it.

Yeah, I didn't think of why atomic can't do it either. But is it
written in this way because we want to catch the refcount == 0 case
since it means a severe bug? Did we miss something?

Thought more about it and disassembled the code. IIUC, this is an
optimization for non-SMP kernel. When in rcu critical section or irq
is disabled, we just need an atomic add instruction.
folio_ref_add_unless() would yield more instructions, including branch
instruction. But I'm wondering how useful it would be nowadays. Is it
really worth the complexity? AFAIK, for example, ARM64 has not
supported non-SMP kernel for years.

My patch actually replaced all folio_ref_add_unless() to
folio_ref_add() for slow paths, so it is supposed to run faster, but
we are already in slow path, it may be not measurable at all. So
having more simple and readable code may outweigh the potential slight
performance gain in this case?

Yes, we don't want to use atomic RMW that return values where we can use
atomic RMW that don't return values. The former is slower and implies a
memory barrier, that can be optimized out on some arcitectures (arm64 IIRC)

We should clean that up here, and make it clearer that the old function is
only for grabbing a folio if it can be freed concurrently -- GUP-fast.

Note again that this only affects TINY_RCU, which mostly implies
!PREEMPTION and UP. It's a matter of whether we prefer adding these bunch
of code to optimize that.

Also we didn't yet measure that in a real workload and see how that
"unless" plays when buried in other paths, but then we'll need a special
kernel build first, and again I'm not sure whether it'll be worthwhile.

try_get_folio() is all about grabbing a folio that might get freed concurrently. That's why it calls folio_ref_try_add_rcu() and does complicated stuff.

On !CONFIG_TINY_RCU, it performs a folio_ref_add_unless(). That's essentially a atomic_add_unless(), which in the worst case ends up being a cmpxchg loop.


Stating that we should be using try_get_folio() in paths where we are sure the folio refcount is not 0 is the same as using folio_try_get() where folio_get() would be sufficient.

The VM_BUG_ON in folio_ref_try_add_rcu() really tells us here that we are using a function in the wrong context, although in our case, it is safe to use (there is now BUG). Which is true, because we know we have a folio reference and can simply use a simple folio_ref_add().

Again, just like we have folio_get() and folio_try_get(), we should distinguish in GUP whether we are adding more reference to a folio (and effectively do what folio_get() would), or whether we are actually grabbing a folio that could be freed concurrently (what folio_try_get() would do).

--
Cheers,

David / dhildenb