Re: [PATCH] mmu_notifiers: Notify on pte permission upgrades

From: Alistair Popple
Date: Tue May 23 2023 - 00:36:17 EST

Next message: Yonghong Song: "Re:"
Previous message: Yonghong Song: "Re: [bug] kernel: bpf: syscall: a possible sleep-in-atomic bug in __bpf_prog_put()"
In reply to: John Hubbard: "Re: [PATCH] mmu_notifiers: Notify on pte permission upgrades"
Next in thread: John Hubbard: "Re: [PATCH] mmu_notifiers: Notify on pte permission upgrades"

John Hubbard <jhubbard@xxxxxxxxxx> writes:

> On 5/22/23 16:50, Alistair Popple wrote:
> ...
>>> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
>>> is perfectly valid. And AFAICT, the existing invalidate_range() hook is pretty
>>> much a perfect fit for what you want to achieve.
>> Right, I didn't take that approach because it doesn't allow an event
>> type to be passed which would allow them to be filtered on platforms
>> which don't require this.
>> I had also assumed the invalidate_range() callbacks were allowed to
>> sleep, hence couldn't be called under PTL. That's certainly true of mmu
>> interval notifier callbacks, but Catalin reminded me that calls such as
>> ptep_clear_flush_notify() already call invalidate_range() callback under
>> PTL so I guess we already assume drivers don't sleep in their
>> invalidate_range() callbacks. I will update the comments to reflect
>
> This line of reasoning feels very fragile. The range notifiers generally
> do allow sleeping. They are using srcu (sleepable RCU) protection, btw.

Regardless of how well documented this is or isn't (it isn't currently,
but it used to be) it certainly seems to be a well established rule that
the .invalidate_range() callback cannot sleep. The vast majority of
callers do call this holding the PTL, and comments make it explicit that
this is somewhat expected:

E.g. in rmap.c:

* No need to call mmu_notifier_invalidate_range() it has be
* done above for all cases requiring it to happen under page
* table lock before mmu_notifier_invalidate_range_end()

> The fact that existing callers are calling these under PTL just means
> that so far, that has sorta worked. And yes, we can probably make this
> all work. That's not really the ideal way to deduce the API rules, though,
> and it would be good to clarify what they really are.

Of course not. I will update the documentation to clarify this, but see
below for some history which clarifies how we got here.

> Aside from those use cases, I don't see anything justifying a "not allowed
> to sleep" rule for .invalidate_range(), right?

Except that "those use cases" are approximately all of the use cases
AFAICT, and as it turns out this was actually a rule when
.invalidate_range() was added.

Commit 0f0a327fa12c ("mmu_notifier: add the callback for
mmu_notifier_invalidate_range()") included this in the documentation:

* The invalidate_range() function is called under the ptl
* spin-lock and not allowed to sleep.

This was later removed in 5ff7091f5a2c ("mm, mmu_notifier: annotate mmu
notifiers with blockable invalidate callbacks") which introduced the
MMU_INVALIDATE_DOES_NOT_BLOCK flag:

* If this [invalidate_range()] callback cannot block, and invalidate_range_{start,end}
* cannot block, mmu_notifier_ops.flags should have
* MMU_INVALIDATE_DOES_NOT_BLOCK set.

However, the removal of the original comment seems wrong:
invalidate_range() was still being called under the ptl and therefore
could not block, regardless of whether MMU_INVALIDATE_DOES_NOT_BLOCK
was set.

Of course the flag and related documentation were removed shortly after
by 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
notifiers") and 4e15a073a168 ("Revert "mm, mmu_notifier: annotate mmu
notifiers with blockable invalidate callbacks"").

None of those changes actually made it safe for .invalidate_range()
callbacks to sleep, nor was that their goal. They were all about making
sure it was ok for .invalidate_range_start() to sleep AFAICT.

So I think it's perfectly fine to require .invalidate_range() callbacks
to be non-blocking, and if they aren't, that's a driver bug. Note that
this isn't talking about mmu *interval* notifiers - they are slightly
different and don't hook into the mmu_notifier_invalidate_range() call.
They use start()/end() and as such are allowed to sleep.
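
To make the rule concrete, here is a rough userspace model of it. This
is not kernel code, and every name in it (ptl_held, model_blocking_op(),
update_pte_and_notify() and so on) is made up for illustration. It only
assumes what is described above: the caller holds the PTL while invoking
the callback, so a callback that blocks is a bug, and one that only does
non-blocking work is fine.

```c
/* Userspace model only: none of these names are real kernel APIs. */
#include <assert.h>
#include <stdbool.h>

static bool ptl_held;         /* models "the page table lock is held" */
static int sleep_violations;  /* blocking operations attempted under the lock */
static int flushes;           /* non-blocking invalidations performed */

typedef void (*invalidate_range_fn)(unsigned long start, unsigned long end);

/* Models anything that may sleep (mutex, allocation that can wait, ...). */
static void model_blocking_op(void)
{
	if (ptl_held)
		sleep_violations++;   /* would be "sleeping while atomic" */
}

/* Buggy callback: blocks even though it may be called under the lock. */
static void bad_invalidate_range(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	model_blocking_op();
}

/* Acceptable callback: only does work that never blocks. */
static void good_invalidate_range(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	flushes++;                /* stands in for a non-sleeping TLB flush */
}

/* Models a caller that changes a PTE and notifies while holding the lock. */
static void update_pte_and_notify(invalidate_range_fn cb,
				  unsigned long start, unsigned long end)
{
	assert(!ptl_held);        /* the model does not nest the lock */
	ptl_held = true;
	cb(start, end);
	ptl_held = false;
}
```

Passing good_invalidate_range to update_pte_and_notify() leaves
sleep_violations at zero, while passing bad_invalidate_range increments
it. In this model that counter plays the role of the driver bug
described above.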

- Alistair

> thanks,
