(2) break_cow_pte() can fail, which means that we can fail some
operations (possibly silently halfway through) now. For example,
looking at your change_pte_range() change, I suspect it's wrong.
Maybe I should add WARN_ON() and skip the failed COW PTE.
One way or the other we'll have to handle it. WARN_ON() sounds wrong for
handling OOM situations (e.g., if only that cgroup is OOM).
Or we should do the same thing like you mentioned:
"
For example, __split_huge_pmd() is currently not able to report a
failure. I assume that we could sleep in there. And if we're not able to
allocate any memory in there (with sleeping), maybe the process should
be zapped either way by the OOM killer.
"
But instead of zapping the process, we just skip the failed COW PTE.
I don't think the user will expect their process to be killed by
changing the protection.
(3) handle_cow_pte_fault() looks quite complicated and needs quite some
double-checking: we temporarily clear the PMD, to reset it
afterwards. I am not sure if that is correct. For example, what
stops another page fault stumbling over that pmd_none() and
allocating an empty page table? Maybe there are some locking details
missing or they are very subtle such that we better document them. I
recall that THP played quite some tricks to make such cases work ...
I think that holding mmap_write_lock may be enough (I added
mmap_assert_write_locked() in the fault function btw). But, I might
be wrong. I will look at the THP stuff to see how they work. Thanks.
Ehm, but page faults don't hold the mmap lock writable? And so are other
callers, like MADV_DONTNEED or MADV_FREE.
handle_pte_fault()->handle_pte_fault()->mmap_assert_write_locked() should
bail out.
Either I am missing something or you didn't test with lockdep enabled :)
You're right. I thought I enabled the lockdep.
And, why do I have the page fault will handle the mmap lock writable in my mind.
The page fault holds the mmap lock readable instead of writable.
;-)
I should check/test all the locks again.
Thanks.