Re: [mm 4.15-rc8] Random oopses under memory pressure.
From: Kirill A. Shutemov
Date: Thu Jan 18 2018 - 18:50:16 EST
On Thu, Jan 18, 2018 at 09:26:25AM -0800, Linus Torvalds wrote:
> On Thu, Jan 18, 2018 at 8:56 AM, Kirill A. Shutemov
> <kirill@xxxxxxxxxxxxx> wrote:
> >
> > I can't say I fully grasp how 'diff' got this value and how it leads to both
> > checks being false.
>
> I think the problem is that page difference when they are in different sections.
>
> When you do
>
> pte_page(*pvmw->pte) - pvmw->page
>
> then the compiler takes the pointer difference, and then divides by
> the size of "struct page" to get an index.
>
> But - and this is important - it does so knowing that the division it
> does will have no modulus: the two 'struct page *' pointers are really
> in the same array, and they really are 'n*sizeof(struct page)' apart
> for some 'n'.
>
> That means that the compiler can optimize the division. In fact, for
> this case, gcc will generate
>
> subl %ebx, %eax
> sarl $3, %eax
> imull $-858993459, %eax, %eax
>
> because 'struct page' is 40 bytes in size, and that magic sequence
> happens to divide by 40 (first divide by 8, then that magical "imull"
> will divide by 5 *IFF* the thing is evenly divisible by 5 (and not too
> big - but the shift guarantees that).
>
> Basically, it's a magic trick, because real divides are very
> expensive, but you can fake them more quickly if you can limit the
> input domain.
>
> But what does it mean if the two "struct page *" are not in the same
> array, and the two arrays were allocated not aligned exactly 40 bytes
> away, but some random number of pages away?
>
> You get *COMPLETE*GARBAGE* when you do the above optimized divide.
> Suddenly the divide had a modulus (because the base of the two arrays
> weren't 40-byte aligned), and the "trick" doesn't work.
>
> So that's why you can't do pointer diffs between two arrays. Not
> because you can't subtract the two pointers, but because the
> *division* part of the C pointer diff rules leads to issues.
Thanks a lot for the explanation!
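
To check my understanding, here is a minimal userspace sketch of the
trick described above. The function name exact_div40() is made up for
illustration, and the code assumes gcc-style behaviour for right shifts
of negative values and for unsigned-to-signed conversion; it is not what
the kernel or the compiler literally emits.

```c
#include <stdint.h>

/*
 * Hypothetical helper mimicking "sarl $3 ; imull $-858993459":
 * divide a byte difference by 40 (8 * 5) without a real divide.
 * The result is only meaningful when byte_diff is a multiple of 40.
 */
int32_t exact_div40(int32_t byte_diff)
{
	/* Arithmetic shift right by 3, i.e. divide by 8 (gcc behaviour). */
	int32_t q = byte_diff >> 3;

	/*
	 * 0xCCCCCCCD (-858993459 as a signed value) is the multiplicative
	 * inverse of 5 modulo 2^32: 5 * 0xCCCCCCCD == 2^34 + 1.
	 * Multiplying by it undoes a factor of 5, but only if one is there.
	 */
	return (int32_t)((uint32_t)q * 0xCCCCCCCDu);
}
```

With this, exact_div40(400) gives 10 and exact_div40(-80) gives -2, but
exact_div40(4096) does not give 102: 4096 is not a multiple of 40, so
the multiplication wraps around to an unrelated value.
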
I wonder if this may be a problem in other places?
For instance, perf uses the address of a mutex to determine the lock
ordering. See mutex_lock_double(). The mutex is embedded into struct
perf_event_context, which is allocated with kzalloc() so I don't see how
we can presume that alignment is consistent between them.
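
A rough userspace sketch of the pattern I mean, assuming I read the code
right. struct toy_ctx and order_by_lock_address() are hypothetical
stand-ins and not the perf code; the "lock" field only represents the
embedded mutex. The ordering here is decided by comparing the two
addresses.

```c
#include <stdint.h>

/* Hypothetical stand-in for a context with an embedded lock. */
struct toy_ctx {
	int id;
	int lock;	/* placeholder for the embedded mutex */
};

/*
 * Reorder the pair so that *a is the context whose lock has the
 * lower address. Callers would then always lock *a before *b,
 * whatever order the two contexts were passed in.
 */
void order_by_lock_address(struct toy_ctx **a, struct toy_ctx **b)
{
	if ((uintptr_t)&(*b)->lock < (uintptr_t)&(*a)->lock) {
		struct toy_ctx *tmp = *a;

		*a = *b;
		*b = tmp;
	}
}
```
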
I don't think it's the only example in the kernel. Are we just lucky?
--
Kirill A. Shutemov