Re: [PATCH v3 3/3] proc: add kpageidle file

From: Minchan Kim
Date: Sat May 09 2015 - 11:13:05 EST


Hello, Vladimir

On Fri, May 08, 2015 at 12:56:04PM +0300, Vladimir Davydov wrote:
> On Mon, May 04, 2015 at 07:54:59PM +0900, Minchan Kim wrote:
> > So, I guess that once the compiler optimization below happens in
> > __page_set_anon_rmap, it could corrupt page_referenced.
> >
> > __page_set_anon_rmap:
> > page->mapping = (struct address_space *) anon_vma;
> > page->mapping = (struct address_space *)((void *)page->mapping + PAGE_MAPPING_ANON);
> >
> > Because page_referenced checks it with PageAnon, which has no memory
> > barrier. So if the above compiler optimization happens, page_referenced
> > can hand the anon page to rmap_walk_file, not rmap_walk_anon. It's my
> > theory. :)
>
> FWIW
>
> If such splits were possible, we would have bugs all over the kernel
> IMO. An example is do_wp_page() vs shrink_active_list(). In do_wp_page()
> we can call page_move_anon_rmap(), which sets page->mapping in exactly
> the same fashion as the above-mentioned __page_set_anon_rmap():
>
> anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> page->mapping = (struct address_space *) anon_vma;
>
> The page in question may be on an LRU list, because nowhere in
> do_wp_page() do we remove it from the list, nor do we take any
> LRU-related locks. The page is locked, that's true, but
> shrink_active_list() calls page_referenced() on an unlocked page, so
> according to your logic the two can race, with the latter receiving a
> page whose page->mapping equals anon_vma without the PAGE_MAPPING_ANON
> bit set:
>
> CPU0                                    CPU1
> ----                                    ----
> do_wp_page                              shrink_active_list
>  lock_page                               page_referenced
>                                           PageAnon -> yes, so skip trylock_page
>  page_move_anon_rmap
>   page->mapping = anon_vma
>                                          rmap_walk
>                                           PageAnon -> no
>                                           rmap_walk_file
>                                           BUG
>   page->mapping = page->mapping + PAGE_MAPPING_ANON
>
> However, this does not happen.

Good spot.

However, the fact that it happens to work doesn't mean it's right, or
that it's okay to rely on it. Store tearing is not common and such a
race would be hard to hit, but I still want to call it a BUG.

Rik wrote the code and added this comment:

"Protected against the rmap code by the page lock"

But unfortunately, page_referenced in shrink_active_list doesn't hold
the page lock, so isn't it a bug? Rik?
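
For reference, this is the call I mean in shrink_active_list() in
mm/vmscan.c (quoted from memory, so the details may be slightly off);
the page has been isolated from the LRU, but nobody takes the page
lock around it:

        /* page is isolated from the LRU, but *not* locked here */
        if (page_referenced(page, 0, sc->target_mem_cgroup,
                            &vm_flags)) {
                ...
        }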

Please read the store tearing section in Documentation/memory-barriers.txt.
If you doubt it because the store is to aligned memory, please read this
link:

https://lkml.org/lkml/2014/7/16/262
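
To recap the example from that section (paraphrased from memory, so
please check the file itself): a single plain word-sized store like

        p = 0x00010002;

may be emitted as two 16-bit store-immediate instructions when that is
cheaper than building the 32-bit constant in a register first, so even
a naturally aligned machine-word store can reach memory in two pieces.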

Another quote, from Paul, in https://lkml.org/lkml/2015/5/1/229:
"
..
If the thing read/written does fit into a machine word and if the location
read/written is properly aligned, I would be quite surprised if either
READ_ONCE() or WRITE_ONCE() resulted in any sort of tearing.
"

I parse that as "even a machine-word store at an aligned address can
tear, and that's why WRITE_ONCE is there to prevent it".
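
The way I understand it, READ_ONCE()/WRITE_ONCE() prevent the tearing
because they boil down to volatile accesses, which the compiler must
perform exactly as written; in practice compilers do not tear an
aligned machine-word volatile access. A simplified sketch of the idea
(not the exact in-tree macros, which live in include/linux/compiler.h
and handle more cases):

        #define READ_ONCE_SKETCH(x)      (*(volatile typeof(x) *)&(x))
        #define WRITE_ONCE_SKETCH(x, v)  (*(volatile typeof(x) *)&(x) = (v))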

If you want to claim that GCC doesn't do it, please read the links below:

https://lkml.org/lkml/2015/4/16/527
http://yarchive.net/comp/linux/ACCESS_ONCE.html

Quote from Linus
"
The thing is, you can't _prove_ that the compiler won't do it, especially
if you end up changing the code later (without thinking about the fact
that you're loading things without locking).

So the rule is: if you access unlocked values, you use ACCESS_ONCE(). You
don't say "but it can't matter". Because you simply don't know.
"

Yep, I might be paranoid, but my point is that while it might work now
on most architectures, it is buggy/fragile/subtle because we cannot
prove that no arch/compiler combination will ever cause trouble. So,
instead of adding more logic on top of something fragile, please use
the right locking model. If the lock becomes a big problem due to
overhead, let's fix that (for instance, use WRITE_ONCE on the update
side and READ_ONCE on the read side), if I'm not missing something.
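
To make that concrete, here is an untested sketch of what I have in
mind, using the store from __page_set_anon_rmap()/page_move_anon_rmap()
and a lockless PageAnon-style check as the examples:

        /* update side: publish the tagged pointer with one single store */
        anon_vma = (void *)anon_vma + PAGE_MAPPING_ANON;
        WRITE_ONCE(page->mapping, (struct address_space *)anon_vma);

        /* read side: sample page->mapping once, then test the sample */
        struct address_space *mapping = READ_ONCE(page->mapping);

        if ((unsigned long)mapping & PAGE_MAPPING_ANON)
                /* anon page: rmap_walk_anon() */;
        else
                /* file page: rmap_walk_file() */;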

>
> Thanks,
> Vladimir

--
Kind regards,
Minchan Kim