Re: [PATCH 00/25] Page folios
From: David Hildenbrand
Date: Thu Dec 17 2020 - 09:38:57 EST

On 17.12.20 14:55, Matthew Wilcox wrote:
> On Thu, Dec 17, 2020 at 01:47:57PM +0100, David Hildenbrand wrote:
>> On 16.12.20 19:23, Matthew Wilcox (Oracle) wrote:
>>> One of the great things about compound pages is that when you try to
>>> do various operations on a tail page, it redirects to the head page and
>>> everything Just Works. One of the awful things is how much we pay for
>>> that simplicity. Here's an example, end_page_writeback():
>>>
>>>         if (PageReclaim(page)) {
>>>                 ClearPageReclaim(page);
>>>                 rotate_reclaimable_page(page);
>>>         }
>>>         get_page(page);
>>>         if (!test_clear_page_writeback(page))
>>>                 BUG();
>>>
>>>         smp_mb__after_atomic();
>>>         wake_up_page(page, PG_writeback);
>>>         put_page(page);
>>>
>>> That all looks very straightforward, but if you dive into the disassembly,
>>> you see that there are four calls to compound_head() in this function
>>> (PageReclaim(), ClearPageReclaim(), get_page() and put_page()). It's
>>> all for nothing, because if anyone does call this routine with a tail
>>> page, wake_up_page() will VM_BUG_ON_PGFLAGS(PageTail(page), page).
>>>
>>> I'm not really a CPU person, but I imagine there's some kind of dependency
>>> here that sucks too:
>>>
>>>     1fd7:  48 8b 57 08       mov    0x8(%rdi),%rdx
>>>     1fdb:  48 8d 42 ff       lea    -0x1(%rdx),%rax
>>>     1fdf:  83 e2 01          and    $0x1,%edx
>>>     1fe2:  48 0f 44 c7       cmove  %rdi,%rax
>>>     1fe6:  f0 80 60 02 fb    lock andb $0xfb,0x2(%rax)
>>>
>>> Sure, it's going to be cache hot, but that cmove has to execute before
>>> the lock andb.
>>>
>>> I would like to introduce a new concept that I call a Page Folio.
>>> Or just struct folio to its friends. Here it is,
>>> struct folio {
>>>         struct page page;
>>> };
>>>
>>> A folio is a struct page which is guaranteed not to be a tail page.
>>> So it's either a head page or a base (order-0) page. That means
>>> we don't have to call compound_head() on it and we save massively.
>>> end_page_writeback() reduces from four calls to compound_head() to just
>>> one (at the beginning of the function) and it shrinks from 213 bytes
>>> to 126 bytes (using distro kernel config options). I think even that one
>>> can be eliminated, but I'm going slowly at this point and taking the
>>> safe route of transforming a random struct page pointer into a struct
>>> folio pointer by calling page_folio(). By the end of this exercise,
>>> end_page_writeback() will become end_folio_writeback().
>>>
>>> This is going to be a ton of work, and massively disruptive. It'll touch
>>> every filesystem, and a good few device drivers! But I think it's worth
>>> it. Not every routine benefits as much as end_page_writeback(), but it
>>> makes everything a little better. At 29 bytes per call to lock_page(),
>>> unlock_page(), put_page() and get_page(), that's on the order of 60kB of
>>> text for allyesconfig. More when you add on all the PageFoo() calls.
>>> With the small amount of work I've done here, mm/filemap.o shrinks its
>>> text segment by over a kilobyte from 33687 to 32318 bytes (and also 192
>>> bytes of data).
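
For anyone following along, the head lookup in the disassembly above can be
modelled in userspace roughly as follows. This is only a sketch under stated
assumptions, not the kernel's code: struct model_page, model_make_compound()
and model_compound_head() are made-up names, and the layout keeps just the
two words the mov/lea/and/cmove sequence touches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for struct page: only the flags word and
 * the word that encodes "I am a tail page, my head is over there".
 */
struct model_page {
	uintptr_t flags;
	uintptr_t compound_head;	/* bit 0 set: tail page; other bits: head pointer */
};

/* Mark pages[1..nr-1] as tail pages of pages[0]. */
static void model_make_compound(struct model_page *pages, size_t nr)
{
	assert(nr >= 1);
	pages[0].compound_head = 0;
	for (size_t i = 1; i < nr; i++)
		pages[i].compound_head = (uintptr_t)&pages[0] + 1;
}

/*
 * Same shape as the quoted instructions: load the word, test bit 0, and
 * return either (word - 1) or the page itself.
 */
static struct model_page *model_compound_head(struct model_page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct model_page *)(head - 1);
	return page;
}
```

Every helper that accepts "any page" has to repeat that load-and-select
before it can touch the flags, which is where the four calls in
end_page_writeback() come from.
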
>>
>> Just wondering, as the primary motivation here is "minimizing CPU work",
>> did you run any benchmarks that revealed a visible performance improvement?
>>
>> Otherwise, we're left with a concept that's hard to grasp first (folio -
>> what?!) and "a ton of work, and massively disruptive", saving some kb of
>> code - which does not sound too appealing to me.
>>
>> (I like the idea of abstracting which pages are actually worth looking
>> at directly instead of going via a tail page - tail pages act somewhat
>> like a proxy for the head page when accessing flags)
>
> My primary motivation here isn't minimising CPU work at all. It's trying

Ah, okay, reading about disassembly gave me that impression.

> to document which interfaces are expected to operate on an entire
> compound page and which are expected to operate on a PAGE_SIZE page.
> Today, we have a horrible mishmash of
>
> - This is a head page, I shall operate on 2MB of data
> - This is a tail page, I shall operate on 2MB of data
> - This is not a head page, I shall operate on 4kB of data
> - This is a head page, I shall operate on 4kB of data
> - This is a head|tail page, I shall operate on the size of the compound page.
>
> You might say "Well, why not lead with that?", but I don't know which
> advantages people are going to find most compelling. Even if someone
> doesn't believe in the advantages of using folios in the page cache,
> looking at the assembler output is, I think, compelling.

Personally, I think the implicit documentation of which type of pages
functions expect is a clear advantage. Having less code is a nice cherry
on top.
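
To spell out what I mean by implicit documentation, here is a rough userspace
sketch. It is not kernel code: the two-word struct page, the
MODEL_PG_WRITEBACK bit and the model_* function names are assumptions made up
for illustration; only struct folio and the idea of page_folio() come from
the cover letter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily reduced struct page for illustration only. */
struct page {
	uintptr_t flags;
	uintptr_t compound_head;	/* bit 0 set: tail page; other bits: head pointer */
};

/* As in the cover letter: a page that is guaranteed not to be a tail page. */
struct folio {
	struct page page;
};

#define MODEL_PG_WRITEBACK ((uintptr_t)1 << 0)	/* made-up flag bit */

/* The only place that still has to resolve a possible tail page. */
static struct folio *model_page_folio(struct page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		page = (struct page *)(head - 1);
	return (struct folio *)page;	/* struct page is the first member */
}

/*
 * The signature documents the expectation: callers must hand in a folio,
 * so no head lookup is needed before touching the flags.
 */
static void model_end_folio_writeback(struct folio *folio)
{
	assert(!(folio->page.compound_head & 1));
	folio->page.flags &= ~MODEL_PG_WRITEBACK;
}
```

A caller holding an arbitrary struct page pointer converts once with
model_page_folio() and passes the folio on; passing a struct page pointer
directly to model_end_folio_writeback() is flagged by the compiler as an
incompatible pointer type.
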
--
Thanks,
David / dhildenb