Re: [RFC V2] mm: add the zero case to page[1].compound_nr in set_compound_order

From: Nico Pache
Date: Thu Dec 15 2022 - 16:39:55 EST


On Wed, Dec 14, 2022 at 7:48 PM Nico Pache <npache@xxxxxxxxxx> wrote:
>
> On Wed, Dec 14, 2022 at 10:04 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Tue, Dec 13, 2022 at 04:45:05PM -0700, Nico Pache wrote:
> > > Since commit 1378a5ee451a ("mm: store compound_nr as well as
> > > compound_order"), page[1].compound_nr must be explicitly set to 0 when
> > > calling set_compound_order(page, 0).
> > >
> > > This can lead to bugs if the caller of set_compound_order(page, 0) forgets
> > > to explicitly set compound_nr=0. An example of this is commit ba9c1201beaa
> > > ("mm/hugetlb: clear compound_nr before freeing gigantic pages").
> > >
> > > Collapse these calls into set_compound_order() by utilizing branchless
> > > bitmath [1].
> > >
> > > [1] https://graphics.stanford.edu/~seander/bithacks.html#ConditionalSetOrClearBitsWithoutBranching
> > >
> > > V2: slight changes to the commit log; removed the extra '//' in the comments
> >
> > We don't usually use // comments anywhere in the kernel other than
> > the SPDX header.
>
> Whoops!
>
> > > static inline void set_compound_order(struct page *page, unsigned int order)
> > > {
> > > + unsigned long shift = (1U << order);
> >
> > 'shift' is a funny name for this variable. 'order' is the shift; this is 'nr'.
>
> Good point! Waiman found an even better/cleaner solution that avoids
> the need for an extra variable:
> page[1].compound_nr = (1U << order) & ~1U;
>
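Quick aside: a userspace sketch of my own (not part of the patch) that
checks both branchless forms against the obvious branchy reference for
every order. Waiman's version works because bit 0 of 1U << order is set
only when order == 0, so masking with ~1U zeroes exactly that case:

#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned int order;

	for (order = 0; order < 32; order++) {
		unsigned int shift = 1U << order;
		/* branchy reference: 0 for order 0, 1 << order otherwise */
		unsigned int ref = order ? shift : 0;
		/* Waiman's form: clears bit 0, which is set only at order 0 */
		unsigned int waiman = (1U << order) & ~1U;
		/* the form from the patch hunk below, with explicit parens
		 * (& binds tighter than ^) */
		unsigned int bithack = shift ^ ((-order ^ shift) & shift);

		assert(waiman == ref);
		assert(bithack == ref);
	}
	printf("both branchless forms match the reference\n");
	return 0;
}
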
> > > page[1].compound_order = order;
> > > #ifdef CONFIG_64BIT
> > > - page[1].compound_nr = 1U << order;
> > > + // Branchless conditional:
> > > + // order > 0 --> compound_nr = shift
> > > + // order == 0 --> compound_nr = 0
> > > + page[1].compound_nr = shift ^ (-order ^ shift) & shift;
> >
> > Can the compiler see through this? Before, the compiler sees:
> >
> > page[1].compound_order = 0;
> > page[1].compound_nr = 1U << 0;
> > ...
> > page[1].compound_nr = 0;
> >
> > and it can eliminate the first store.
>
> This may be the case at the moment, but with:
> https://lore.kernel.org/linux-mm/20221213212053.106058-1-sidhartha.kumar@xxxxxxxxxx/
> we will have a branch instead. Sidhartha tested it and found no
> regression; the concern is that if THPs are ever implemented via this
> call path, we may end up seeing a slowdown.
>
> After doing my analysis below, I don't think this is a concern for
> the destroy case (at least on x86).
> In the destroy case, for both the branch and the branchless approach,
> the compiler optimizes away the bitmath and the branch alike, simply
> storing zero.
> In the prep case we see the introduction of a test and a cmovne
> instruction, i.e. the order check survives as a conditional move.
>
> > Now the compiler sees:
> > unsigned long shift = (1U << 0);
> > page[1].compound_order = order;
> > page[1].compound_nr = shift ^ (0 ^ shift) & shift;
> >
> > Does it do the maths at compile-time, knowing that order is 0 at this
> > callsite and deducing that it can just store a 0?
> >
> > I think it might, since shift is constant-1,
> >
> > page[1].compound_nr = 1 ^ (0 ^ 1) & 1;
> > -> page[1].compound_nr = 1 ^ 1 & 1;
> > -> page[1].compound_nr = 0 & 1;
> > -> page[1].compound_nr = 0;
> >
> > But you should run it through the compiler and check the assembly
> > output for __destroy_compound_gigantic_page().
>
> Yep it does look like it gets optimized away for the destroy case:
>
> Bitmath Case (destroy)
> ---------------------------------
> Dump of assembler code for function __update_and_free_page:
> ...
> mov %rsi,%rbp //move 2nd arg (page) to rbp
> ...
> movb $0x0,0x51(%rbp) //page[1].compound_order = 0
> movl $0x0,0x5c(%rbp) //page[1].compound_nr = 0
> ...
>
> Math for the movl offset: 0x5c (92) - 64 (sizeof(struct page)) = 28
> pahole for struct page: unsigned int compound_nr; /* 28 4 */
>
> Bitmath Case (prep)
> ---------------------------------
> In the case of prep_compound_gigantic_page() the bitmath is computed at runtime:
> 0xffffffff8134f17d <+13>: mov %rdi,%r12
> 0xffffffff8134f180 <+16>: push %rbp
> 0xffffffff8134f181 <+17>: mov $0x1,%ebp
> 0xffffffff8134f186 <+22>: shl %cl,%ebp
> 0xffffffff8134f188 <+24>: neg %ecx
> 0xffffffff8134f18a <+26>: push %rbx
> 0xffffffff8134f18b <+27>: and %ebp,%ecx
> 0xffffffff8134f18d <+29>: mov %sil,0x51(%rdi)
> 0xffffffff8134f191 <+33>: mov %ecx,0x5c(%rdi) //set page[1].compound_nr
>
> Now to break down the approach with the branch:
>
> Branch Case (destroy)
> ---------------------------------
> No branch survives; the compiler again resolves both stores to constants.
> 0xffffffff813507bc <+236>: movb $0x0,0x51(%rbp)
> 0xffffffff813507c0 <+240>: movl $0x0,0x5c(%rbp)
>
> Branch Case (prep)
> ---------------------------------
> The order check is compiled to a test plus a cmovne (conditional move)
> instruction:
> 0xffffffff8134f15d <+13>: mov %rdi,%r12
> 0xffffffff8134f160 <+16>: push %rbp
> 0xffffffff8134f161 <+17>: mov $0x1,%ebp
> 0xffffffff8134f166 <+22>: shl %cl,%ebp
> 0xffffffff8134f168 <+24>: test %esi,%esi //test
> 0xffffffff8134f16a <+26>: push %rbx
> 0xffffffff8134f16b <+27>: cmovne %ebp,%ecx //conditional select based on the test
> 0xffffffff8134f16e <+30>: mov %sil,0x51(%rdi)
> 0xffffffff8134f172 <+34>: mov %ecx,0x5c(%rdi)
>
To expand a little more on the analysis:
I computed the latency/throughput between <+24> and <+27> using
Intel's manual (Appendix D):

The bitmath solution shows a total latency of 2.5 with a throughput of 0.5.
The branch solution shows a total latency of 4 and a throughput of 1.5.

Given that this is not a tight loop, and the next instruction requires
the computed data, the better (lower) latency is what matters here.

Just wanted to add that little piece :)
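
Also, for anyone who wants to reproduce the constant folding outside a
kernel tree: a minimal userspace stand-in (fake_page is illustrative
only, not the real struct page). Building with -O2 and dumping the
assembly should show destroy_case() reduced to two constant stores,
while prep_case() keeps the shl/neg/and sequence:

struct fake_page {
	unsigned char compound_order;
	unsigned int compound_nr;
};

static inline void set_order(struct fake_page *p, unsigned int order)
{
	unsigned int shift = 1U << order;

	p->compound_order = order;
	/* branchless: compound_nr = order ? 1U << order : 0 */
	p->compound_nr = shift ^ ((-order ^ shift) & shift);
}

/* mimics __destroy_compound_gigantic_page(): order is constant 0 */
void destroy_case(struct fake_page *p)
{
	set_order(p, 0);
}

/* mimics prep_compound_gigantic_page(): order is a runtime value */
void prep_case(struct fake_page *p, unsigned int order)
{
	set_order(p, order);
}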
-- Nico

> So it looks like in the destruction of compound pages we'll see no
> gain or loss from either the bitmath or the branch approach.
> However, in the prep case we may see some performance loss once/if THP
> utilizes this path, due to the branch and the loss of instruction-level
> parallelism that the bitmath approach allows.
>
> Cheers,
> -- Nico