Re: Simplify load_unaligned_zeropad() (was Re: [GIT PULL] Ceph updates for 5.20-rc1)
From: Kirill A. Shutemov
Date: Sun Aug 14 2022 - 18:51:52 EST
On Sun, Aug 14, 2022 at 02:14:08PM -0700, Linus Torvalds wrote:
> On Sun, Aug 14, 2022 at 1:29 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > I dug into it some more, and it is really "load_unaligned_zeropad()"
> > that makes clang really uncomfortable.
> >
> > The problem ends up being that clang sees that it's inside that inner
> > loop, and tries very hard to optimize the shift-and-mask that happens
> > if the exception happens.
> >
> > The fact that the exception *never* happens unless DEBUG_PAGEALLOC is
> > enabled - and very very seldom even then - is not something we can
> > really explain to clang.
>
> Hmm.
>
> The solution to that is actually to move the 'zeropad' part into the
> exception handler.
>
> That makes both gcc and clang generate quite nice code for this all.
> But perhaps equally importantly, it actually simplifies the code
> noticeably:
>
> arch/x86/include/asm/extable_fixup_types.h | 2 ++
> arch/x86/include/asm/word-at-a-time.h | 50 +++---------------------------
> arch/x86/mm/extable.c | 30 ++++++++++++++++++
> 3 files changed, 37 insertions(+), 45 deletions(-)
>
> and that's with 11 of those 37 new lines being a new block comment.
>
> It's mainly because now there's no need to worry about
> CONFIG_CC_HAS_ASM_GOTO_OUTPUT in load_unaligned_zeropad(), because the
> asm becomes a simple "address in, data out" thing.
>
> And to make the fixup code simple and straightforward, I just made
> "load_unaligned_zeropad()" use fixed registers for address and output,
> which doesn't bother the compilers at all, they'll happily adjust
> their register allocation. The code generation ends up much better
> than with the goto and the subtle address games that never trigger
> anyway.
>
> PeterZ - you've touched both the load_unaligned_zeropad() and the
> exception code last, so let's run this past you, but this really does
> seem to not only fix the code generation issue in fs/dcache.s, it just
> looks simpler too. Comments?
A bit of off-topic, but..
Recently, I dealt with few[1][2] issues in TDX guest that happens due to
unwanted loads generated by load_unaligned_zeropad(). They are tricky to
follow and I believe that most of developers are not aware of such sneaky
accesses (I was not, until stepped on it).
Do we gain enough benefit from the microoptimization to justify existing
load_unaligned_zeropad()? The helper has rather confusing side-effects.
[1] 1e7769653b06 ("x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page")
[2] https://lore.kernel.org/lkml/20220614120231.48165-11-kirill.shutemov@xxxxxxxxxxxxxxx/
--
Kiryl Shutsemau / Kirill A. Shutemov