Re: [v3 0/9] parallelized "struct page" zeroing

From: Michal Hocko
Date: Thu Jun 01 2017 - 04:46:17 EST


On Wed 31-05-17 23:35:48, Pasha Tatashin wrote:
> >OK, so why cannot we make zero_struct_page 8x 8B stores, other arches
> >would do memset. You said it would be slower but would that be
> >measurable? I am sorry to be so persistent here but I would be really
> >happier if this didn't depend on the deferred initialization. If this is
> >absolutely a no-go then I can live with that of course.
>
> Hi Michal,
>
> This is actually a very good idea. I just did some measurements, and it
> looks like performance is very good.
>
> Here is data from SPARC-M7 with 3312G memory with single thread performance:
>
> Current:
> memset() in memblock allocator takes: 8.83s
> __init_single_page() takes: 8.63s
>
> Option 1:
> memset() in __init_single_page() takes: 61.09s (as we discussed because of
> membar overhead, memset should really be optimized to do STBI only when size
> is 1 page or bigger).
>
> Option 2:
>
> 8 stores (stx) in __init_single_page(): 8.525s!
>
> So, even for single thread performance we can double the initialization
> speed of "struct page" on SPARC by removing memset() from memblock, and
> using 8 stx in __init_single_page(). It appears we never miss L1 in
> __init_single_page() after the initial 8 stx.

OK, that is good to hear and it actually matches my understanding that
writes to a single cacheline should not add an overhead.
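
To make the "8x 8B stores" idea concrete, here is a rough user-space
sketch. It is only an illustration: struct fake_page, zero_fake_page() and
zero_fake_page_memset() are hypothetical names, and the 64-byte layout is an
assumption taken from the "8 stores" figure in this thread, not the real
struct page definition. A compiler is also free to merge the stores or turn
them into a memset call, so the real thing would need checking on each arch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-byte stand-in for struct page; the real layout differs. */
struct fake_page {
	uint64_t words[8];
};

static_assert(sizeof(struct fake_page) == 64,
	      "sketch assumes a 64-byte structure");

/*
 * Zero the structure with eight explicit 8-byte stores. All eight stores
 * cover one contiguous 64-byte region, which is the case discussed above
 * for the "8 stx" variant.
 */
static inline void zero_fake_page(struct fake_page *p)
{
	uint64_t *w = p->words;

	w[0] = 0;
	w[1] = 0;
	w[2] = 0;
	w[3] = 0;
	w[4] = 0;
	w[5] = 0;
	w[6] = 0;
	w[7] = 0;
}

/* Generic fallback: what the other arches would do in this proposal. */
static inline void zero_fake_page_memset(struct fake_page *p)
{
	memset(p, 0, sizeof(*p));
}
```

An arch would then pick one of the two variants at build time, and the
caller in the page init path would not need to care which one it got.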

Thanks!
--
Michal Hocko
SUSE Labs