RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact from HW prefetch

From: Ma, Ling
Date: Fri Jul 01 2011 - 06:26:45 EST


Sorry, the earlier copy_page_c results were incorrect: they were measured with movsb, not movsq.

Updated results:
(the benchmark is not very accurate, but it can tell us which is faster)

1. We copy 4096 bytes 32 times on SNB and take the minimum execution time.

Hot-cache case (cycles):
copy_page                                                437
copy_page_c                                              226
copy_page_sse2 without prefetch (128-bit write/cycle)    183
copy_page_sse2 with prefetch (128-bit write/cycle)       208
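
For reference, the "run 32 times and keep the minimum" measurement can be sketched roughly as below. This is only an illustration, not the harness used for the numbers above: it times with clock_gettime() in nanoseconds as a portable stand-in for a cycle counter, and copy_fn, copy_page_memcpy, now_ns and min_copy_ns are hypothetical names made up for the sketch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 4096
#define RUNS       32

/* Hypothetical signature for the page-copy routine under test. */
typedef void (*copy_fn)(void *dst, const void *src);

/* Stand-in routine: a plain memcpy of one page (not the kernel copy_page). */
static void copy_page_memcpy(void *dst, const void *src)
{
    memcpy(dst, src, PAGE_BYTES);
}

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call fn `runs` times and return the smallest elapsed time seen. */
static uint64_t min_copy_ns(copy_fn fn, void *dst, const void *src, int runs)
{
    uint64_t best = UINT64_MAX;

    assert(runs > 0);
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = now_ns();
        fn(dst, src);
        uint64_t t1 = now_ns();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Taking the minimum rather than the mean filters out runs disturbed by interrupts or other noise, which is why it suits a hot-cache comparison.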


2. The same routine as in the hot-cache case, but before each execution we copy
512k of data to push the original data out of L1 & L2.

Cold-cache case (cycles):
copy_page (with prefetch)                                688~713
copy_page (without prefetch)                             847~860
copy_page_c                                              636~648
copy_page_sse2 without prefetch (128-bit write/cycle)    661~673
copy_page_sse2 with prefetch (128-bit write/cycle)       609~615
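
The cold-cache setup can be sketched as below: copy 512k between two scratch buffers immediately before each timed run, so the page being measured is no longer in L1/L2. This is a rough illustration, not the actual test code. It assumes "512k" means 512 KiB, and evict_caches, scratch_src and scratch_dst are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define EVICT_BYTES (512 * 1024)   /* assumed: 512 KiB */

/* Scratch buffers used only to displace earlier data from the caches. */
static unsigned char scratch_src[EVICT_BYTES];
static unsigned char scratch_dst[EVICT_BYTES];

/*
 * Copy 512 KiB so that previously cached lines are pushed out of L1
 * and L2. The final byte is read back through a volatile pointer,
 * which is meant to discourage the compiler from dropping the copy.
 */
static unsigned char evict_caches(void)
{
    memcpy(scratch_dst, scratch_src, EVICT_BYTES);
    return *(volatile unsigned char *)&scratch_dst[EVICT_BYTES - 1];
}
```

Calling evict_caches() before every timed page copy, rather than once before the whole loop, keeps each of the runs cold.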

To answer Ingo's question: copy_page_c is always faster than copy_page at copying a page.
However, copy_page_c doesn't use prefetch in the cold-cache case; prefetch is applied according to the copy size.
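
To illustrate what "128-bit writes with prefetch" means for the copy_page_sse2 rows, here is a rough C sketch using SSE2 intrinsics. It is not the routine that was measured: the function name copy_page_sse2_sketch, the 64-byte unrolling and the 256-byte prefetch distance are all assumptions made for the example, and it falls back to memcpy when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_prefetch */
#endif

#define PAGE_BYTES     4096
#define PREFETCH_AHEAD 256   /* assumed prefetch distance, not tuned */

/* Copy one page; dst and src must be 16-byte aligned. */
static void copy_page_sse2_sketch(void *dst, const void *src)
{
#if defined(__SSE2__)
    const char *s = (const char *)src;
    char *d = (char *)dst;

    for (size_t off = 0; off < PAGE_BYTES; off += 64) {
        /* Prefetch a later source line, staying inside the page. */
        if (off + PREFETCH_AHEAD < PAGE_BYTES)
            _mm_prefetch(s + off + PREFETCH_AHEAD, _MM_HINT_T0);

        /* Four 128-bit loads followed by four 128-bit stores. */
        __m128i r0 = _mm_load_si128((const __m128i *)(s + off));
        __m128i r1 = _mm_load_si128((const __m128i *)(s + off + 16));
        __m128i r2 = _mm_load_si128((const __m128i *)(s + off + 32));
        __m128i r3 = _mm_load_si128((const __m128i *)(s + off + 48));
        _mm_store_si128((__m128i *)(d + off), r0);
        _mm_store_si128((__m128i *)(d + off + 16), r1);
        _mm_store_si128((__m128i *)(d + off + 32), r2);
        _mm_store_si128((__m128i *)(d + off + 48), r3);
    }
#else
    /* No SSE2 at compile time: fall back to a plain copy. */
    memcpy(dst, src, PAGE_BYTES);
#endif
}
```

Dropping the _mm_prefetch line gives the "without prefetch" variant of the same loop.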

Thanks
Ling




> -----Original Message-----
> From: Ma, Ling
> Sent: Friday, July 01, 2011 4:11 PM
> To: Ma, Ling; 'Ingo Molnar'; 'Andi Kleen'
> Cc: 'hpa@xxxxxxxxx'; 'tglx@xxxxxxxxxxxxx'; 'linux-kernel@xxxxxxxxxxxxxxx'
> Subject: RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact
> from HW prefetch
>
> Forgot to append the experiment data:
>
> 1. We copy 4096 bytes 32 times on SNB and take the minimum
> execution time.
> Hot-cache case:
> copy_page                       482 cycles
> copy_page_c                     350 cycles
>
> 2. The same routine as in the hot-cache case, but before each execution
> we copy 512k of data to push the original data out of L1 & L2.
> Cold-cache case:
> copy_page (with prefetch)       853~873 cycles
> copy_page (without prefetch)    1037~1051 cycles
> copy_page_c                     959~976 cycles
>
> Thanks
> Ling
>
> > -----Original Message-----
> > From: Ma, Ling
> > Sent: Tuesday, June 28, 2011 11:24 PM
> > To: 'Ingo Molnar'; Andi Kleen
> > Cc: hpa@xxxxxxxxx; tglx@xxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> > Subject: RE: [PATCH RFC] [x86] Optimize copy-page by reducing impact
> > from HW prefetch
> >
> > Hi Ingo
> >
> > > Ling, mind double checking which one is the faster/better one on SNB,
> > > in cold-cache and hot-cache situations, copy_page or copy_page_c?
> > copy_page_c.
> > In the hot-cache case, copy_page_c on SNB combines data into 128-bit
> > writes (the processor limit is 128 bits/cycle for writes) after the
> > startup latency, so it is faster than copy_page, which provides
> > 64 bits/cycle for writes.
> >
> > In the cold-cache case, copy_page_c doesn't use prefetch (prefetch is
> > applied according to the copy size), so the copy_page function is better.
> >
> > Thanks
> > Ling

Attachment: snb_info
Description: snb_info