RE: [RFC PATCH v2 0/5] Add hardware prefetch driver for A64FX and Intel processors
From: tarumizu.kohei@xxxxxxxxxxx
Date: Sun Nov 07 2021 - 21:25:06 EST
Hi,
Thanks for your comment.
> This is all fine and dandy but what I'm missing in this pile of text - at least I couldn't
> find it - is why do we need this in the upstream kernel?
>
> Is there some real-life use case that would benefit from software fiddling with
> prefetchers or is this one of those, well, we have those controls, lets expose them
> in the OS?
>
> IOW, you need to sell this stuff properly first - then talk design.
A64FX and some Intel processors have implementation-dependent registers
for controlling hardware prefetch. Intel has MSR_MISC_FEATURE_CONTROL,
and A64FX has IMP_PF_STREAM_DETECT_CTRL_EL0. These registers cannot be
accessed from userspace, so we provide a proper kernel interface.
The advantage of exposing this interface to userspace is that
applications can tune hardware prefetch to their workload, and we can
expect performance improvements from doing so.
The following performance improvements have been reported for some
Intel processors.
https://github.com/xmrig/xmrig/issues/1433#issuecomment-572126184
On A64FX, several applications have also shown actual performance
improvements. In most of these cases, we tune the hardware prefetch
distance parameter. One of them is the STREAM benchmark.
For reference, here is the result of STREAM Triad when tuning with
the dist attribute file in L1 and L2 cache on A64FX.
| dist combination  | Pattern A   | Pattern B   |
|-------------------|-------------|-------------|
| L1:256,  L2:1024  | 234505.2144 | 114600.0801 |
| L1:1536, L2:1024  | 279172.8742 | 118979.4542 |
| L1:256,  L2:10240 | 247716.7757 | 127364.1533 |
| L1:1536, L2:10240 | 283675.6625 | 125950.6847 |
In pattern A, we set the size of the array to 174720, which is about
half the size of the L1d cache. In pattern B, we set the size of the
array to 10485120, which is about twice the size of the L2 cache.
In pattern A, a change of dist at L1 has a larger effect. On the other
hand, in pattern B, the change of dist at L2 has a larger effect.
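For readers not familiar with the benchmark, the Triad kernel is
roughly the loop below. This is only a minimal sketch to show the access
pattern (three sequential streams, which is why prefetch distance
matters); it is not the code used for the measurements above, and the
names triad() and triad_bytes() are illustrative.

```c
#include <stddef.h>

/* STREAM Triad access pattern: a[i] = b[i] + scalar * c[i]. */
void triad(double *a, const double *b, const double *c,
	   double scalar, size_t n)
{
	for (size_t i = 0; i < n; i++)
		a[i] = b[i] + scalar * c[i];
}

/* Bytes touched per pass: two loads and one store of a double per element. */
size_t triad_bytes(size_t n)
{
	return 3 * sizeof(double) * n;
}
```

The reported rate is derived from the bytes touched per pass divided by
the elapsed time of the loop.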
As described above, the optimal dist combination depends on the
characteristics of the application. Therefore, such a sysfs interface
is useful for performance tuning.
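As an illustration of how a tuning tool might use such an attribute
file, here is a minimal sketch. It assumes only that the dist attribute
accepts a decimal value written as text; write_attr() is a hypothetical
helper, and the path of the attribute file is left to the caller because
it is not spelled out in this mail.

```c
#include <stdio.h>

/*
 * Write a decimal value to a sysfs-style attribute file.
 * Returns 0 on success, -1 on failure (hypothetical helper).
 */
int write_attr(const char *path, unsigned long value)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;

	int ok = fprintf(f, "%lu\n", value) > 0;

	/* fclose() flushes the buffer, so a rejected write shows up here. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A tool would call this once per cache level before starting the
workload, for example with the values 1536 and 10240 from the table
above, and check the return value in case the kernel rejects the
setting.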
For these reasons, we would like to add this interface to the
upstream kernel.
> I'm not sure about a wholly separate drivers/hwpf/ - it's not like there are
> gazillion different hw prefetch drivers.
We created a new directory to gather the multiple separate files in one
place. We are not convinced that this is the best approach. If there is
a more suitable location, we would like to change it.