Re: [PATCH] efifb: allow user to disable write combined mapping.
From: Dave Airlie
Date: Tue Jul 18 2017 - 16:45:00 EST
On 19 July 2017 at 05:57, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Jul 18, 2017 at 7:34 AM, Peter Jones <pjones@xxxxxxxxxx> wrote:
>>
>> Well, that's kind of amazing, given 3c004b4f7eab239e switched us /to/
>> using ioremap_wc() for the exact same reason. I'm not against letting
>> the user force one way or the other if it helps, though it sure would be
>> nice to know why.
>
> It's kind of amazing for another reason too: how is ioremap_wc()
> _possibly_ slower than ioremap_nocache() (which is what plain
> ioremap() is)?
In normal operation the console is faster with _wc. It's the side effects
on other cores that are the problem.
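For reference, the difference between the two is just the caching
attribute the framebuffer mapping ends up with. A minimal sketch of the
choice (assuming the usual <linux/io.h> interfaces; this is not the
actual efifb code):

#include <linux/io.h>
#include <linux/types.h>

/* Sketch only: on x86, ioremap() gives an uncached (UC) mapping,
 * while ioremap_wc() asks PAT for a write-combining (WC) one. */
static void __iomem *map_fb(resource_size_t base, unsigned long size, bool wc)
{
	if (wc)
		return ioremap_wc(base, size);	/* writes may be buffered and combined */
	return ioremap(base, size);		/* every write goes straight out */
}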
> Or maybe it really is something where there is one global write queue
> per die (not per CPU), and having that write queue "active" doing
> combining will slow down every core due to some crazy synchronization
> issue?
>
> x86 people, look at what Dave Airlie did, I'll just repeat it because
> it sounds so crazy:
>
>> A customer noticed major slowdowns of other tasks while logging to
>> the console with write combining enabled: a 10x or greater slowdown
>> on all other cores of the same CPU package as the one doing the
>> logging.
>>
>> I reproduced this on a machine with dual CPUs.
>> Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 core)
>>
>> I wrote a test that just mmaps the PCI BAR and writes to it in
>> a loop. While this was running in the background on a single
>> core (taskset -c 1), building a kernel up to init/version.o
>> (taskset -c 8) went from 13s to 133s or so. I've yet to explain
>> why this occurs or what is going wrong; I haven't managed to find
>> a perf command that gives any insight into this.
>
> So basically the UC vs WC thing seems to slow down somebody *else* (in
> this case a kernel compile) on another core entirely, by a factor of
> 10x. Maybe the WC writer itself is much faster, but _others_ are
> slowed down enormously.
>
> Whaa? That just seems incredible.
Yes, I've been staring at this for a while now trying to narrow it
down. I've been slow on testing it across a wider range of Intel CPUs;
so far I've only been able to experiment on that particular machine.

I've attached the two test files; compile both of them (I just used
"make write_resource burn-cycles").
On my test machine, cores 1 and 8 are on the same die.

    time taskset -c 1 ./burn-cycles

takes about 6 seconds on its own. With

    taskset -c 8 ./write_resource wc

running in the background, the same burn-cycles run takes about 1 minute.
Now I've noticed that running write_resource with or without the "wc"
argument doesn't seem to make a difference, so I think what matters is
that efifb has already mapped the memory area with _wc and set up PAT
for WC on it, and we always get WC on that BAR.
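One way to check which attribute the BAR actually ends up with is to
dump the kernel's PAT memtype tracking. This assumes CONFIG_X86_PAT and
debugfs mounted at /sys/kernel/debug; it's only a file read, shown in C
to match the attached tests:

#include <stdio.h>

int main(void)
{
	/* Each line lists a tracked physical range and its memtype,
	 * e.g. "write-combining" or "uncached-minus". */
	FILE *f = fopen("/sys/kernel/debug/x86/pat_memtype_list", "r");
	char line[256];

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}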
From the other person seeing it:
"I done a similar test some time ago, the result was the same.
I ran some benchmarks, and it seems that when data set fits in L1
cache there is no significant performance degradation."
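That matches the burn-cycles test attached below: its working set is
1024*1024 32-bit stores = 4 MiB, far bigger than a typical 32 KiB L1D.
A hypothetical variant shrunk to stay L1-resident (same total number of
stores, to test that observation) would be:

#include <stdint.h>

#define SIZE (4*1024)	/* 16 KiB working set: fits in a 32 KiB L1D */

int main(void)
{
	volatile int i, j;
	static volatile uint32_t x[SIZE];

	/* Same write loop as burn-cycles, with more iterations so the
	 * total store count is comparable, but the data stays in L1. */
	for (j = 0; j < 256*1000; j++)
		for (i = 0; i < SIZE; i++)
			x[i] = 1;
	return 0;
}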
Dave.
/* write_resource.c: mmap a PCI BAR through sysfs and write to it in a
 * tight loop. Pass "wc" to use the write-combined resource0_wc file.
 * The BAR path is specific to my test machine. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
	int i, j;
	char *resname;

	if (argc > 1 && !strcmp(argv[1], "wc"))
		resname = "/sys/bus/pci/devices/0000:01:00.1/resource0_wc";
	else
		resname = "/sys/bus/pci/devices/0000:01:00.1/resource0";

	int fd = open(resname, O_RDWR);
	if (fd == -1)
		return 1;

	void *ptr = mmap(NULL, 64*1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED)
		return 1;

	/* Hammer the whole 64k mapping with 32-bit writes. */
	volatile uint32_t *uptr = ptr;
	for (j = 0; j < 1024*1024; j++)
		for (i = 0; i < 16*1024; i++)
			uptr[i] = 0;

	munmap(ptr, 64*1024);
	close(fd);
	return 0;
}
/* burn-cycles.c: repeatedly write a 4 MiB array, a working set much
 * larger than L1, so the run time is sensitive to memory behaviour. */
#include <stdint.h>

#define SIZE (1024*1024)

int main(void)
{
	volatile int i, j;
	/* static: a 4 MiB array is too large to rely on the stack */
	static volatile uint32_t x[SIZE];

	for (j = 0; j < 1000; j++)
		for (i = 0; i < SIZE; i++)
			x[i] = 1;
	return 0;
}