Re: Prezeroing V2 [0/3]: Why and When it works

From: Paul Mackerras
Date: Thu Dec 23 2004 - 18:05:08 EST

Next message: Bruce Allan: "[PATCH] [resend] VFS locking errors on max offset edge cases"
Previous message: Andrew Morton: "Re: [PATCH] AB-BA deadlock between uidhash_lock and tasklist_lock."
In reply to: Andrew Morton: "Re: Prezeroing V2 [0/3]: Why and When it works"
Next in thread: Linus Torvalds: "Re: Prezeroing V2 [0/3]: Why and When it works"

Andrew Morton writes:

> When the workload is a gcc run, the pagefault handler dominates the system
> time. That's the page zeroing.

For a program which uses a lot of heap and doesn't fork, that sounds
reasonable.

> x86's movnta instructions provide a way of initialising memory without
> trashing the caches and it has pretty good bandwidth, I believe. We should
> wire that up to these patches and see if it speeds things up.

Yes. I don't know the movnta instruction, but surely, whatever scheme
is used, there has to be a snoop for every cache line's worth of
memory that is zeroed.

The other point is that having the page hot in the cache may well be a
benefit to the program. Using any sort of cache-bypassing zeroing
might not actually make things faster, when the user time as well as
the system time is taken into account.
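To make the trade-off concrete, here is a minimal sketch of a
cache-bypassing page clear on x86. It assumes that the instructions
Andrew has in mind are the SSE2 non-temporal stores (reached here
through the _mm_stream_si128 intrinsic); that reading, the 4kB page
size and the name clear_page_nocache_sketch are assumptions for
illustration, not anything taken from the patches under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define PAGE_BYTES 4096u  /* assumption: 4kB pages */

/* Zero one page with stores that are hinted to bypass the caches. */
static void clear_page_nocache_sketch(void *page)
{
    /* Non-temporal 16-byte stores need a 16-byte aligned address. */
    assert(((uintptr_t)page % 16) == 0);
#if defined(__SSE2__)
    __m128i zero = _mm_setzero_si128();
    __m128i *p = page;
    size_t i;

    for (i = 0; i < PAGE_BYTES / sizeof(__m128i); i++)
        _mm_stream_si128(p + i, zero);  /* non-temporal store */
    _mm_sfence();  /* order the streamed stores before later stores */
#else
    /* Portable stand-in when SSE2 is unavailable: ordinary stores. */
    memset(page, 0, PAGE_BYTES);
#endif
}
```

After such a clear the page is zeroed but not cache-hot, so the
program's first touches are likely to miss. That is the user-time cost
referred to above.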

> > I did some measurements once on my G5 powermac (running a ppc64 linux
> > kernel) of how long clear_page takes, and it only takes 96ns for a 4kB
> > page.
>
> 40GB/s. Is that straight into L1 or does the measurement include writeback?

It is the average elapsed time in clear_page, so it would include the
writeback of any cache lines displaced by the zeroing, but not the
writeback of the newly-zeroed cache lines (which we hope will be
modified by the program before they get written back anyway).
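The bandwidth figure in the quoted reply follows directly from the
measurement. A small sketch of the arithmetic (the function name is
made up for illustration):

```c
#include <assert.h>

/*
 * Bytes per nanosecond is numerically the same as GB/s
 * (taking 1 GB = 10^9 bytes).
 */
static double clear_bandwidth_gb_per_s(double bytes, double elapsed_ns)
{
    assert(elapsed_ns > 0.0);
    return bytes / elapsed_ns;
}
```

clear_bandwidth_gb_per_s(4096.0, 96.0) comes to roughly 42.7, which is
where the rounded 40GB/s comes from.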

This is using the dcbz (data cache block zero) instruction, which
establishes a cache line in modified state with zero contents without
any memory traffic other than a cache line kill transaction sent to
the other CPUs and possible writeback of a dirty cache line displaced
by the newly-zeroed cache line. The new cache line is established in
the L2 cache, because the L1 is write-through on the G5, and all
stores and dcbz instructions have to go to the L2 cache.
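A rough sketch of what a dcbz-based page clear looks like, for
readers who have not seen it. This is not the kernel's clear_page; it
assumes 4kB pages and 128-byte cache lines (as on the G5 and POWER4),
and the dcbz path is only compiled when the hypothetical USE_DCBZ
macro is defined on a 64-bit PowerPC build. On other PowerPC parts
the line size may differ, in which case the stride would be wrong.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096u  /* assumption: 4kB pages */
#define LINE_BYTES 128u   /* assumption: 128-byte cache lines */

/* Zero one page, one cache line at a time. */
static void clear_page_sketch(void *page)
{
    unsigned char *p = page;
    size_t off;

    /* Each step must cover exactly one cache line. */
    assert(((uintptr_t)page % LINE_BYTES) == 0);

    for (off = 0; off < PAGE_BYTES; off += LINE_BYTES) {
#if defined(USE_DCBZ) && defined(__powerpc64__)
        /* dcbz zeroes the whole cache line containing this address
         * without first reading the line from memory. */
        __asm__ volatile("dcbz 0,%0" : : "r"(p + off) : "memory");
#else
        /* Portable stand-in: ordinary stores, one line's worth. */
        memset(p + off, 0, LINE_BYTES);
#endif
    }
}
```

With these assumptions the loop issues 32 dcbz instructions per page,
one per cache line.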

Thus, on the G5 (and POWER4, which is similar) I don't think there
will be much if any benefit from having pre-zeroed cache-cold pages.
We can establish the zero lines in cache much faster using dcbz than
we can by reading them in from main memory. If the program uses only
a few cache lines out of each new page, then reading just those lines
of a pre-zeroed page from memory might be faster, but that seems
unlikely.
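The break-even point can be put as a back-of-the-envelope model. The
sketch below is only that: it treats every touched line of a
pre-zeroed cold page as one full memory miss, ignores overlapping
misses and prefetching, and the miss latency is a parameter the
reader has to supply, since no such figure was measured here.

```c
#include <assert.h>

/*
 * Number of touched cache lines at which missing on a pre-zeroed,
 * cache-cold page costs as much as zeroing the whole page in cache
 * up front. Below this count the cold page wins; above it, zeroing
 * in cache wins.
 */
static double break_even_lines(double clear_page_ns, double miss_ns)
{
    assert(miss_ns > 0.0);
    return clear_page_ns / miss_ns;
}
```

With the 96ns clear_page time and a purely hypothetical 100ns miss
latency, break_even_lines(96.0, 100.0) is just under one line, so in
this simple model the program would have to touch almost nothing in
the page for the pre-zeroed page to come out ahead.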

Paul.
