Re: [patch 8/8] Immediate Values - Documentation

From: Randy Dunlap
Date: Thu Sep 06 2007 - 17:22:20 EST


On Thu, 06 Sep 2007 16:02:36 -0400 Mathieu Desnoyers wrote:

> Documentation/immediate.txt | 232 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 232 insertions(+)
>
> Index: linux-2.6-lttng/Documentation/immediate.txt
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6-lttng/Documentation/immediate.txt 2007-08-20 15:55:26.000000000 -0400
> @@ -0,0 +1,232 @@
> + Using the Immediate Values
> +
> + Mathieu Desnoyers
> +
> +
> +This document introduces Immediate Values and their use.
> +
> +* Purpose of immediate values
> +
> +An immediate value is used to compile into the kernel variables that sits within

s/sits/sit/

> +the instruction stream. They are meant to be rarely updated but read often.
> +Using immediate values for these variables will save cache lines.
> +
> +This infrastructure is specialized in supporting dynamic patching of the values
> +in the instruction stream when multiple CPUs are running without disturbing the
> +normal system behavior.
> +
> +Compiling code meant to be rarely enabled at runtime can be done using
> +immediate_if() as condition surrounding the code.
> +
> +* Usage
> +
> +In order to use the macro immediate, you should include linux/immediate.h.

"immediate" macros,

> +#include <linux/immediate.h>
> +
> +immediate_char_t this_immediate;
> +EXPORT_SYMBOL(this_immediate);
> +
> +
> +Add, in your code :

And, (?)

> +Use immediate_set(&this_immediate) to set the immediate value.
> +
> +Use immediate_read(&this_immediate) to read the immediate value.
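
It might help to pull the pieces above together into one snippet. A minimal
sketch, using only the call forms quoted verbatim (note the text never shows
how the new value is actually passed to immediate_set() -- worth spelling
that out):

#include <linux/immediate.h>

immediate_char_t this_immediate;
EXPORT_SYMBOL(this_immediate);

/* slow path: rare update, patches every copy in the instruction stream */
void slow_path_update(void)
{
	immediate_set(&this_immediate);
}

/* fast path: the value is read from the instruction stream itself,
 * so no data cache line is touched for it */
char fast_path_read(void)
{
	return immediate_read(&this_immediate);
}
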
> +
> +The immediate mechanism supports inserting multiple instances of the same
> +immediate. Immediate values can be put in inline functions, inlined static
> +functions, and unrolled loops.
> +
> +If you have to read the immediate values from a function declared as __init or
> +__exit, you should explicitly use _immediate_read(), which will fall back on a
> +global variable read. Failing to do so will leave a reference to the __init
> +section after it is freed (it would generate a modpost warning).
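
Maybe show the __init case explicitly; a sketch, assuming _immediate_read()
takes the same address argument as immediate_read():

static int __init my_init(void)
{
	/* _immediate_read() falls back on a plain global variable read, so no
	 * reference into this (soon to be freed) __init section is left in the
	 * instruction-stream patching tables, and modpost stays quiet. */
	if (_immediate_read(&this_immediate))
		printk(KERN_INFO "feature already enabled at init time\n");
	return 0;
}
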
> +
> +The prefered idiom to dynamically enable compiled-in code is to use

preferred

> +immediate_if (&this_immediate), which may eventually use gcc improvements to
> +provide a jump instruction patching based condition instead of a immediate value

of an

> +feeding a conditional jump. You should use _immediate_if () instead of
> +immediate_if () in functions marked __init or __exit.
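
An example of the immediate_if() idiom would be useful here too. A sketch,
assuming it is used exactly like a plain if () around the dormant code
(do_extra_work() is just a placeholder):

void hot_path(void)
{
	/* ... frequent work ... */

	immediate_if (&this_immediate) {
		/* compiled in, but dormant until the immediate value is set;
		 * may later become a patched jump rather than a test */
		do_extra_work();
	}
}
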
> +
> +immediate_set_early() should be used only at early kernel boot time, before SMP
> +is activated.

More explanation of immediate_set_early() would be good, such as
What? Why? How?
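
I would guess it mirrors immediate_set() and is meant for code running from
start_kernel() before the secondary CPUs are brought up, when the code patching
needs no cross-CPU synchronization yet -- something like the sketch below --
but the document should say so explicitly (this is only a guess):

void __init early_setup(void)
{
	/* safe only while the boot CPU is the sole CPU running */
	immediate_set_early(&this_immediate);
}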

> +
> +If you need to declare your own immediate types (for instance, a pointer to
> +struct task_struct), use:
> +
> +DEFINE_IMMEDIATE_TYPE(struct task_struct*, immediate_task_struct_ptr_t);
> +
> +and declare your variable with:
> +immediate_task_struct_ptr_t myptr;
> +
> +You can choose to set an initial static value to the immediate by using, for
> +instance:
> +
> +immediate_task_struct_ptr_t myptr = IMMEDIATE_INIT(10);
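
Perhaps also show the custom type being used; a sketch, assuming the same
immediate_read() call applies to user-defined immediate types (tracked_task
and its use are made up for the example):

DEFINE_IMMEDIATE_TYPE(struct task_struct *, immediate_task_struct_ptr_t);

static immediate_task_struct_ptr_t tracked_task;

void note_if_tracked(struct task_struct *t)
{
	/* pointer compare against the value embedded in the instruction stream */
	if (immediate_read(&tracked_task) == t)
		printk(KERN_DEBUG "hit the tracked task\n");
}
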
> +
> +
> +* Optimization for a given architecture
> +
> +One can implement optimized immediate values for a given architecture by
> +replacing asm-$ARCH/immediate.h.
> +
> +* Performance improvement
> +
> +* Memory hit for a data-based branch
> +
> +Here are the results on a 3GHz Pentium 4:
> +
> +number of tests : 100
> +number of branches per test : 100000
> +memory hit cycles per iteration (mean) : 636.611
> +L1 cache hit cycles per iteration (mean) : 89.6413
> +instruction stream based test, cycles per iteration (mean) : 85.3438
> +Just getting the pointer from a modulo on a pseudo-random value, doing
> + noting with it, cycles per iteration (mean) : 77.5044

nothing

> +
> +So:
> +Base case: 77.50 cycles
> +instruction stream based test: +7.8394 cycles
> +L1 cache hit based test: +12.1369 cycles
> +Memory load based test: +559.1066 cycles
> +
> +So let's say we have a ping flood coming at
> +(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
> +7674 packets per second. If we put 2 markers for irq entry/exit, it
> +brings us to 15348 markers sites executed per second.
> +
> +(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
> +We therefore have a 0.29% slowdown just on this case.
> +
> +Compared to this, the instruction stream based test will cause a
> +slowdown of:
> +
> +(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
> +For a 0.004% slowdown.
> +
> +If we plan to use this for memory allocation, spinlock, and all sort of
> +very high event rate tracing, we can assume it will execute 10 to 100
> +times more sites per second, which brings us to 0.4% slowdown with the
> +instruction stream based test compared to 29% slowdown with the memory
> +load based test on a system with high memory pressure.
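
It may be worth stating the formula once: slowdown = (sites/s) * (cycles/site)
/ (CPU cycles/s). A trivial userspace sketch reproducing the two figures above:

#include <stdio.h>

int main(void)
{
	double cpu_hz = 3e9;			/* 3GHz Pentium 4 */
	double sites_per_sec = 15348.0;		/* 2 markers * 7674 pings/s */

	/* memory load based: ~0.0029, i.e. ~0.29% */
	printf("%.4f\n", sites_per_sec * 559.0 / cpu_hz);
	/* instruction stream based: ~0.00004, i.e. ~0.004% */
	printf("%.5f\n", sites_per_sec * 7.84 / cpu_hz);
	return 0;
}
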
> +
> +
> +
> +* Markers impact under heavy memory load
> +
> +Running a kernel with my LTTng instrumentation set, in a test that
> +generates memory pressure (from userspace) by trashing L1 and L2 caches
> +between calls to getppid() (note: syscall_trace is active and calls
> +a marker upon syscall entry and syscall exit; markers are disarmed).
> +This test is done in user-space, so there are some delays due to IRQs
> +coming and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task with -20
> +nice level)
> +
> +My first set of results : Linear cache trashing, turned out not to be
> +very interesting, because it seems like the linearity of the memset on a
> +full array is somehow detected and it does not "really" trash the
> +caches.
> +
> +Now the most interesting result : Random walk L1 and L2 trashing
> +surrounding a getppid() call.
> +
> +- Markers compiled out (but syscall_trace execution forced)
> +number of tests : 10000
> +No memory pressure
> +Reading timestamps takes 108.033 cycles
> +getppid : 1681.4 cycles
> +With memory pressure
> +Reading timestamps takes 102.938 cycles
> +getppid : 15691.6 cycles
> +
> +
> +- With the immediate values based markers:
> +number of tests : 10000
> +No memory pressure
> +Reading timestamps takes 108.006 cycles
> +getppid : 1681.84 cycles
> +With memory pressure
> +Reading timestamps takes 100.291 cycles
> +getppid : 11793 cycles
> +
> +
> +- With global variables based markers:
> +number of tests : 10000
> +No memory pressure
> +Reading timestamps takes 107.999 cycles
> +getppid : 1669.06 cycles
> +With memory pressure
> +Reading timestamps takes 102.839 cycles
> +getppid : 12535 cycles
> +
> +The result is quite interesting in that the kernel is slower without
> +markers than with markers. I explain it by the fact that the data
> +accessed is not layed out in the same manner in the cache lines when the

laid out

> +markers are compiled in or out. It seems that it aligns the function's
> +data better to compile-in the markers in this case.
> +
> +But since the interesting comparison is between the immediate values and
> +global variables based markers, and because they share the same memory
> +layout, except for the movl being replaced by a movz, we see that the
> +global variable based markers (2 markers) adds 742 cycles to each system
> +call (syscall entry and exit are traced and memory locations for both
> +global variables lie on the same cache line).
> +
> +
> +- Test redone with less iterations, but with error estimates
> +
> +10 runs of 100 iterations each: Tests done on a 3GHz P4. Here I run getppid with
> +syscall trace inactive, comparing memory pressure and w/o memory pressure.

^ +with (?)
also, spell out "without", please.

> +(sorry, my system is not setup to execute syscall_trace this time, but it will
> +make the point anyway).
> +
> +No memory pressure
> +Reading timestamps: 150.92 cycles, std dev. 1.01 cycles
> +getppid: 1462.09 cycles, std dev. 18.87 cycles
> +
> +With memory pressure
> +Reading timestamps: 578.22 cycles, std dev. 269.51 cycles
> +getppid: 17113.33 cycles, std dev. 1655.92 cycles
> +
> +
> +Now for memory read timing: (10 runs, branches per test: 100000)
> +Memory read based branch:
> + 644.09 cycles, std dev. 11.39 cycles
> +L1 cache hit based branch:
> + 88.16 cycles, std dev. 1.35 cycles
> +
> +
> +So, now that we have the raw results, let's calculate:
> +
> +Memory read:
> +644.09±11.39 - 88.16±1.35 = 555.93±11.46 cycles

What character is this that I cannot read (not displayed properly
by my email client maybe)? <something> after 644.09 and before the
+- symbol, repeated just before all of the +- symbols.


> +Getppid without memory pressure:
> +1462.09±18.87 - 150.92±1.01 = 1311.17±18.90 cycles
> +
> +Getppid with memory pressure:
> +17113.33±1655.92 - 578.22±269.51 = 16535.11±1677.71 cycles
> +
> +Therefore, if we add 2 markers not based on immediate values to the getppid
> +code, which would add 2 memory reads, we would add
> +2 * 555.93±12.74 = 1111.86±25.48 cycles
> +
> +Therefore,
> +
> +1111.86±25.48 / 16535.11±1677.71 = 0.0672
> + relative error: sqrt(((25.48/1111.86)^2)+((1677.71/16535.11)^2))
> + = 0.1040
> + absolute error: 0.1040 * 0.0672 = 0.0070
> +
> +Therefore: 0.0672±0.0070 * 100% = 6.72±0.70 %
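
Since the propagation rules are used without being stated, maybe spell them
out: absolute errors add in quadrature for the subtraction, relative errors add
in quadrature for the ratio. A small sketch of the computation:

#include <math.h>
#include <stdio.h>

/* error of (a - b) for independent measurements with std. dev. da, db */
static double sub_err(double da, double db)
{
	return sqrt(da * da + db * db);
}

/* relative error of (a / b) given absolute errors da, db */
static double div_rel_err(double a, double da, double b, double db)
{
	return sqrt((da / a) * (da / a) + (db / b) * (db / b));
}

int main(void)
{
	/* memory read cost: (644.09 +/- 11.39) - (88.16 +/- 1.35) cycles */
	printf("%.2f +/- %.2f cycles\n", 644.09 - 88.16, sub_err(11.39, 1.35));
	/* relative error of (1111.86 +/- 25.48) / (16535.11 +/- 1677.71): ~0.1040 */
	printf("%.4f\n", div_rel_err(1111.86, 25.48, 16535.11, 1677.71));
	return 0;
}
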
> +
> +We can therefore affirm that adding 2 markers to getppid, on a system with high
> +memory pressure, would have a performance hit of at least 6.0% on the system
> +call time, all within the uncertainty limits of these tests. The same applies to
> +other kernel code paths. The smaller those code paths are, the highest the
> +impact ratio will be.
> +
> +Therefore, not only is it interesting to use the immediate values to dynamically
> +activate dormant code such as the markers, but I think it should also be
> +considered as a replacement for many of the "read mostly" static variables.


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***