[rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64

From: clameter
Date: Mon Nov 19 2007 - 20:13:48 EST


This is a pretty early draft of the patch. It works on
x86_64 only. It's a bit massive, so I'd like to have some feedback
before proceeding (and maybe some help?).

Support for other arches has not been tested yet.

The patch establishes a new set of cpu operations that exploit
single instruction atomicity to modify per cpu variables without
disabling/enabling preempt or interrupts and without an offset
calculation to determine the location of the variable on the
current processor.

It then implements these operations on x86_64 after consolidating
per cpu access for allocpercpu, percpu and pda. All per
cpu data is then accessible via a gs segment override.

This results in a reduction in code size of the kernel and in more efficient
operation of per cpu access.

Before:
   text    data     bss     dec     hex filename
4041907  512371 1302360 5856638  595d7e vmlinux

After (this includes the code added for the cpu allocator!):

   text    data     bss     dec     hex filename
3861532  527715 1298072 5687319  56c817 vmlinux
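
As a rough cross-check of the two size(1) rows above, here is a small
sketch. The struct and helper names are made up for illustration; only
the numbers come from the tables.

```c
#include <assert.h>

/* One row of size(1) output; field names are hypothetical. */
struct size_row {
	long text;
	long data;
	long bss;
};

/* Numbers copied from the Before/After tables. */
static const struct size_row size_before = { 4041907, 512371, 1302360 };
static const struct size_row size_after  = { 3861532, 527715, 1298072 };

/* The "dec" column is the sum of text, data and bss. */
static long size_total(struct size_row r)
{
	assert(r.text >= 0 && r.data >= 0 && r.bss >= 0);
	return r.text + r.data + r.bss;
}

/* Change in total image size, negative means the kernel shrank. */
static long size_delta(struct size_row a, struct size_row b)
{
	return size_total(b) - size_total(a);
}
```

size_delta(size_before, size_after) comes out at -169319 bytes: text
shrinks by 180375 bytes and bss by 4288, while data grows by 15344.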


On x86_64 the segment override results in the following change for a simple
vm counter increment:

Before:

	mov %gs:0x8,%rdx		Get smp_processor_id
	mov tableoffset,%rax		Get table base
	incq varoffset(%rax,%rdx,1)	Perform the operation with a complex lookup
					adding the var offset

An interrupt or a reschedule can move the execution thread to another
processor between these instructions if interrupts or preemption are not
disabled. Then the variable of the wrong processor may be updated in a
racy way.
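
The window can be modelled in plain userspace C. This is only a
simulation under made-up names (counters, current_cpu, racy_inc); it is
not code from the patch. The migration is injected by hand at the point
where an interrupt or reschedule could strike.

```c
#include <assert.h>

#define NR_CPUS 2

/* Simulated per cpu counters, one slot per processor. */
static long counters[NR_CPUS];

/* Simulated processor the thread is currently running on. */
static int current_cpu;

/*
 * Models the multi instruction sequence. If migrate_to >= 0 the thread
 * is moved to that processor between the id lookup and the increment.
 * Returns 1 if the increment hit another processor's counter, else 0.
 */
static int racy_inc(int migrate_to)
{
	int cpu = current_cpu;			/* mov %gs:0x8,%rdx */

	assert(migrate_to < NR_CPUS);
	if (migrate_to >= 0)
		current_cpu = migrate_to;	/* thread got moved */

	counters[cpu]++;			/* incq varoffset(%rax,%rdx,1) */
	return cpu != current_cpu;
}
```

Calling racy_inc(-1) updates the local counter. Calling racy_inc(1)
while on processor 0 still increments counters[0] although the thread
now runs on processor 1.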

After:

incq %gs:varoffset(%rip)

This is a single instruction that is safe from interrupts and from moving
of the execution thread. It will reliably operate on the current
processor's data area.
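
To illustrate the address formation, here is a userspace model in which
a pointer stands in for the gs segment base. All names (struct cpu_area,
gs_base, migrate, cpu_inc) are hypothetical and the real thing is a
single hardware instruction, not a C function.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 2

/* Hypothetical layout of one per cpu area. */
struct cpu_area {
	long vm_events;
	long other;
};

static struct cpu_area areas[NR_CPUS];

/* Stand-in for the gs base: points at the current processor's area. */
static char *gs_base = (char *)&areas[0];

/* Moving the thread to another processor switches the base. */
static void migrate(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	gs_base = (char *)&areas[cpu];
}

/*
 * Models "incq %gs:varoffset": base and offset are combined at the
 * moment of the increment, so there is no stale processor id around.
 */
static void cpu_inc(size_t varoffset)
{
	long *p = (long *)(gs_base + varoffset);

	(*p)++;
}
```

A caller passes offsetof(struct cpu_area, vm_events) and never looks up
a processor id. In this model a migrate() call can only happen before
or after cpu_inc(), never in the middle.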

Other platforms can also perform address relocation plus atomic ops on
a memory location. Exploiting the atomicity of instructions vs. interrupts
is therefore possible there as well and will reduce the cpu op processing
overhead.

For example, on IA64 we have a per cpu virtual mapping of the per cpu
area. If we add an offset to the per cpu area variable address then we
can guarantee that we always hit the per cpu area local to the processor.

Other platforms (SPARC?) have registers that can be used to form addresses.
If the cpu area address is in one of those then atomic per cpu modifications
can be generated for those platforms in the same way.

SLUB's best-case performance in the fastpath goes from 47 cycles to
41 cycles through the use of the segment override.

