My far-into-the-future ideal for this is to have a generic vDSO-type
library that is compiled into the kernel that provides a collection of
architecture-optimized routines available in both kernelspace and
userspace by mapping it into each process' address space. Such a
library could effectively automatically provide correct and optimized
assembly routines for the currently booted CPU/arch/subarch/etc, so
that userspace tools could be compiled once and run on an entire
family of CPUs without modification. On the other hand, for those
applications that need every last ounce of speed (Including parts of
the kernel), you could pass appropriate options to the compiler to
tell it to inline the assembly routines (alternative) for a single
CPU make/model.