There was some discussion at one point about using gcc's TLS mechanism, which should allow even better code to be generated. Unfortunately, that would require gcc to address TLS through %gs instead of %fs (and vice versa on i386), which I don't think is available in anything except maybe the most cutting-edge versions of gcc.
You can't use __thread, because GCC will cache the computed addresses
of __thread variables across context switches and CPU changes.
It's been tried before on powerpc; it doesn't work.
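
For anyone following along, here is a rough userspace sketch of the access pattern in question. It is not kernel code: counter stands in for a per-CPU variable, and maybe_migrate() is a made-up name for any point where the kernel could preempt the task and move it to another CPU. Both names are assumptions for illustration only.

```c
/* Hypothetical stand-in for a per-CPU variable, declared with GCC's __thread. */
static __thread int counter;

/*
 * Hypothetical stand-in for a point where the task could be preempted and
 * resumed on a different CPU. In userspace it does nothing except act as an
 * opaque call with a compiler memory barrier.
 */
__attribute__((noinline)) void maybe_migrate(void)
{
    __asm__ volatile("" ::: "memory");
}

int bump_twice(void)
{
    counter++;        /* address of counter is derived from the TLS base */
    maybe_migrate();  /* the compiler assumes the TLS base is unchanged here */
    counter++;        /* so it is free to reuse the address computed above */
    return counter;
}
```

In userspace this is fine, since a thread's TLS base never changes underneath it, so reusing the address is a valid optimization. The memory barrier forces the value of counter to be reloaded, but it does not force the address to be recomputed. If the same construct were used for per-CPU data and the task migrated inside maybe_migrate(), the second increment could land on the previous CPU's copy.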