Re: [RFC PATCH 00/16] PTI support for x86-32

From: H. Peter Anvin
Date: Mon Jan 22 2018 - 16:30:32 EST


On 01/22/18 12:14, Linus Torvalds wrote:
> On Sun, Jan 21, 2018 at 6:20 PM, <hpa@xxxxxxxxx> wrote:
>>
>> No idea about Intel, but at least on Transmeta CPUs the limit check was asynchronous with the access.
>
> Yes, but TMTA had a really odd uarch and didn't check segment limits natively.
>

Only on TM3000 ("Wilma") and TM5000 ("Fred"), not on TM8000 ("Astro").
Astro might in fact have been more synchronous than most modern machines
(see below).

> When you do it in hardware, the limit check is actually fairly natural
> to do early rather than late (since it acts on the linear address
> _before_ base add and TLB lookup).
>
> So it's not like it can't be done late, but there are reasons why a
> traditional microarchitecture might always end up doing the limit
> check early and so segmentation might be a good defense against
> meltdown on 32-bit Intel.

I will try to investigate, but as you can imagine the amount of
bandwidth I might be able to get on this is definitely going to be limited.

All of the below is generic discussion that almost certainly can be
found in some form in Hennessy & Patterson, and so I don't have to worry
about giving away Intel secrets:

It isn't really true that it is natural to check this early. One of the
most fundamental frequency limiters in a modern CPU architecture
(meaning anything from the last 20 years or so) has been the
data-dependent AGU-D$-AGU loop. Note that this doesn't even include the
TLB: the TLB is looked up in parallel with the D$, and if the outcome is
*either* a cache-TLB mismatch or a TLB miss, the result is prevented from
committing.

In the case of the x86, the AGU receives up to three sources (base,
scaled index and displacement) plus the segment base. If the target
process and the gates available allow it, the AGU might be designed with
a unified 4-input adder, with the 3-input sum needed for limit checks
being done separately.
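
As a toy illustration of the two sums (a sketch only, not a description
of any real design: the struct and function names are made up, and only
an expand-up, byte-granular 32-bit segment is modeled), the 4-input add
feeds the D$/TLB lookup while the 3-input add feeds the limit compare:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit x86 memory operand. */
struct mem_operand {
    uint32_t base;   /* base register value */
    uint32_t index;  /* index register value */
    uint8_t  scale;  /* 1, 2, 4 or 8 */
    uint32_t disp;   /* displacement */
};

/* Hypothetical, simplified segment: expand-up, byte-granular. */
struct segment {
    uint32_t base;
    uint32_t limit;  /* highest valid offset within the segment */
};

/* 3-input sum: the effective address (offset within the segment). */
static uint32_t effective_address(const struct mem_operand *op)
{
    return op->base + op->index * op->scale + op->disp;
}

/* 4-input sum: the linear address that goes to the D$ and the TLB. */
static uint32_t linear_address(const struct segment *seg,
                               const struct mem_operand *op)
{
    return seg->base + op->base + op->index * op->scale + op->disp;
}

/* Limit check on the offset only; it never needs the segment base,
 * so it can be computed off the load's critical path. */
static bool limit_ok(const struct segment *seg,
                     const struct mem_operand *op, uint32_t size)
{
    /* Widen to 64 bits so an access running past 4 GiB fails the check. */
    uint64_t last = (uint64_t)effective_address(op) + size - 1;

    return size != 0 && last <= seg->limit;
}
```

Nothing in linear_address() depends on limit_ok(), which is the point:
the load can be issued before the compare has produced a result.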

Misses and even more so exceptions (which are far less frequent than
misses) are demoted to a slower path where the goal is to prevent commit
rather than trying to race to be in the data path. So although it is
natural to *issue* the load and the limit check at the same time, the
limit check is still going to be deferred. Whether or not it is
permitted to be fully asynchronous with the load is probably a tradeoff
of timing requirements vs complexity. At least theoretically one could
imagine a machine which would take the trap after the speculative
machine had already chased the pointer loop several levels down; this
would most likely mean separate uops to allow for the existing
out-of-order machine to do the bookkeeping.
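
The "prevent commit rather than race the data path" idea can be sketched
as a toy C model. This is purely illustrative: the load_uop record and
the three functions are hypothetical names, not how any real pipeline is
organized. The fast path forwards data immediately; the check result
arrives later and only gates retirement:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one in-flight load. */
struct load_uop {
    uint32_t data;        /* value forwarded speculatively */
    bool     check_done;  /* has the slow-path check reported yet? */
    bool     fault;       /* set by the slow path if the check failed */
};

/* Fast path: data is handed to dependent operations right away,
 * before the limit check has produced a result. */
static uint32_t load_issue(struct load_uop *u, uint32_t value_from_cache)
{
    u->data = value_from_cache;
    u->check_done = false;
    u->fault = false;
    return u->data;
}

/* Slow path: the deferred check reports some time later. */
static void slow_check_complete(struct load_uop *u, bool in_bounds)
{
    u->check_done = true;
    u->fault = !in_bounds;
}

/* Retirement: the load commits only once the check has finished
 * and passed; otherwise it waits or takes the trap. */
static bool can_commit(const struct load_uop *u)
{
    return u->check_done && !u->fault;
}
```

In this model dependent operations may already have consumed the value
returned by load_issue() by the time slow_check_complete() reports a
fault; only can_commit() stops the architectural state from changing.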

-hpa