Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

From: Andy Lutomirski
Date: Mon Nov 17 2014 - 19:06:49 EST

On Mon, Nov 17, 2014 at 3:16 PM, Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>> I still wonder whether the timeout code is the real culprit. My patch
>> will slow down entry into do_machine_check by tens of cycles, several
>> cachelines, and possibly a couple of TLB misses. Given that the
>> timing seemed marginal to me, it's possible (albeit not that likely)
>> that it pushed the time needed for synchronization into the range of
>> unreliability.
> I think we have very conservative timeouts. Here are the significant bits from mce
> First SPINUNIT ... how many nanoseconds to spin "ndelay(SPINUNIT)" before pounding on an atomic op to see if we are done:
> #define SPINUNIT 100 /* 100ns */
> Initialization of our timeout - comment above this claims we are aiming for "one second"
> cfg->monarch_timeout = USEC_PER_SEC;
> Inside mce_start() - set up our timeout. Math says this will be 1e9
> u64 timeout = (u64)mca_cfg.monarch_timeout * NSEC_PER_USEC;

Aha. I missed the extra multiplication.

> Now the actual loop:
> while (atomic_read(&mce_callin) != cpus) {
> if (mce_timed_out(&timeout)) {
> atomic_set(&global_nwo, 0);
> return -1;
> }
> ndelay(SPINUNIT);
> }

And I missed the ndelay.

*sigh* reading comprehension wasn't so good there.

>From a debugging POV, it would be kind of nice to know which of the
four mce_timed_out calls is reporting the failure. IIUC, the first
one would mean that a cpu didn't make it in to do_machine_check at
all, whereas the second one would mean that a cpu got lost after
entering do_machine_check. Also, the second one presumably knows how
far it got (i.e. mce_executing and order).

It could also be interesting to tweak mce_panic to not actually panic
the machine but to try to return and stop the test instead. Then real
debugging could be possible :)

I still wish I could test this. The mce-inject code is completely
useless here, since it calls do_machine_check directly instead of
going through the MCE code.

> And the inside of mce_timed_out() ... we decrement our 1e9 by 100 for each call:
> if ((s64)*t < SPINUNIT) {
> if (mca_cfg.tolerant <= 1)
> mce_panic("Timeout synchronizing machine check over CPUs",
> cpu_missing = 1;
> return 1;
> }
> *t -= SPINUNIT;
> Not sure what the worst case is for waiting for other processors to show up. Perhaps it is
> when the other cpu is in some deep C-state and we have to wait for it to power the core
> back on and resync clock. Maybe it is when a cpu has just started a "wbinvd" instruction
> with a cache entirely full of dirty lines that belong on some other processor so have to
> be written back across QPI (may when QPI links are congested with 10gbit network traffic?
> But a full second seems like a lot ... you adding a handful of cache and TLB misses shouldn't
> push us over the edge.


> -Tony

Andy Lutomirski
AMA Capital Management, LLC
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at