Re: 2.6.12-mm1 boot failure on NUMA box.

From: Martin J. Bligh
Date: Sat Jun 25 2005 - 01:43:30 EST




--Ingo Molnar <mingo@xxxxxxx> wrote (on Saturday, June 25, 2005 06:00:52 +0200):

>
> * Martin J. Bligh <Martin.Bligh@xxxxxxxxxx> wrote:
>
>> > is the only problem the unsyncedness? That should be fine as far as the
>> > scheduler is concerned. (we compensate for per-CPU drifts)
>>
>> Well, I think so. But I don't see how you're going to compensate for
>> large-scale unsynced-ness safely. I've always completely avoided the
>> TSC altogether on NUMA-Q ... would prefer to keep it that way. We got
>> lots of wierd random crashes, panics, hangs, and -ve time offsets
>> returned from userspace time commands last time I tried it.
>
> ok. Would be nice to check whether reverting that single change solves
> the boot problem. If it does i'll switch the measurement method to be
> do_gettimeoffset based, that way the measurement will still be accurate.
> (which is needed for precise migration cost results) Right now the
> calibration uses sched_clock() - which was the reason for the change.

OK, will test that.

> (btw., if the TSC is that unreliable on numaq boxes, shouldnt we disable
> it for userspace apps too? Or are those hangs purely kernel bugs? In
> which case it might make sense to debug those a bit more - large-scale
> TSC unsyncedness is something that could slip in on other hardware too.)

Well it reads reliably. it just reliably reads utter random crap (well,
across CPUs). Not many things read tsc from userspace, and it won't hang
I guess .... depends what their expecations are. I do like gettimeofday
not to go backwards though - that tends to bugger things up ;-)

M.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/