Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.

From: Borislav Petkov
Date: Mon Jul 14 2014 - 11:15:28 EST

Next message: Oleg Nesterov: "Re: sched, timers: use after free in __lock_task_sighand when exiting a process"
Previous message: Christoph Lameter: "Re: [RFC/PATCH -next 00/21] Address sanitizer for kernel (kasan) - dynamic memory error detector."
In reply to: Havard Skinnemoen: "Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values."
Next in thread: Borislav Petkov: "Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Jul 11, 2014 at 05:10:07PM -0700, Havard Skinnemoen wrote:
> 200ms per second means we're using 20% of that CPU. I'd say that's
> definitely too much. But I like the general approach.

Right.

> > Yeah, by "generous" I meant, choose values which fit all. But I realize
> > now that this is a dumb idea. Maybe we could measure it on each system,
> > read the TSC on CMCI entry and exit and thus get an average CMCI
> > duration...
>
> Sounds interesting. Some things that may need some more thought:
>
> 1. What percentage of CPU is OK to use before we consider it a storm?

That is a very good question. Normally, when we don't know the answer,
we leave it user-configurable with a sane default :-)

But if we have to be realistic, anything above 20% of CPU time spent in
storm mode for prolonged periods of time would probably mean this system
needs to get scheduled for maintenance anyway.
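
To make the threshold idea concrete, here is a rough userspace sketch, not
actual MCE code. It assumes we can timestamp handler entry and exit (TSC or
otherwise, converted to ns) and that the limit is a tunable defaulting to
20%. All names (cmci_budget and friends) are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the kernel's MCE code. */
struct cmci_budget {
	uint64_t window_ns;	/* length of the accounting window */
	uint64_t busy_ns;	/* time spent in the CMCI handler this window */
	unsigned int max_pct;	/* tunable threshold, assumed default: 20 */
};

static void cmci_budget_init(struct cmci_budget *b, uint64_t window_ns)
{
	b->window_ns = window_ns;
	b->busy_ns = 0;
	b->max_pct = 20;
}

/* Account one handler run, given its entry and exit timestamps in ns. */
static void cmci_budget_account(struct cmci_budget *b,
				uint64_t entry_ns, uint64_t exit_ns)
{
	b->busy_ns += exit_ns - entry_ns;
}

/* True only if handler time is strictly above max_pct of the window. */
static bool cmci_budget_is_storm(const struct cmci_budget *b)
{
	return b->busy_ns * 100 > b->window_ns * (uint64_t)b->max_pct;
}
```

With a 1s window, 200ms of accumulated handler time sits exactly at 20% and
does not trip the check; anything beyond that does.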

The whole storm thing basically shows that a system is about to fail,
and we're trying to alleviate the performance hit from excessive CMCI
counts by switching to polling, i.e., a prolonged, more graceful hw
failure. :-)

> 2. How do we map that number to polling mode, where we may not see all
> the errors? If we get it wrong, we may end up bouncing at a very high
> rate.

Well, with polling you're bound to miss some errors anyway.

> 3. If we go for a fixed polling rate, how do we make sure it doesn't
> require more CPU than what we determined in (1)?

Yeah, that's the disadvantage of a fixed polling rate - we won't know.
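
If we did have a measured average duration for one poll, the relation
between polling interval and CPU share is simple arithmetic. A sketch under
that assumption (hypothetical helper names, nothing from the patch set):

```c
#include <stdint.h>

/*
 * Hypothetical helpers; avg_poll_ns is assumed to come from measuring
 * the poll routine on the running system.
 */

/* Percentage of one CPU consumed by polling every interval_ns. */
static unsigned int poll_cpu_pct(uint64_t avg_poll_ns, uint64_t interval_ns)
{
	if (interval_ns == 0 || avg_poll_ns >= interval_ns)
		return 100;

	return (unsigned int)(avg_poll_ns * 100 / interval_ns);
}

/*
 * Shortest polling interval that keeps polling at or below max_pct of
 * one CPU, rounded up. A max_pct of 0 means "no budget at all", so
 * return the largest representable interval.
 */
static uint64_t min_poll_interval_ns(uint64_t avg_poll_ns, unsigned int max_pct)
{
	if (max_pct == 0)
		return UINT64_MAX;

	return (avg_poll_ns * 100 + max_pct - 1) / max_pct;
}
```

E.g., a poll that takes 2ms on average must not run more often than every
10ms to stay within a 20% budget.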

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
