Re: [revert] mysql+oltp regression

From: Gregory Haskins
Date: Mon Aug 11 2008 - 09:06:12 EST


Ingo Molnar wrote:
* Gregory Haskins <ghaskins@xxxxxxxxxx> wrote:

Ingo Molnar wrote:
* Mike Galbraith <efault@xxxxxx> wrote:

Greetings,

During regression testing of tip/sched/clock fixes, a regression in low client count throughput turned up, which I traced back to the commit below. I don't see anything wrong with it, but suspect that it is preventing client/server pairs from staying together on the same CPU as buddies, which mysql definitely likes quite a lot. (I suspect that this is the case, because I've seen this same performance curve while tinkering with wakeup affinity and breaking it all to pieces;)

Changelog and test results below in case nobody sees a problem with the commit itself.
i've applied your fix to tip/sched/urgent for the time being, thanks Mike for tracking it down. We can re-try newer iterations of Greg's patch in tip/sched/devel.

Hmm.. The patch still looks correct afaict. I fear we are just papering over some other issue by reverting it, but I will try to see if I can track this down. We will, of course, now be skipping trying to balance the (effectively random) last task in the queue, which may or may not result in better performance through sheer luck rather than algorithmic intelligence. This makes me nervous.

yeah - but we had that behavior for quite some time.

This is how the patch cycle works normally: we had a fair chance to discover this problem in your testing, then in -tip testing, and then in linux-next or -mm, but we didn't find it at any stage.

Now we are in the upstream release cycle so unless there's some immediate fix available (or there are _really_ strong reasons against the revert) doing the revert is the right approach.

A revert is not necessarily the indicator of the quality of the change in question, it is a tester-driven exception event that guarantees that the kernel improves in a monotonic way. (for all testers who opt to help us in doing so)

And given that the problem was readily reproducible for Mike, it should be reproducible for you as well - so we don't actually make the bug harder to fix by doing the revert.

Perhaps we should introduce the notion of "Defer-to-next-release" reverts - which this really is - in contrast to "Revert-because-bad", which your change definitely is not.

Hi Ingo,
Understood, and a totally reasonable stance. I mostly wanted to make sure it was understood that I don't think I can "fix" that particular patch since I think it was already correct. Rather, I will have to try to identify some other area (presumably the load balancer) to harmonize with it. I think we are on the same page, though. :)


Speaking of this: Another patch I submitted to you Ingo (had to do with updating the load_weight inside task_setprio) seems to also have this phenomenon: i.e. it's technically correct but further testing has revealed negative repercussions elsewhere. So please ignore that patch (or revert it if you already pulled it in, but I don't think you have). I'll try to look into this issue as well.

ok, under which thread/subject is that? Not queued in tip/sched/* yet, correct?
Here is the original thread:

http://lkml.org/lkml/2008/7/3/416

I do not believe you have queued it anywhere (public anyway) yet.

Note I have already invalidated 1/2, and now I am retracting 2/2 as well. (1/2 is actually a bogus patch; 2/2 is "technically correct" but causes ripples in the load balancer that need to be sorted out first.)

Thanks!
-Greg

