ok, now would this be useful? (Re: Dynamic configure max_cstate)

From: Andreas Mohr
Date: Tue Jul 28 2009 - 13:35:40 EST


On Tue, Jul 28, 2009 at 04:03:08PM +0200, Andreas Mohr wrote:
> Still, an average of +8.16% during 5 test runs each should be quite some incentive,
> and once there's a proper "idle latency skipping during expected I/O replies"
> even with idle/wakeup code path reinstated we should hopefully be able to keep
> some 5% improvement in disk access.

I went ahead and created a small and VERY dirty test for this.

In kernel/pm_qos_params.c I added

static bool io_reply_is_expected;

bool io_reply_expected(void)
{
        return io_reply_is_expected;
}
EXPORT_SYMBOL_GPL(io_reply_expected);

void set_io_reply_expected(bool expected)
{
        io_reply_is_expected = expected;
}
EXPORT_SYMBOL_GPL(set_io_reply_expected);



Then in drivers/ata/libata-core.c I added

extern void set_io_reply_expected(bool expected);

and changed the wait for the command completion to

set_io_reply_expected(true);
rc = wait_for_completion_timeout(&wait, msecs_to_jiffies(timeout));
set_io_reply_expected(false);

ata_port_flush_task(ap);


Then I changed drivers/cpuidle/governors/menu.c
(make sure you're using the menu governor!) to use

extern bool io_reply_expected(void);

and updated

if (io_reply_expected()) {
        data->expected_us = 10;
} else {
        /* determine the expected residency time */
        data->expected_us =
                (u32) ktime_to_ns(tick_nohz_get_sleep_length()) / 1000;
}

Rebuilt, reinstalled the bootloader ;), rebooted, and then booting and disk
operation _seemed_ to be snappier (I'm damn sure the hdd seek noise
is a bit higher-pitched now ;)).
And it's exactly seeks that should benefit from shorter wakeup intervals now,
since the system triggers a hdd operation and then is forced to wait (idle)
until the seeking is done.

bonnie test results (patched kernel vs. a kernel with set_io_reply_expected() muted)
seem to support this, but a "time make bzImage" (on a freshly rebooted box each time)
showed inconsistent results again, and a much larger number of samples (with a reboot
before each run) would be needed to really confirm this.

I'd expect improvements to be in the 3% to 4% range, at most, but still,
compared to the yield of other kernel patches this ain't nothing.

Now the question becomes whether one should implement such an improvement and, especially, how.
Perhaps the I/O reply decision making should be folded into tick_nohz_get_sleep_length()
itself, or rather into a new higher-level "expected sleep length" helper which consults both
tick_nohz_get_sleep_length() and the I/O reply mechanism, roughly as sketched below.
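
A minimal sketch of what such a helper might look like (menu_expected_sleep_us()
is a name I just made up for illustration, not anything that exists in the tree):

/* Hypothetical helper: combine the nohz sleep length estimate with the
 * I/O reply hint, so the menu governor avoids deep C-states while a
 * disk completion is imminent. */
static u32 menu_expected_sleep_us(void)
{
        u32 sleep_us =
                (u32) ktime_to_ns(tick_nohz_get_sleep_length()) / 1000;

        /* cap the expected idle time while an I/O reply is pending;
         * 10 us matches the value hacked into menu.c above */
        if (io_reply_expected())
                return min_t(u32, sleep_us, 10);

        return sleep_us;
}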
Another important detail is that my current hack completely ignores per-cpu operation
and thus hurts power savings on _all_ cpus,
not just the one actually waiting for the I/O reply (i.e., we should properly take
the cpu affinity settings of the reply interrupt into account); see the rough per-cpu
sketch below.
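
Something like this, perhaps (pure sketch, untested; how to derive the right cpu
from the interrupt's affinity mask is left open, and set_io_reply_expected_on()
is an invented name):

/* per-cpu variant (needs <linux/percpu.h>): only the cpu expected to
 * take the completion interrupt gets the "reply expected" hint */
static DEFINE_PER_CPU(bool, io_reply_is_expected);

void set_io_reply_expected_on(int cpu, bool expected)
{
        per_cpu(io_reply_is_expected, cpu) = expected;
}

bool io_reply_expected(void)
{
        /* called from the idle path, i.e. on the cpu about to go idle */
        return __get_cpu_var(io_reply_is_expected);
}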
And of course it would probably be best to create a mechanism which keeps a record of the
average response delays of the various block devices and then derives a maximum
idle wakeup latency value to request from that; a sketch of the bookkeeping follows.
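
Maybe something as simple as an exponentially weighted moving average per device
(again just a sketch, all names invented; the resulting value could then be fed
into a pm_qos cpu latency request):

/* per-device completion latency bookkeeping, in usecs */
struct io_latency_stats {
        u32 avg_us;     /* exponentially weighted moving average */
};

static void io_latency_update(struct io_latency_stats *st, u32 sample_us)
{
        /* EWMA, new sample weighted 1/8 */
        st->avg_us = st->avg_us - (st->avg_us >> 3) + (sample_us >> 3);
}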

Does anyone else have thoughts on this or benchmark numbers which would support this?

Andreas Mohr