RE: Direct io on block device has performance regression on 2.6.x kernel

From: Chen, Kenneth W
Date: Thu Mar 10 2005 - 13:50:05 EST

Andrew Morton wrote on Wednesday, March 09, 2005 8:10 PM
> > 2.6.9 kernel is 6% slower compare to distributor's 2.4 kernel (RHEL3). Roughly
> > 2% came from storage driver (I'm not allowed to say anything beyond that, there
> > is a fix though).
> The codepaths are indeed longer in 2.6.

Thank you for acknowledging this.

> > 2% came from DIO.
> hm, that's not a lot.
> ....
> 2% is pretty thin :(

This is exactly why I did not want to put these numbers out in the
first place: most people underestimate the magnitude of these
percentage points.

Now I have to give a speech on "performance optimization 101". Take a
look at this page:
This page tracks the development of gcc and measures the performance of
gcc with SPECint2000. Study the last chart, take out your calculator
and calculate how much performance gain gcc made over the course of 3.5
years of development. Also please factor in the kind of manpower that
went into the compiler development.

Only once people understand the kind of scale to expect when evaluating
a complex piece of software can we talk about a database transaction
processing benchmark. This benchmark goes one step further. It benchmarks
the entire software stack (kernel/library/application/compiler), the
entire hardware platform (cpu/memory/IO/chipset) and, on the grand
scale, system integration: storage, network, interconnect, mid-tier app
server, front end clients, etc. Any specific function or component
represents only a small portion of the entire system, essential but
small. For example, the hottest function in the kernel accounts for 7.5%
of kernel time, and kernel time is 20% of the total. If we throw away
that function entirely, the direct impact on total cpu cycles is only
1.5%.

So what's the point? The point is that whether a number is thin or thick
has to be judged against the complexity of the system under test (SUT).
It has to be judged against a scale relevant to that particular workload,
and the scale has to be laid out so that it correctly represents the
weight of each component.

Losing 6% just from the Linux kernel is a huge deal for this type of benchmark.
People work for days to implement features which might give a sub-percentage-point
gain. Making software run faster is not easy, but making software run slower
is apparently a fairly easy task.

> Fine-grained alignment is probably too hard, and it should fall back to
> __blockdev_direct_IO().
> Does it do the right thing with a request which is non-page-aligned, but
> 512-byte aligned?
> readv and writev?

That's why direct_io_worker() is slower. It does everything and handles
every possible usage scenario out there. I hope making the function fatter
is not in the plan.
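
To make the alignment cases in the questions above concrete, here is a
rough sketch of telling a page-aligned request from one that is only
512-byte aligned, including a readv/writev-style vector. It assumes a
4096-byte page and a 512-byte sector, and every name in it is made up
for illustration. It is not how direct_io_worker() or
__blockdev_direct_IO() actually decide anything.

```python
PAGE_SIZE = 4096    # assumption: 4 KB pages
SECTOR_SIZE = 512   # assumption: 512-byte sectors

ORDER = ["page-aligned", "sector-aligned", "unaligned"]


def classify(buf_addr: int, offset: int, length: int) -> str:
    """Classify one request by the coarsest alignment that the buffer
    address, file offset and length all satisfy."""
    values = (buf_addr, offset, length)
    if all(v % PAGE_SIZE == 0 for v in values):
        return "page-aligned"
    if all(v % SECTOR_SIZE == 0 for v in values):
        return "sector-aligned"
    return "unaligned"


def classify_vector(segments, offset: int) -> str:
    """Classify a readv/writev-style request given (buf_addr, length)
    segments: the weakest segment decides the class of the whole request."""
    worst = "page-aligned"
    for buf_addr, length in segments:
        kind = classify(buf_addr, offset, length)
        if ORDER.index(kind) > ORDER.index(worst):
            worst = kind
        offset += length   # segments are laid out back to back in the file
    return worst


print(classify(0x10000, 0, 8192))
print(classify(0x10200, 512, 1024))
print(classify_vector([(0x10000, 4096), (0x20200, 512)], 0))
```

A fast path would only ever take the first class; each extra class it
tries to handle is more code on the hot path.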

- Ken
