Re: [PATCH 0/2] pdflush fix and enhancement

From: Andi Kleen
Date: Wed Dec 31 2008 - 08:14:25 EST


> I say most because the assumption would be that we will be successful in
> creating the new thread. Not that bad an assumption I think. Besides,

And that the memory read is not reordered (rmb()).
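
To make that concrete, here is a rough sketch of the usual wmb()/rmb()
pairing; the names are made up and the real pdflush check looks
different, this is only meant to show why the barrier matters:

#include <asm/system.h>		/* rmb()/wmb() on 2.6.x */

static struct pdflush_work *new_worker;	/* data published by the creator */
static int worker_ready;		/* flag observed by the checker */

/* creator side: publish the data, then raise the flag */
static void publish_worker(struct pdflush_work *w)
{
	new_worker = w;
	wmb();		/* order the data store before the flag store */
	worker_ready = 1;
}

/* checker side: see the flag, then read the data */
static struct pdflush_work *lookup_worker(void)
{
	if (!worker_ready)
		return NULL;
	rmb();		/* order the flag read before the data read */
	return new_worker;
}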

> the consequences of a miss are not harmful.

Nod. Sounds reasonable.

>
> >
> > > More to the point, on small systems with few file systems, what is the
> > > point of having 8 (the current max) threads competing with each other on
> > > a single disk? Likewise, on a 64-way or larger system with dozens of
> > > filesystems/disks, why wouldn't I want more background flushing?
> >
> > That makes some sense, but perhaps it would be better to base the default
> > size on the number of underlying block devices then?
> >
> > OK, one issue there is that there are lots of different types of
> > block devices, e.g. a big RAID array may look like a single disk.
> > Still, I suspect defaults based on the block devices would do reasonably
> > well.
>
> Could be... However, bear in mind that we traverse *filesystems*, not
> block devs with background_writeout() (the pdflush work function).

My thinking was that on traditional block devices you roughly
want only N flushers per spindle, with N a small number, because
otherwise they will just seek too much.

Anyway, IIRC there's now a way to distinguish SSDs from normal
block devices based on hints from the block layer, but that still
doesn't handle the big RAID array case well.
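
As a rough sketch of what a per-device default could look like,
assuming the block layer exposes the rotational hint via something
like blk_queue_nonrot(); the numbers here are pulled out of thin air:

#include <linux/blkdev.h>

/*
 * Illustrative only: pick a default number of flushers for one device
 * from the rotational hint.  A big RAID array still shows up as a
 * single (rotational) device here, which is the unsolved case.
 */
static int default_flushers_for(struct block_device *bdev)
{
	struct request_queue *q = bdev_get_queue(bdev);

	if (q && blk_queue_nonrot(q))
		return 4;	/* SSD: seeks are cheap, allow more flushers */

	return 1;		/* spinning disk: avoid competing seekers */
}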

>
> But even if we went by block devices, we still don't know the
> speed of those devices (SSD vs. RAID vs. plain disk) and consequently
> we don't know how many threads to throw at a device before it becomes
> congested and we're merely spinning our wheels. I mean, an SSD at
> 500MB/s (or greater) certainly could handle more pages being thrown at
> it than an IDE drive...

I was thinking just of the initial default, but you're right:
the upper limit really needs tuning too.

>
> And this ties back to MAX_WRITEBACK_PAGES (currently 1k), which is the
> chunk we attempt to write out in one pass so that we don't "hold the
> inode lock too long".
>
> What is the right magic number for the various types of block devs? 1k
> for all? for all time? :-)

OK, it probably needs some kind of feedback mechanism.

Perhaps keep an estimate of the average IO time for a single
flush and start more threads when it crosses some threshold?
Or get feedback from the elevators on how busy they are.

Of course it would still need an upper limit to prevent
a thread explosion in case IO suddenly becomes very slow
(e.g. in an error recovery case), but it could be much
higher than today's.
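
Purely as a sketch of that feedback idea (the threshold, the cap and
start_one_more_flusher() are invented here, and real code would need
locking around the shared state):

#include <linux/jiffies.h>

/* invented knobs, just to make the idea concrete */
#define FLUSH_SLOW_THRESHOLD	(HZ / 2)	/* "one flush pass is slow" */
#define FLUSHER_HARD_CAP	64		/* prevent a thread explosion */

static unsigned long avg_flush_jiffies;		/* running average per pass */

extern void start_one_more_flusher(void);	/* stand-in, not a real API */

static void account_flush_pass(unsigned long start, int nr_flushers)
{
	unsigned long took = jiffies - start;

	/* cheap exponentially weighted average: new = 7/8 old + 1/8 sample */
	avg_flush_jiffies = (7 * avg_flush_jiffies + took) / 8;

	/*
	 * If writing one MAX_WRITEBACK_PAGES chunk is consistently slow and
	 * we are below the hard cap, add another flusher.  The cap keeps
	 * suddenly slow IO (e.g. error recovery) from spawning hundreds of
	 * threads.
	 */
	if (avg_flush_jiffies > FLUSH_SLOW_THRESHOLD &&
	    nr_flushers < FLUSHER_HARD_CAP)
		start_one_more_flusher();
}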

>
> Anyway, back to the traversal of filesystems. In writeback_inodes(), we
> currently traverse the super block list in reverse. I don't quite
> understand why we do this, but <shrug>.
>
> What this does mean is that we unfairly penalize certain file systems when
> attempting to clean dirty pages. If I have 5 filesystems, all getting
> hit on, then the last one in will always be the 'cleanest'. Not sure
> that makes sense.

Probably not.

>
> I was thinking about a patch that would go both directions - forward and
> reverse depending upon, say, a bit in jiffies... Certainly not perfect,
> but a bit more fair.

Better a real RNG. But such probabilistic schemes unfortunately tend
to drive benchmarkers crazy, which is why it is better to avoid them.

I suppose you could just keep some state per fs to ensure fairness.
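
E.g. something as simple as a rotating start index; nth_sb() and
flush_one_sb() are stand-ins for the real list walk in
writeback_inodes(), this only shows the per-pass state:

#include <linux/fs.h>

/* stand-ins for the real superblock walk; not actual kernel helpers */
extern struct super_block *nth_sb(unsigned int idx);
extern void flush_one_sb(struct super_block *sb);

static unsigned int scan_start;		/* rotating cursor (locking omitted) */

static void writeback_pass(unsigned int nr_sbs)
{
	unsigned int i, first = scan_start++ % nr_sbs;

	/* visit every sb, but start at a different one each pass */
	for (i = 0; i < nr_sbs; i++) {
		struct super_block *sb = nth_sb((first + i) % nr_sbs);

		flush_one_sb(sb);	/* write back this sb's dirty inodes */
	}
}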

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/