Re: Smooth buffer for 2.0.31 [Was: kflushd() patch]

Dr. Werner Fink (werner@suse.de)
Tue, 19 Aug 1997 16:39:06 +0200


Hi,

I've just read your remarks, and after a few comparisons with 2.1.51 I've
reworked the patch, which is appended after your quoted remarks.
I've added a simple mechanism to look past the first few locked buffers
without walking the entire list (sketched just below). I've also changed
the condition for calling wakeup_bdflush(1) in refill_freelist() ... maybe,
due to the change above, it is no longer necessary to add the number of
locked buffers to the dirty ones.
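
For illustration, a user-space sketch of that bounded look-ahead, using
simplified stand-in types rather than the real kernel structures (the
name find_candidate_sketch and the two flag fields are made up; the
threshold of 10 locked buffers mirrors the patch below):

	#include <stddef.h>

	struct buf {
		struct buf *b_next_free;
		int locked;		/* stand-in for buffer_locked(bh)    */
		int reclaimable;	/* stand-in for can_reclaim(bh,size) */
	};

	static struct buf *find_candidate_sketch(struct buf *bh, int *list_len)
	{
		int behind = 0;		/* locked buffers seen so far */

		for (; bh && (*list_len) > 0; bh = bh->b_next_free, (*list_len)--) {
			if (bh->locked) {
				/* Tolerate a few locked buffers at the front;
				   after ~10 assume the rest are locked too. */
				if (behind++ > 10) {
					(*list_len) = 0;
					return NULL;
				}
			} else if (bh->reclaimable)
				return bh;
		}
		return NULL;
	}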

IMHO the fallback of calling grow_buffers(GFP_ATOMIC) should not disappear
entirely, so that a working last resort remains (sketched below).
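
A minimal, compilable sketch of the intended order in refill_freelist():
wake bdflush under the new dirty+locked condition first, and only then
fall back to atomic growth. The helpers are stubs and the numbers are
invented stand-ins for the kernel state in fs/buffer.c:

	#include <stdio.h>

	/* Invented stand-ins for kernel counters and tunables. */
	static int nr_dirty = 70, nr_locked = 20, nr_buffers = 100;
	static int nfract = 60;
	static int nr_free_pages = 10, min_free_pages = 20;

	static void wakeup_bdflush(int wait)
	{
		printf("wakeup_bdflush(%d): flush dirty buffers, sleep\n", wait);
	}

	static void grow_buffers_atomic(void)
	{
		printf("grow_buffers(GFP_ATOMIC): last-resort growth\n");
	}

	int main(void)
	{
		/* Dirty buffers should not overtake: wake bdflush (and
		   sleep, so kswapd can do its work) when dirty+locked
		   buffers exceed nfract percent of all buffers, or when
		   free pages run low. */
		if ((nr_dirty + nr_locked) > (nr_buffers * nfract / 100) ||
		    nr_free_pages < min_free_pages)
			wakeup_bdflush(1);

		/* Keep atomic growth as a working fallback instead of
		   dropping it entirely. */
		grow_buffers_atomic();
		return 0;
	}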

Comments?

Werner

> I would avoid calling GFP_ATOMIC in grow_buffers() altogether, and reserve
> those last few pages only for irq handler clients, for buffer heads used
> for swapping requests in 2.0.x, etc.
>
> The buffer cache can potentially claim most of the system memory; declaring
> those last few pages "completely off limits" for it would not affect it very
> much, but would greatly help the other sub-systems which desperately depend
> on them.
>
> Using GFP_ATOMIC while avoiding a wakeup_bdflush(1) call on each cycle
> can result in zero free pages under the following two conditions:
>
> -- The 2.0.x kernels contain the following in find_candidate():
>
>	if (buffer_locked(bh) && bh->b_list == BUF_LOCKED ...) {
>		/* Buffers are written in the order they are placed
>		   on the locked list. If we encounter a locked buffer
>		   here, this means that the rest of them are also locked */
>		(*list_len) = 0;
>		return NULL;
>	}
>
> This is incorrect. In the 2.0.x kernels, we can enter a
> situation in which most of the BUF_LOCKED list is reclaimable,
> but we will not be able to reclaim it since we have a single
> locked buffer in front of the list.
>
> The "nr_buffers_type[BUF_DIRTY] > 60%" condition will not be triggered
> in this case, as most of the buffers will be in the LOCKED
> list, and we will enter the GFP_ATOMIC calls.
>
> This was fixed in the 2.1.x kernels by traversing the entire list.
> I think that this fix potentially has a lot of CPU overhead,
> since many times the above "if we encounter a locked buffer,
> all of them are locked" argument *is* correct, and in those
> cases we will traverse more than 10000 elements on each cycle
> in search of a single buffer, only to reach failure.
>
> -- If the buffer cache is currently very small and the page
> cache is very big (for example, after tar cvf /dev/null
> /usr/), we can again enter the GFP_ATOMIC allocations.
>
> In that case, the effect of wakeup_bdflush(1) was not to really
> flush dirty buffers. Rather, the effect was simply to sleep for
> a while until kswapd() moves pages from the page cache into
> the free pages pool.
>
> Gadi
>

----------------------------------------------------------------------------
diff -urN linux-2.0.31-linus/fs/buffer.c linux/fs/buffer.c
--- linux-2.0.31-linus/fs/buffer.c Mon Aug 18 13:58:51 1997
+++ linux/fs/buffer.c Tue Aug 19 14:19:58 1997
@@ -543,14 +543,11 @@
static inline int can_reclaim(struct buffer_head *bh, int size)
{
if (bh->b_count ||
- buffer_protected(bh) || buffer_locked(bh))
+ buffer_protected(bh) ||
+ buffer_locked(bh) ||
+ mem_map[MAP_NR((unsigned long) bh->b_data)].count != 1 ||
+ buffer_dirty(bh))
return 0;
-
- if (mem_map[MAP_NR((unsigned long) bh->b_data)].count != 1 ||
- buffer_dirty(bh)) {
- /* WSH: don't attempt to refile here! */
- return 0;
- }

if (bh->b_size != size)
return 0;
@@ -559,13 +556,15 @@
}

/* find a candidate buffer to be reclaimed */
-static struct buffer_head *find_candidate(struct buffer_head *list,int *list_len,int size)
+static struct buffer_head *find_candidate(struct buffer_head *bh,
+ int *list_len, int size)
{
- struct buffer_head *bh;
+ int behind = 0;
+
+ if (!bh)
+ goto no_candidate;

- for (bh = list;
- bh && (*list_len) > 0;
- bh = bh->b_next_free, (*list_len)--) {
+ for (; (*list_len) > 0; bh = bh->b_next_free, (*list_len)--) {
if (size != bh->b_size) {
/* this provides a mechanism for freeing blocks
of other sizes, this is necessary now that we
@@ -575,21 +574,18 @@
break;
continue;
}
-
- if (buffer_locked(bh) &&
- (bh->b_list == BUF_LOCKED || bh->b_list == BUF_LOCKED1)) {
- /* Buffers are written in the order they are placed
- on the locked list. If we encounter a locked
- buffer here, this means that the rest of them
- are also locked */
- (*list_len) = 0;
- return NULL;
+ else if (buffer_locked(bh) &&
+ (bh->b_list == BUF_LOCKED || bh->b_list == BUF_LOCKED1)) {
+ if (behind++ > 10) {
+ (*list_len) = 0;
+ goto no_candidate;
+ }
}
-
- if (can_reclaim(bh,size))
- return bh;
+ else if (can_reclaim(bh,size))
+ return bh;
}

+no_candidate:
return NULL;
}

@@ -662,6 +658,13 @@
}
goto repeat;
}
+
+ /* Dirty buffers should not overtake, wakeup_bdflush(1) calls
+ bdflush and sleeps, therefore kswapd does his important work. */
+ if (((nr_buffers_type[BUF_DIRTY] + nr_buffers_type[BUF_LOCKED]) >
+ (nr_buffers * bdf_prm.b_un.nfract/100)) ||
+ (nr_free_pages < min_free_pages))
+ wakeup_bdflush(1);

/* Too bad, that was not enough. Try a little harder to grow some. */

@@ -672,19 +675,8 @@
};
}

-#if 0
- /*
- * In order to protect our reserved pages,
- * return now if we got any buffers.
- */
- if (free_list[BUFSIZE_INDEX(size)])
- return;
-
/* and repeat until we find something good */
- if (!grow_buffers(GFP_ATOMIC, size))
- wakeup_bdflush(1);
-#endif
- wakeup_bdflush(1);
+ grow_buffers(GFP_ATOMIC, size);

/* decrease needed even if there is no success */
needed -= PAGE_SIZE;
@@ -1717,7 +1709,7 @@
* dirty buffers, then make the next write to a
* loop device to be a blocking write.
* This lets us block--which we _must_ do! */
- if (ndirty == 0 && nr_buffers_type[BUF_DIRTY] > 0) {
+ if (ndirty == 0 && nr_buffers_type[BUF_DIRTY] > 0 && wrta_cmd != WRITE) {
wrta_cmd = WRITE;
continue;
}
@@ -1725,7 +1717,7 @@

/* If there are still a lot of dirty buffers around, skip the sleep
and flush some more */
- if(nr_buffers_type[BUF_DIRTY] <= nr_buffers * bdf_prm.b_un.nfract/100) {
+ if(ndirty == 0 || nr_buffers_type[BUF_DIRTY] <= nr_buffers * bdf_prm.b_un.nfract/100) {
wake_up(&bdflush_done);
current->signal = 0;
interruptible_sleep_on(&bdflush_wait);