Some questions and discussion about fs buffers (bread brelse) and mm

pisa (pisa@waltz.felk.cvut.cz)
Sun, 17 Aug 1997 02:33:35 +0000 (GMT)


Hi everybody,

I hope that my notes are not just a waste of your time.
The following notes are not meant as criticism. I like Linux a lot and hope
that it is the OS of the future, but I want it to be as good as possible.
I am cooperating with Frank Gockel on his dmsdosfs (DoubleSpace,
DriveSpace and Stacker filesystem). I have written most of the Stacker-related
code, so I have an interest in kernel internals.

My notes relate to the 2.1.48 kernel.

breada, brelse
==============
Why does brelse need wait_on_buffer?
As I understand it, the main purpose is to call refile_buffer after
a buffer has been read or written by the block device driver
and end_request has been called.
But this behaviour can decrease performance.
When ll_rw_block is called for read-ahead purposes,
the caller has to wait for the block in brelse or has to
store the bh for next time (some filesystems then hold these bhs
for a long time -> worse cache auto-balancing).
Other filesystems use breada, but it must wait for all preread blocks
before returning. So breada can decrease performance on devices
with short seek times but slow data transfer, or with a large
read-ahead block count.

My opinion:
do not wait in brelse;
end_request needs to invoke refile_buffer some other way
(a rough sketch of the first idea follows this list):
- one solution is an atomically maintained list of blocks
  which need refiling, plus a wakeup of some daemon.
- a second way is to atomically increment a count of blocks which
  need refiling and wake up a daemon (maybe kflushd), which would scan
  the blocks on the request list and call refile_buffer until the count
  reaches zero. This can be improved by a per-device scan with the oldest
  block on the list scanned first. Some special devices could then have
  their own thread for this scan and request processing.
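
Here is a rough sketch of the first idea (only an illustration, not real
kernel code: the list head, the wait queue, the daemon and the reuse of
b_next_free as the link are all invented for this example):

static struct buffer_head *need_refile = NULL;
static struct wait_queue *refile_wait = NULL;

/* called from end_request() instead of relying on brelse() to wait */
static void queue_for_refile(struct buffer_head *bh)
{
	unsigned long flags;

	save_flags(flags);
	cli();                          /* atomic list update */
	bh->b_next_free = need_refile;
	need_refile = bh;
	restore_flags(flags);
	wake_up(&refile_wait);
}

/* main loop of the refile daemon (could also be merged into kflushd) */
static void refile_queued_buffers(void)
{
	struct buffer_head *bh;
	unsigned long flags;

	for (;;) {
		save_flags(flags);
		cli();
		bh = need_refile;
		if (bh)
			need_refile = bh->b_next_free;
		restore_flags(flags);
		if (!bh)
			break;
		refile_buffer(bh);      /* move bh to the proper list */
	}
}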

Some further discussion:
Why is locking needed for blocks being written?
It can only be good for some temporary consistency of the filesystem.
But even in the present state, several processes can hold a pointer to a bh
and modify its data at the time of writing, so nothing can be lost.
The only reasons for waiting are the time after allocation of a buffer while
waiting for it to become up-to-date in bread, and for the wanted bh in breada.

ll_rw_block for SMP
===================
It seems that multiple CPUs can step into fs code at once now;
nice, but there can be some race conditions.
ll_rw_block can be called by several CPUs at once, but the race is not handled
there; the bh lock is only applied in make_request. The ll_rw_block part of the
code can therefore be processed twice (probably no breakage, only a
performance decrease).

do_try_to_free_page
===================
It seems that state is not advanced if the function for that state succeeds.
This means that the memory of the current subsystem keeps being reduced as
long as any memory can be freed from it. I assume balancing was the intent of
this function, so the following code is better:

	case X:
		state++;
		if (functionX()) return 1;
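
A minimal sketch of the whole loop, assuming the usual 2.1.x structure of
do_try_to_free_page (a static state rotating over shrink_mmap, shm_swap and
swap_out; argument lists are simplified here, so this is only an illustration
of where the state++ should go):

static int do_try_to_free_page_sketch(int priority)
{
	static int state = 0;
	int i = priority;

	switch (state) {
		do {
		case 0:
			state = 1;              /* advance before the test, */
			if (shrink_mmap(i))     /* so a success here does   */
				return 1;       /* not pin state to 0       */
		case 1:
			state = 2;
			if (shm_swap(i))
				return 1;
		default:
			state = 0;
			if (swap_out(i))
				return 1;
			i--;
		} while (i >= 0);
	}
	return 0;
}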

Question:
We need our own cache, which needs to be balanced against the usual memory
caches. I have not found such functionality in do_try_to_free_page or
kmem_cache_reap; the cases seem to be hard-coded. We need our own
try-to-free function to be called, but the code must also compile as a
loadable module. Something like register_try_to_free_function
and unregister_try_to_free_function would be nice.
A better solution could be to register a memory-eater subsystem.
Such a subsystem would keep a structure with the following members up to date:

typedef struct mem_eater_subsystem_s {
	struct mem_eater_subsystem_s *next;
	long LRU_last_access;         /* time of last access to the
	                                 next candidate for freeing */
	int reget_time_cost;          /* cost of rebuilding that candidate */
	int reget_probability;        /* probability of reuse of the candidate */
	int (*try_to_free)(int pri);  /* subsystem function */
} mem_eater_subsystem_t;

do_try_to_free_page should then use these structures to balance
memory between the subsystems.
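
A hypothetical registration interface for such subsystems could look roughly
like the sketch below; the function names and the simple cli()-protected list
are only illustrative, nothing like this exists in the kernel:

static mem_eater_subsystem_t *mem_eaters = NULL;

void register_mem_eater(mem_eater_subsystem_t *e)
{
	unsigned long flags;

	save_flags(flags);
	cli();                          /* atomic list update */
	e->next = mem_eaters;
	mem_eaters = e;
	restore_flags(flags);
}

void unregister_mem_eater(mem_eater_subsystem_t *e)
{
	mem_eater_subsystem_t **p;
	unsigned long flags;

	save_flags(flags);
	cli();
	for (p = &mem_eaters; *p; p = &(*p)->next)
		if (*p == e) {
			*p = e->next;
			break;
		}
	restore_flags(flags);
}

do_try_to_free_page could then walk the mem_eaters list, pick the entry whose
LRU_last_access, reget_time_cost and reget_probability make it the cheapest
victim, and call its try_to_free(pri).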

DMA allocation problem
======================
I do not know if the following feature of almost all Pentium boards is known.
The DMA controller can be configured into scatter-gather mode. The maximum
transfer size is then 16 MB, and the memory addresses are defined in a
descriptor block. Only the 32-bit physical address of this block is stored
in the DMA controller.

Every descriptor is 64 bits long and has the following structure (bit
positions):
  63      EOL - when 1, this is the last descriptor
  62..56  reserved
  55..32  word counter
  31..0   starting address of a contiguous block

The system only needs to fill this table with the physical addresses
and sizes of the contiguous blocks. The last block has to have EOL=1.
So there is no need for special handling of DMA memory on newer
boards.

From the INTEL 82374EB, 82378IB documentation.
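
As an illustration, one descriptor and the filling of the table could look
roughly like this in C (the struct and macro names are mine, only the bit
layout comes from the documentation above):

#include <linux/types.h>

struct sg_descriptor {
	__u32 addr;             /* bits 31..0:  physical start of the block */
	__u32 count_eol;        /* bits 55..32: word counter,
	                           bits 62..56: reserved,
	                           bit  63:     EOL (last descriptor) */
};

#define SG_EOL          0x80000000U
#define SG_WORDS(n)     ((n) & 0x00ffffffU)

/* Fill the table from a list of contiguous block addresses and sizes;
 * the DMA controller is then given only the 32-bit physical address
 * of desc[0]. */
static void fill_sg_table(struct sg_descriptor *desc,
			  unsigned long *block_addr,
			  unsigned long *block_words, int nblocks)
{
	int i;

	for (i = 0; i < nblocks; i++) {
		desc[i].addr = block_addr[i];
		desc[i].count_eol = SG_WORDS(block_words[i]);
	}
	desc[nblocks - 1].count_eol |= SG_EOL;  /* last block: EOL = 1 */
}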

By the way, I have ported kdebug to 2.1.xx,
but I cannot contact the maintainer. If you are interested, I will
send it to you. I have some problems getting gdb's add-symbol-file
to fix up symbols for loadable modules.

Happy Linux session,
Pavel
pisa@cmp.felk.cvut.cz

P.S.: Excuse my English and typos.
Please send a copy of any answer to my address too.