Re: New partition type?

Riley Williams (rhw@MemAlpha.CX)
Sun, 9 May 1999 16:22:16 +0100 (GMT)


Hi Keith.

>> Personally, I would see oops-handling as a system-dependant
>> config option, with the following as possibilities:

>> 1. Write the oops to a dedicated oops partition.

>> If it's of interest, I'd be willing to have a go at writing
>> the driver for option (1) above, with the aim of making it
>> safe for use in most oops scenarios.

> OK, I'll bite. How do you plan to write to a disk partition
> after an Oops that has hung the machine? The hang can be caused
> by interrupts being disabled, by a lock deadlock, by bottom
> halves unable to run and probably some other causes as well.

The problem's more complex than that, so please refer to the analysis
I did in my last email to this list on the subject - the one between
the one you quote and this one. I've summarised part of it below as
well, but there's more in the previous analysis...

> IMHO, any attempt to write Oops to disk must be done outside the
> normal kernel I/O mechanisms. If your I/O uses any interrupts,
> gets any locks or does any bh processing, expect your dump to
> hang as well. Safely writing an Oops is all very well but any
> new mechanism needs to be reliable as well.

Precisely.

Also note that the proposed system, which I have christened SysLogFS
(for System Log File System), would carry much more than just oops
reports. As a result, it would be in pretty much constant use while
the kernel is running, so performance would be an issue here. I would
therefore see it operating in two distinct modes:

1. When doing normal system logging before an oops has occurred,
use the normal read/write/fseek functions for performance, but
ensure that the data makes its way out to disk as quickly as
possible after being sent there.

2. When dealing with an oops, use direct disk reads/writes with
no interrupts or the like. At this point, performance becomes
irrelevant, BUT we must ensure that any previous logged data
that hasn't yet been written to disk gets written out as well.

In addition (and not mentioned in my previous emails), each such
message would need to be timestamped to be useful, so there would need
to be some SAFE means of determining the current time AFTER an oops
has occurred.

Like I said in my previous email, I would see the partition being used
as a circular buffer, so would propose the following structure for it:

Q> Offset  Bytes  Description        Value
Q> ~~~~~~  ~~~~~  ~~~~~~~~~~~        ~~~~~
Q>      0      8  Signature          "SysLogFS"
Q>      8      2  Endian detector    0x55AA
Q>     10      2  Version - Major    0x0000
Q>     12      2  Version - Minor    0x0000
Q>     14      2  Version - Release  0x0000
Q>     16      8  Date last written  (Initially zero)
Q>     24      8  Read offset        (Initially zero)
Q>     32      8  Write offset       (Initially zero)
Q>     40     88  Reserved           (All 0x00)
Q>    128   rest  Buffer area        (Initially zero)
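
As a rough illustration only, the 128-byte header laid out above could be
expressed as a C structure along the following lines. The struct, macro
and function names are hypothetical and not part of the proposal, the
sketch is plain user-space C rather than kernel code, and reading the
endian detector as "0x55AA stored in the writer's native byte order" is
an assumption about how that field is meant to be used.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>
#include <string.h>

#define SYSLOGFS_SIGNATURE    "SysLogFS"  /* 8 bytes on disk, no NUL */
#define SYSLOGFS_ENDIAN_MARK  0x55AAu     /* as written by the host */
#define SYSLOGFS_ENDIAN_SWAP  0xAA55u     /* same mark, bytes swapped */
#define SYSLOGFS_HEADER_SIZE  128u

/* Hypothetical in-memory image of the first 128 bytes of the partition. */
struct syslogfs_header {
    char     signature[8];   /* offset  0 */
    uint16_t endian;         /* offset  8 */
    uint16_t ver_major;      /* offset 10 */
    uint16_t ver_minor;      /* offset 12 */
    uint16_t ver_release;    /* offset 14 */
    uint64_t last_written;   /* offset 16, initially zero */
    uint64_t read_offset;    /* offset 24, initially zero */
    uint64_t write_offset;   /* offset 32, initially zero */
    uint8_t  reserved[88];   /* offset 40, all 0x00 */
};

/* Fail the build if the compiler's layout differs from the table. */
static_assert(sizeof(struct syslogfs_header) == SYSLOGFS_HEADER_SIZE,
              "header must be exactly 128 bytes");
static_assert(offsetof(struct syslogfs_header, endian) == 8,
              "endian detector must sit at offset 8");
static_assert(offsetof(struct syslogfs_header, last_written) == 16,
              "date last written must sit at offset 16");
static_assert(offsetof(struct syslogfs_header, reserved) == 40,
              "reserved area must start at offset 40");

/* Fill in a freshly formatted header: signature, endian mark, zeros. */
static void syslogfs_header_init(struct syslogfs_header *h)
{
    memset(h, 0, sizeof(*h));
    memcpy(h->signature, SYSLOGFS_SIGNATURE, sizeof(h->signature));
    h->endian = SYSLOGFS_ENDIAN_MARK;
}

/*
 * Returns 1 if the header was written in this host's byte order,
 * -1 if it was written in the opposite byte order,
 *  0 if it does not look like a SysLogFS header at all.
 */
static int syslogfs_header_check(const struct syslogfs_header *h)
{
    if (memcmp(h->signature, SYSLOGFS_SIGNATURE, sizeof(h->signature)) != 0)
        return 0;
    if (h->endian == SYSLOGFS_ENDIAN_MARK)
        return 1;
    if (h->endian == SYSLOGFS_ENDIAN_SWAP)
        return -1;
    return 0;
}
```

The compile-time checks are there because the on-disk offsets matter more
than the field names: if a compiler inserted padding, the build would
fail rather than silently writing a different layout.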

Basically, the first 128 bytes of the area are reserved for details
relating to the contents of the log, and the rest of the area is used
for the actual text to be buffered. It retains the most recent entries
made, with a NUL byte following the last entry therein.
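
To make the circular-buffer behaviour concrete, here is a minimal
user-space sketch of appending an entry so that the newest text
overwrites the oldest and a NUL byte always follows the last entry. It
models the buffer area as an in-memory array; the names are hypothetical,
and the handling of the read offset and of over-long entries (only the
tail is kept) is an assumption, since the proposal does not specify
either.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model of the buffer area after the header. */
struct syslogfs_ring {
    uint8_t *buf;           /* start of the buffer area */
    uint64_t size;          /* size of the buffer area in bytes */
    uint64_t write_offset;  /* where the next entry starts (holds the NUL) */
};

/*
 * Append len bytes of msg, wrapping at the end of the buffer area,
 * then store a NUL after the last byte written. The NUL is not
 * skipped over, so the next append overwrites it.
 */
static void syslogfs_ring_append(struct syslogfs_ring *r,
                                 const char *msg, size_t len)
{
    assert(r->buf != NULL && r->size >= 2);

    /* Assumption: an entry larger than the buffer keeps only its tail. */
    if (len > r->size - 1) {
        msg += len - (size_t)(r->size - 1);
        len = (size_t)(r->size - 1);
    }

    for (size_t i = 0; i < len; i++) {
        r->buf[r->write_offset] = (uint8_t)msg[i];
        r->write_offset = (r->write_offset + 1) % r->size;
    }

    r->buf[r->write_offset] = 0;  /* terminator after the newest entry */
}
```

With an 8-byte buffer area, appending "abcde" and then "fghi" leaves the
final 'i' at offset 0 and the NUL at offset 1, which is the wrap-around
the proposal relies on. An oops-time writer would need the same logic
without relying on interrupts or locks, which this sketch does not
attempt to address.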

Allowing for typical drive parameters, here's what a single cylinder
partition would allow, assuming 512 byte sectors:

Q> Heads  Sectors  Cylinder Size  Buffer Size  Kilobytes
Q> ~~~~~  ~~~~~~~  ~~~~~~~~~~~~~  ~~~~~~~~~~~  ~~~~~~~~~
Q>    16       16        131,072      130,944    127.875
Q>    16       32        262,144      262,016    255.875
Q>    16       63        516,096      515,968    503.875
Q>    32       63      1,032,192    1,032,064  1,007.875
Q>    64       63      2,064,384    2,064,256  2,015.875
Q>   128       63      4,128,768    4,128,640  4,031.875
Q>   255       63      8,225,280    8,225,152  8,032.375
Q> ~~~~~  ~~~~~~~  ~~~~~~~~~~~~~  ~~~~~~~~~~~  ~~~~~~~~~

As you can see, even with the basic 16 heads and 16 sectors per track,
there is already well over 100k of buffer per cylinder, and on modern
high capacity drives, there's nearly 8M per cylinder, both of which
are pretty decent sized logs for most uses, although multi-cylinder
partitions would of course be supported.
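
The figures in the table follow from heads x sectors x 512 bytes per
cylinder, less the 128-byte header. A small sketch of that arithmetic,
extended to multi-cylinder partitions, might look like this (the function
and macro names are hypothetical):

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

#define SYSLOGFS_SECTOR_SIZE  512u
#define SYSLOGFS_HEADER_SIZE  128u

/* The header has to fit inside the first sector of the partition. */
static_assert(SYSLOGFS_HEADER_SIZE <= SYSLOGFS_SECTOR_SIZE,
              "header must fit in one sector");

/*
 * Bytes available for log text in a partition spanning the given
 * number of cylinders: total partition size minus the header.
 * Returns 0 for a degenerate geometry with no sectors at all.
 */
static uint64_t syslogfs_buffer_bytes(uint32_t heads, uint32_t sectors,
                                      uint32_t cylinders)
{
    uint64_t total = (uint64_t)heads * sectors * cylinders
                     * SYSLOGFS_SECTOR_SIZE;

    return total > SYSLOGFS_HEADER_SIZE ? total - SYSLOGFS_HEADER_SIZE : 0;
}
```

For example, syslogfs_buffer_bytes(16, 16, 1) gives 130,944 and
syslogfs_buffer_bytes(255, 63, 1) gives 8,225,152, matching the first
and last rows of the table.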

> The more I look at this problem, the more I think that the only
> reliable option on a single machine is one that stores the Oops
> text in memory that does not get overwritten on reboot.

Whilst that would be a solution on some systems, not all systems have
suitable memory for that. As a result, there HAS to be a selection of
methods available, and I propose this as one of them, with the above
as another where available...

Best wishes from Riley.

+----------------------------------------------------------------------+
| There is something frustrating about the quality and speed of Linux |
| development, ie., the quality is too high and the speed is too high, |
| in other words, I can implement this XXXX feature, but I bet someone |
| else has already done so and is just about to release their patch. |
+----------------------------------------------------------------------+
* ftp://ftp.MemAlpha.cx/pub/rhw/Linux
* http://www.MemAlpha.cx/kernel.versions.html
