[RFC] Implementing tape statistics
From: Seymour, Shane M
Date: Wed Mar 20 2013 - 20:54:15 EST
Before you start reading, I apologize in advance for the length of this email. The length is important, though, to make sure all of the arguments and counter-arguments are represented when asking for feedback about how tape statistics would best be implemented.
There is some demand from users of the enterprise distributions, in particular those with large-scale tape libraries, for the provision of tape I/O statistics. Interfaces for obtaining tape I/O statistics for use by utilities such as sar are a feature present in most commercial UNIX distributions.
Several patches have been produced and presented to the linux-scsi mailing list, but it seems there are differences of opinion that cannot be reconciled, and hence none of the proposed solutions has so far been accepted. I have therefore decided to post to the wider kernel list to see if we can reach consensus on which of these (or another approach) should be adopted.
No patches are presented in this email for the sake of brevity; it contains only a summary of each implementation and its consequences, along with discussion points. Note that this is not an attempt to work around the feedback received on the linux-scsi mailing list, but an attempt to get a wider consensus on what would be an acceptable implementation of a tape statistics interface.
Option#1: Provide device-based stats via sysfs:
/sys/class/scsi_tape/stNN/stats (where NN is the tape device instance number)
The stats file provides the following in a one-line entry suitable for a single fgets() and processing by sscanf():
/* Tape stats */
u64 read_byte_cnt;   /* bytes read since tape open */
u64 write_byte_cnt;  /* bytes written since tape open */
u64 in_flight;       /* number of I/Os in flight */
u64 read_cnt;        /* count of read requests since tape open */
u64 write_cnt;       /* count of write requests since tape open */
u64 other_cnt;       /* count of other requests since tape open,
                        either implicit (from driver) or from
                        user space via ioctl */
u64 read_ticks;      /* ticks spent completing read requests */
u64 write_ticks;     /* ticks spent completing write requests */
u64 io_ticks;        /* ticks spent doing any I/O */
u64 stamp;           /* holds time request was queued */
The file contents are almost the same as the stat file for disks, except that the merge statistics are always 0 (tape I/O is sequential, so merged I/Os make no sense) and the in_flight value is almost always 0 or 1, since the st module only ever has a single read or write outstanding. An additional field is added to the end of the file: a count of other I/Os. These may be commands issued by the driver within the kernel (e.g. rewind) or via an ioctl from user space. For tape drives, some commands involving actions like tape movement can take a long time, so it is important to track SCSI requests other than reads and writes sent to the drive; when delays happen, they can then be explained.
With some future patches to iostat this data will be reported; an example set of data is shown below (the extra other_cnt data allows reporting an average wait across all I/Os (a_await) and other I/Os per second (oio/s)):
tape: wr/s KiB_write/s rd/s KiB_read/s r_await w_await a_await oio/s
st0 186.50 46.75 0.00 0.00 0.000 0.276 0.276 0.00
st1 186.00 93.00 0.00 0.00 0.000 0.180 0.180 0.00
st2 0.00 0.00 181.50 45.50 0.347 0.000 0.347 0.00
st3 0.00 0.00 183.00 45.75 0.224 0.000 0.224 0.00
## This is our preferred method of implementation since it is efficient for both kernel and user space (and requires the fewest code changes). It also matches the interface already presented by the disk block subsystem, for example:
# grep . /sys/block/sd*/stat
/sys/block/sda/stat: 27351 6890 609272 228129 36810 920727 7660304 1333950 0 556889 1562009
/sys/block/sdb/stat: 2369 6762 18890 39003 0 0 0 0 0 4059 39002
## SCSI maintainers' counter-point: "I'm afraid we can't do it the way you're proposing. Files in sysfs must conform to the one-value-per-file rule (so we avoid the ABI nastiness that plagues /proc). You can create a stat directory with a bunch of files, but not a single file that gives all values."
## My counter:
I can only assume it (the sysfs block subsystem stat file) was implemented this way for the sake of efficiency, e.g. to avoid a huge number of file open/read/close calls in sar/iostat. It is not unusual for us to see over a thousand block devices on enterprise servers; multiply that by the number of entries above and you would be talking about 9 x block-dev-count file operations per iostat read iteration. Granted, for tapes we typically see nothing like this number, but the patch simply follows the precedent set by the block device.
The sysfs.txt docs say:
Attributes can be exported for kobjects in the form of regular files in the filesystem. Sysfs forwards file I/O operations to methods defined for the attributes, providing a means to read and write kernel attributes.
Attributes should be ASCII text files, preferably with only one value per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type.
Nevertheless, to address the one-record-per-sysfs-file concern, a prototype (Option#2) with the stats broken out as follows has also been tested:
# cd /sys/class/scsi_tape/st0/device/statistics
-r--r--r--. 1 root root 4096 Mar 1 09:33 in_flight
-r--r--r--. 1 root root 4096 Mar 1 09:33 io_ms
-r--r--r--. 1 root root 4096 Mar 1 09:33 other_cnt
-r--r--r--. 1 root root 4096 Mar 1 09:33 read_block_cnt
-r--r--r--. 1 root root 4096 Mar 1 09:33 read_byte_cnt
-r--r--r--. 1 root root 4096 Mar 1 09:33 read_cnt
-r--r--r--. 1 root root 4096 Mar 1 09:33 read_ms
-r--r--r--. 1 root root 4096 Mar 1 09:33 write_block_cnt
-r--r--r--. 1 root root 4096 Mar 1 09:33 write_byte_cnt
-r--r--r--. 1 root root 4096 Mar 1 09:33 write_cnt
-r--r--r--. 1 root root 4096 Mar 1 09:33 write_ms
## I dislike this breakout because it adds complexity to the kernel st.ko driver (some 300 extra lines of code) as well as making life more complicated and less efficient in user space. The st module maintainer is against this option being implemented as the one and only solution; they would prefer Option#1 or Option#3.
Option#3: Provide Option#1 and/or Option#2 via debugfs, where the structure is less restricted.
## This is a compromise. It seems that almost anything is acceptable here (few constraints), so either option could be used. However, we dislike the notion of using debugfs for several reasons:
* we are not presenting internal technical info for developers; this is primarily device-based counters/statistics to be used by apps such as iostat/sar in the same way that /sys/block/sd*/stat can be used now
* debugfs IS typically included in the enterprise distributions but, unlike sysfs, is not mounted as a matter of course; hence, for the user-space apps to work, users would have to take action to ensure that debugfs is mounted
* more code/complexity has to be added to st.ko to support this implementation, in either form, more than even for Option#2
Two additional interfaces are also proposed. The first is a file containing an integer count of the maximum number of tape drives connected to the system since boot. The value is incremented for each device discovered in st_probe(). The purpose of this file is to provide an upper-bound hint to help user-space apps iterate over the stNN devices (e.g. for (stN = 0; stN < drives; stN++)). The value is not decremented when a drive is removed; it is left to user space to detect missing (removed) devices, remembering that in SAN-based tape libraries devices can come and go and there could be many dozens of tape drives. This should help with user-space coding efficiency.
The second is a system-wide boolean to control the behaviour of the individual tape stats, i.e. whether they should be reset to zero upon device open (by default they are not).
Again these could be presented under debugfs.
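For illustration only, toggling such a reset-on-open boolean from user space could be as simple as the following; the path shown is hypothetical, as the final location depends on which option is accepted:

```shell
# Hypothetical path; the final location depends on the accepted option.
echo 1 > /sys/class/scsi_tape/stats_reset_on_open   # reset stats on each open
cat /sys/class/scsi_tape/stats_reset_on_open        # read current setting
```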
If the only way we can gain acceptance is with Option#3, then we will concede and reimplement using debugfs, but I would appreciate further comments on the above proposals.