Re: [PATCH v2][RESEND] seq_file: don't set read position for invalid iterator

From: Miklos Szeredi
Date: Wed Nov 02 2016 - 05:11:57 EST

On Mon, Oct 31, 2016 at 10:32:21AM +0100, Tomasz Majchrzak wrote:
> On Wed, Oct 26, 2016 at 11:17:13AM +0200, Miklos Szeredi wrote:
> > On Wed, Oct 12, 2016 at 2:07 PM, Tomasz Majchrzak
> > <tomasz.majchrzak@xxxxxxxxx> wrote:
> > > If kernfs file is empty on a first read, successive read operations
> > > using the same file descriptor will return no data, even when data is
> > > available. Default kernfs 'seq_next' implementation advances iterator
> > > position even when next object is not there. Kernfs 'seq_start' for
> > > following requests will not return iterator as position is already on
> > > the second object.
> > >
> > > This bug doesn't allow to monitor badblocks sysfs files from MD raid.
> > > They are initially empty but if data appears at some stage, userspace is
> > > not able to read it. It doesn't affect any released applications but it
> > > is necessary for upcoming bad block support for external metadata in MD
> > > raid.
> >
> > What is the expectation from the userspace application? Should
> > seq_file support "tail -f" as more data is added to the file? AFAICS
> > this patch doesn't address that generally, just the empty->nonempty
> > transition (with the single record case).
> Yes, it fixes only empty-nonempty transition. I believe it already works
> fine ("tail -f") if the file cursor is in the middle of the file,

"tail -f" works if there are multiple records. It doesn't work for the "single"
style of seq_file used by sysfs.

> however
> mdmon never performs such operation - it "acknowledges" entry read from the
> start of one sysfs file (unacknowledged_bad_blocks) via other sysfs file
> (bad_blocks) and reads the sysfs file again from the start (there is new
> data at position 0 at this stage).
> My patch just resolves a regression introduced by a fix for multiple record
> file: 4cdfe84b514 ("deal with the first call of ->show() generating no
> output").

The above commit (from 2.6.27) didn't change the beavior relative to sysfs
files, AFAICS. Before that commit the index was incremented if the first
iterator was empty, just like after the patch.

> It makes it consistent with rest of the code in seq_file
> (seq_read, traverse) - new iterator position is ignored if valid iterator
> has not been returned.

Your patch doesn't seem to make the behavior consistent, it just special cases
the last record being empty. What if the last two (three, etc) records are
empty? It would be consistent if m->index always pointed to the first record
matching the given file position, for example.

But I doubt that we actually need to do that. For example just special casing
the zero offset (always translating to zero index) would be conceptually simpler
and equivalent to your patch for the sysfs case.

But see below. I still don't see what you gain by not doing the open/read/close
when polling the badblocks file.

> > Why does userspace app not do open+read+close each time it polls the
> > badblocks file?
> This excerpt from mdadm/mdmon-design.txt file explains it:
> "
> The 'monitor' has the primary role of monitoring the array for
> important state changes and updating the metadata accordingly. As
> writes to the array can be blocked until 'monitor' completes and
> acknowledges the update, it much be very careful not to block itself.
> In particular it must not block waiting for any write to complete else
> it could deadlock. This means that it must not allocate memory as
> doing this can require dirty memory to be written out and if the
> system choose to write to the array that mdmon is monitoring, the
> memory allocation could deadlock.
> So 'monitor' must never allocate memory and must limit the number of
> other system call it performs. It may:
> - use select (or poll) to wait for activity on a file descriptor
> - read from a sysfs file descriptor
> - write to a sysfs file descriptor
> - write the metadata out to the block devices using O_DIRECT
> - send a signal (kill) to the manager thread
> It must not e.g. open files or do anything similar that might allocate
> resources.
> "

seq_read() *will* allocate memory; see seq_buf_alloc().