Re: [RFC] f_pos in readdir() (was Re: [RFC][PATCH] vfs: always protectdiretory file->fpos with inode mutex)

From: Li Zefan
Date: Mon Feb 25 2013 - 01:10:47 EST

>> We can fix this bug in each filesystem, but can't we just make sure i_mutex is
>> acquired in lseek(), read(), write() and readdir() for directory file operations?
>> (the patch is for demonstration only)
> No. This is a very massive overkill. If anything, we want to *reduce* the
> amount of time we hold ->i_mutex in that area.
> There are several bugs mixed here:
> * disappearing exclusion between readdir and lseek for directories.
> Bad, since offset validation suddenly needs to be redone every time we look
> at ->f_pos in ->readdir() instances *and* since ->readdir() itself updates
> position, often by file->f_pos += something.
> * write(2) doing "get a copy of f_pos, try and fail ->write(),
> put that copy back". With no locking whatsoever. What we get is a f_pos
> value reverting to what it used to be at some random earlier point. Makes
> life really nasty for everything that updates ->f_pos, obviously.
> * read(2) doing the same, *and* some directories apparently having
> ->read() now.
> ->readdir() part of that would be the simplest one - we need to stop messing
> with ->f_pos and just pass an address of its copy, like we do for ->read()
> Preserving the method prototype is not worth it and this particular
> method has needed an overhaul of calling conventions for many reasons.
> The issue with write(2) and friends is potentially nastier. I'm looking
> at the ->f_pos users right now, and while the majority are ->readdir()
> and ->llseek() instances, there's some stuff beyond that. Some of that is
> done with struct file opened kernel-side and not accessible to userland;
> those are safe (and often really ugly - see drivers/media/pci/cx25821/
> hits for f_pos). Some are simply wrong - e.g. dev_mem_read()
> (in drivers/net/wireless/ti/wlcore/debugfs.c) ignores *ppos value and uses
> file->f_pos instead; wrong behaviour for ->read() instance. I'm about
> 20% through the list; so far everything seems to be possible to deal with
> (especially if we add a couple of helpers for common lseek variants and
> use existing generic_file_llseek_size()), so it might turn out to be
> not a serious issue, but it's a potential minefield. Hell knows...
> As for ->readdir(), I'd like to resurrect an old proposal to change the ABI
> of that sucker. Quoting the thread from 4 years ago:
> ====
> As for the locking... I toyed with one idea for a while: instead of passing
> a callback and one of many completely different structs, how about a common
> *beginning* of a struct, with callback stored in it along with several
> common fields? Namely,
> * count of filldir calls already made
> * pointer to file in question
> * "are we done" flag
> And instead of calling filldir(buf, ...) ->readdir() would call one of several
> helpers:
> * readdir_emit(buf, ...) - obvious
> * readdir_relax(buf) - called in a spot convenient for dropping
> and retaking lock; returns whether we need to do revalidation.
> * readdir_eof(buf) - end of directory
> * maybe readdir_dot() and readdir_dotdot() - those are certainly
> common enough
> That's the obvious flagday stuff, but since we need to give serious beating
> to most of the instances anyway... Might as well consider something in
> that direction.
> ====
> Back then it didn't go anywhere, but if we really go for change of calling
> conventions (passing a pointer to copy of ->f_pos), it would make a lot of
> sense, IMO. Note that ->i_mutex contention could be seriously relieved that
> way - e.g. ext* would just call readdir_relax() at the block boundaries,
> since those locations are always valid there, etc.

So there will be no lock to protect f_pos in read/write/llseek in your proposal.
Do we need to care about reading/writing fpos in 32 bit machine is not atomic?

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at