Re: [PATCH] fs: fix data races on inode->i_flctx
From: Dmitry Vyukov
Date: Mon Sep 21 2015 - 07:54:26 EST
No hurry on my side.
Thanks
On Mon, Sep 21, 2015 at 1:44 PM, Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
> On Mon, 21 Sep 2015 13:38:45 +0200
> Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
>> Yes. That processor is Alpha and that's documented in DATA DEPENDENCY
>> BARRIERS section of Documentation/memory-barriers.txt.
>>
>>
>
> Ok, thanks for the explanation. Patch looks fine to me. I'll go ahead
> and merge it for v4.4. Let me know though if you think this needs to go
> in sooner.
>
> Thanks,
> Jeff
>
>>
>> On Mon, Sep 21, 2015 at 1:17 PM, Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
>> > On Mon, 21 Sep 2015 13:00:04 +0200
>> > Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>> >
>> >> On Mon, Sep 21, 2015 at 12:56 PM, Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
>> >> > On Mon, 21 Sep 2015 12:53:40 +0200
>> >> > Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>> >> >
>> >> >> On Mon, Sep 21, 2015 at 12:50 PM, Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
>> >> >> > On Mon, 21 Sep 2015 09:43:06 +0200
>> >> >> > Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>> >> >> >
>> >> >> >> locks_get_lock_context() uses cmpxchg() to install i_flctx.
>> >> >> >> cmpxchg() is a release operation which is correct. But it uses
>> >> >> >> a plain load to load i_flctx. This is incorrect. Subsequent loads
>> >> >> >> from i_flctx can hoist above the load of i_flctx pointer itself
>> >> >> >> and observe uninitialized garbage there. This in turn can lead
>> >> >> >> to corruption of ctx->flc_lock and other members.
>> >> >> >>
>> >> >> >
>> >> >> > I don't get this bit. The i_flctx field is initialized to zero when the
>> >> >> > inode is allocated, and we only assign it with cmpxchg(). How would you
>> >> >> > ever see uninitialized garbage in there? It should either be NULL or a
>> >> >> > valid pointer, no?
>> >> >>
>> >> >> This is not about i_flctx pointer, this is about i_flctx object
>> >> >> contents (pointee).
>> >> >> Readers can see a non-NULL pointer, but read garbage from *i_flctx.
>> >> >>
>> >> >
>> >> > Again, I don't get it though. The cmpxchg happens after we've
>> >> > initialized the structure. Given that there are implicit full memory
>> >> > barriers before and after the cmpxchg(), how do you end up with another
>> >> > task seeing the pointer before it was ever initialized?
>> >> >
>> >> > The store should only happen after the initialization has occurred, and
>> >> > the loads in the other tasks shouldn't be able to see the results of
>> >> > that store until then. No?
>> >>
>> >> If a reader looks at things in the opposite order, then it does not
>> >> matter how you store them. Reader still can observe them in the wrong
>> >> order.
>> >> Memory barriers is always a game of two. Writer needs to do stores in
>> >> the correct order and reader must do reads in the correct order. A
>> >> single memory barrier cannot possible make any sense.
>> >>
>> >
>> > I get that, but...isn't there a data dependency there? In order for the
>> > reader to see bogus fields inside of i_flctx, it needs to be able to
>> > see a valid pointer in i_flctx...and in order for that to occur the
>> > store has to have occurred.
>> >
>> > I guess the concern is that the reader CPU could stumble across some
>> > memory that is eventually going to part of i_flctx prior to loading
>> > that pointer, and assume that its contents are still valid after the
>> > pointer store has occurred? Is that the basic scenario?
>> >
>> >>
>> >> >> > If that'st the case, then most of the places where you're adding
>> >> >> > smp_load_acquire are places that can tolerate seeing NULL there. e.g.
>> >> >> > you have racing LOCK_EX/LOCK_UN flock() calls in different tasks or
>> >> >> > something.
>> >> >> >
>> >> >> >> Documentation/memory-barriers.txt explicitly requires to use
>> >> >> >> a barrier in such context:
>> >> >> >> "A load-load control dependency requires a full read memory barrier".
>> >> >> >>
>> >> >> >
>> >> >> > The same file also says:
>> >> >> >
>> >> >> > "Any atomic operation that modifies some state in memory and returns information
>> >> >> > about the state (old or new) implies an SMP-conditional general memory barrier
>> >> >> > (smp_mb()) on each side of the actual operation (with the exception of
>> >> >> > explicit lock operations, described later). These include:
>> >> >> >
>> >> >> > ...
>> >> >> > /* when succeeds */
>> >> >> > cmpxchg();
>> >> >> > "
>> >> >> >
>> >> >> > If there is an implied smp_mb() before and after the cmpxchg(), how
>> >> >> > could the CPU hoist anything before that point?
>> >> >> >
>> >> >> > Note that I'm not saying that you're wrong, I just want to make sure I
>> >> >> > fully understand the problem before we go trying to fix it.
>> >> >> >
>> >> >> >> Use smp_load_acquire() in locks_get_lock_context() and in bunch
>> >> >> >> of other functions that can proceed concurrently with
>> >> >> >> locks_get_lock_context().
>> >> >> >>
>> >> >> >> The data race was found with KernelThreadSanitizer (KTSAN).
>> >> >> >>
>> >> >> >> Signed-off-by: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
>> >> >> >> ---
>> >> >> >> For the record, here is the KTSAN report:
>> >> >> >>
>> >> >> >> ThreadSanitizer: data-race in _raw_spin_lock
>> >> >> >>
>> >> >> >> Read at 0xffff8800bae50da0 of size 1 by thread 3060 on CPU 2:
>> >> >> >> [< inline >] __raw_spin_lock include/linux/spinlock_api_smp.h:158
>> >> >> >> [<ffffffff81ee37d0>] _raw_spin_lock+0x50/0x70 kernel/locking/spinlock.c:151
>> >> >> >> [< inline >] spin_lock include/linux/spinlock.h:312
>> >> >> >> [<ffffffff812e1187>] flock_lock_inode+0xb7/0x440 fs/locks.c:895
>> >> >> >> [<ffffffff812e16e6>] flock_lock_inode_wait+0x46/0x110 fs/locks.c:1871
>> >> >> >> [<ffffffff812e2814>] SyS_flock+0x224/0x23
>> >> >> >>
>> >> >> >> Previous write at 0xffff8800bae50da0 of size 8 by thread 3063 on CPU 8:
>> >> >> >> [<ffffffff81248628>] kmem_cache_alloc+0xd8/0x2f0 mm/slab.c:3420
>> >> >> >> [<ffffffff812ddba0>] locks_get_lock_context+0x60/0x140 fs/locks.c:213
>> >> >> >> [<ffffffff812e112a>] flock_lock_inode+0x5a/0x440 fs/locks.c:882
>> >> >> >> [<ffffffff812e16e6>] flock_lock_inode_wait+0x46/0x110 fs/locks.c:1871
>> >> >> >> [< inline >] flock_lock_file_wait include/linux/fs.h:1219
>> >> >> >> [< inline >] SYSC_flock fs/locks.c:1941
>> >> >> >> [<ffffffff812e2814>] SyS_flock+0x224/0x230 fs/locks.c:1904
>> >> >> >> [<ffffffff81ee3e11>] entry_SYSCALL_64_fastpath+0x31/0x95 arch/x86/entry/entry_64.S:188
>> >> >> >> ---
>> >> >> >> fs/locks.c | 63 +++++++++++++++++++++++++++++++++++---------------------------
>> >> >> >> 1 file changed, 36 insertions(+), 27 deletions(-)
>> >> >> >>
>> >> >> >> diff --git a/fs/locks.c b/fs/locks.c
>> >> >> >> index 2a54c80..316e474 100644
>> >> >> >> --- a/fs/locks.c
>> >> >> >> +++ b/fs/locks.c
>> >> >> >> @@ -205,28 +205,32 @@ static struct kmem_cache *filelock_cache __read_mostly;
>> >> >> >> static struct file_lock_context *
>> >> >> >> locks_get_lock_context(struct inode *inode, int type)
>> >> >> >> {
>> >> >> >> - struct file_lock_context *new;
>> >> >> >> + struct file_lock_context *ctx;
>> >> >> >>
>> >> >> >> - if (likely(inode->i_flctx) || type == F_UNLCK)
>> >> >> >> + /* paired with cmpxchg() below */
>> >> >> >> + ctx = smp_load_acquire(&inode->i_flctx);
>> >> >> >> + if (likely(ctx) || type == F_UNLCK)
>> >> >> >> goto out;
>> >> >> >>
>> >> >> >> - new = kmem_cache_alloc(flctx_cache, GFP_KERNEL);
>> >> >> >> - if (!new)
>> >> >> >> + ctx = kmem_cache_alloc(flctx_cache, GFP_KERNEL);
>> >> >> >> + if (!ctx)
>> >> >> >> goto out;
>> >> >> >>
>> >> >> >> - spin_lock_init(&new->flc_lock);
>> >> >> >> - INIT_LIST_HEAD(&new->flc_flock);
>> >> >> >> - INIT_LIST_HEAD(&new->flc_posix);
>> >> >> >> - INIT_LIST_HEAD(&new->flc_lease);
>> >> >> >> + spin_lock_init(&ctx->flc_lock);
>> >> >> >> + INIT_LIST_HEAD(&ctx->flc_flock);
>> >> >> >> + INIT_LIST_HEAD(&ctx->flc_posix);
>> >> >> >> + INIT_LIST_HEAD(&ctx->flc_lease);
>> >> >> >>
>> >> >> >> /*
>> >> >> >> * Assign the pointer if it's not already assigned. If it is, then
>> >> >> >> * free the context we just allocated.
>> >> >> >> */
>> >> >> >> - if (cmpxchg(&inode->i_flctx, NULL, new))
>> >> >> >> - kmem_cache_free(flctx_cache, new);
>> >> >> >> + if (cmpxchg(&inode->i_flctx, NULL, ctx)) {
>> >> >> >> + kmem_cache_free(flctx_cache, ctx);
>> >> >> >> + ctx = smp_load_acquire(&inode->i_flctx);
>> >> >> >> + }
>> >> >> >> out:
>> >> >> >> - return inode->i_flctx;
>> >> >> >> + return ctx;
>> >> >> >> }
>> >> >> >>
>> >> >> >> void
>> >> >> >> @@ -762,7 +766,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
>> >> >> >> struct file_lock_context *ctx;
>> >> >> >> struct inode *inode = file_inode(filp);
>> >> >> >>
>> >> >> >> - ctx = inode->i_flctx;
>> >> >> >> + ctx = smp_load_acquire(&inode->i_flctx);
>> >> >> >> if (!ctx || list_empty_careful(&ctx->flc_posix)) {
>> >> >> >> fl->fl_type = F_UNLCK;
>> >> >> >> return;
>> >> >> >> @@ -1203,7 +1207,7 @@ int locks_mandatory_locked(struct file *file)
>> >> >> >> struct file_lock_context *ctx;
>> >> >> >> struct file_lock *fl;
>> >> >> >>
>> >> >> >> - ctx = inode->i_flctx;
>> >> >> >> + ctx = smp_load_acquire(&inode->i_flctx);
>> >> >> >> if (!ctx || list_empty_careful(&ctx->flc_posix))
>> >> >> >> return 0;
>> >> >> >>
>> >> >> >> @@ -1388,7 +1392,7 @@ any_leases_conflict(struct inode *inode, struct file_lock *breaker)
>> >> >> >> int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>> >> >> >> {
>> >> >> >> int error = 0;
>> >> >> >> - struct file_lock_context *ctx = inode->i_flctx;
>> >> >> >> + struct file_lock_context *ctx;
>> >> >> >> struct file_lock *new_fl, *fl, *tmp;
>> >> >> >> unsigned long break_time;
>> >> >> >> int want_write = (mode & O_ACCMODE) != O_RDONLY;
>> >> >> >> @@ -1400,6 +1404,7 @@ int __break_lease(struct inode *inode, unsigned int mode, unsigned int type)
>> >> >> >> new_fl->fl_flags = type;
>> >> >> >>
>> >> >> >> /* typically we will check that ctx is non-NULL before calling */
>> >> >> >> + ctx = smp_load_acquire(&inode->i_flctx);
>> >> >> >> if (!ctx) {
>> >> >> >> WARN_ON_ONCE(1);
>> >> >> >> return error;
>> >> >> >> @@ -1494,9 +1499,10 @@ EXPORT_SYMBOL(__break_lease);
>> >> >> >> void lease_get_mtime(struct inode *inode, struct timespec *time)
>> >> >> >> {
>> >> >> >> bool has_lease = false;
>> >> >> >> - struct file_lock_context *ctx = inode->i_flctx;
>> >> >> >> + struct file_lock_context *ctx;
>> >> >> >> struct file_lock *fl;
>> >> >> >>
>> >> >> >> + ctx = smp_load_acquire(&inode->i_flctx);
>> >> >> >> if (ctx && !list_empty_careful(&ctx->flc_lease)) {
>> >> >> >> spin_lock(&ctx->flc_lock);
>> >> >> >> if (!list_empty(&ctx->flc_lease)) {
>> >> >> >> @@ -1543,10 +1549,11 @@ int fcntl_getlease(struct file *filp)
>> >> >> >> {
>> >> >> >> struct file_lock *fl;
>> >> >> >> struct inode *inode = file_inode(filp);
>> >> >> >> - struct file_lock_context *ctx = inode->i_flctx;
>> >> >> >> + struct file_lock_context *ctx;
>> >> >> >> int type = F_UNLCK;
>> >> >> >> LIST_HEAD(dispose);
>> >> >> >>
>> >> >> >> + ctx = smp_load_acquire(&inode->i_flctx);
>> >> >> >> if (ctx && !list_empty_careful(&ctx->flc_lease)) {
>> >> >> >> spin_lock(&ctx->flc_lock);
>> >> >> >> time_out_leases(file_inode(filp), &dispose);
>> >> >> >> @@ -1713,9 +1720,10 @@ static int generic_delete_lease(struct file *filp, void *owner)
>> >> >> >> struct file_lock *fl, *victim = NULL;
>> >> >> >> struct dentry *dentry = filp->f_path.dentry;
>> >> >> >> struct inode *inode = dentry->d_inode;
>> >> >> >> - struct file_lock_context *ctx = inode->i_flctx;
>> >> >> >> + struct file_lock_context *ctx;
>> >> >> >> LIST_HEAD(dispose);
>> >> >> >>
>> >> >> >> + ctx = smp_load_acquire(&inode->i_flctx);
>> >> >> >> if (!ctx) {
>> >> >> >> trace_generic_delete_lease(inode, NULL);
>> >> >> >> return error;
>> >> >> >> @@ -2359,13 +2367,14 @@ out:
>> >> >> >> void locks_remove_posix(struct file *filp, fl_owner_t owner)
>> >> >> >> {
>> >> >> >> struct file_lock lock;
>> >> >> >> - struct file_lock_context *ctx = file_inode(filp)->i_flctx;
>> >> >> >> + struct file_lock_context *ctx;
>> >> >> >>
>> >> >> >> /*
>> >> >> >> * If there are no locks held on this file, we don't need to call
>> >> >> >> * posix_lock_file(). Another process could be setting a lock on this
>> >> >> >> * file at the same time, but we wouldn't remove that lock anyway.
>> >> >> >> */
>> >> >> >> + ctx = smp_load_acquire(&file_inode(filp)->i_flctx);
>> >> >> >> if (!ctx || list_empty(&ctx->flc_posix))
>> >> >> >> return;
>> >> >> >>
>> >> >> >> @@ -2389,7 +2398,7 @@ EXPORT_SYMBOL(locks_remove_posix);
>> >> >> >>
>> >> >> >> /* The i_flctx must be valid when calling into here */
>> >> >> >> static void
>> >> >> >> -locks_remove_flock(struct file *filp)
>> >> >> >> +locks_remove_flock(struct file *filp, struct file_lock_context *flctx)
>> >> >> >> {
>> >> >> >> struct file_lock fl = {
>> >> >> >> .fl_owner = filp,
>> >> >> >> @@ -2400,7 +2409,6 @@ locks_remove_flock(struct file *filp)
>> >> >> >> .fl_end = OFFSET_MAX,
>> >> >> >> };
>> >> >> >> struct inode *inode = file_inode(filp);
>> >> >> >> - struct file_lock_context *flctx = inode->i_flctx;
>> >> >> >>
>> >> >> >> if (list_empty(&flctx->flc_flock))
>> >> >> >> return;
>> >> >> >> @@ -2416,10 +2424,8 @@ locks_remove_flock(struct file *filp)
>> >> >> >>
>> >> >> >> /* The i_flctx must be valid when calling into here */
>> >> >> >> static void
>> >> >> >> -locks_remove_lease(struct file *filp)
>> >> >> >> +locks_remove_lease(struct file *filp, struct file_lock_context *ctx)
>> >> >> >> {
>> >> >> >> - struct inode *inode = file_inode(filp);
>> >> >> >> - struct file_lock_context *ctx = inode->i_flctx;
>> >> >> >> struct file_lock *fl, *tmp;
>> >> >> >> LIST_HEAD(dispose);
>> >> >> >>
>> >> >> >> @@ -2439,17 +2445,20 @@ locks_remove_lease(struct file *filp)
>> >> >> >> */
>> >> >> >> void locks_remove_file(struct file *filp)
>> >> >> >> {
>> >> >> >> - if (!file_inode(filp)->i_flctx)
>> >> >> >> + struct file_lock_context *ctx;
>> >> >> >> +
>> >> >> >> + ctx = smp_load_acquire(&file_inode(filp)->i_flctx);
>> >> >> >> + if (!ctx)
>> >> >> >> return;
>> >> >> >>
>> >> >> >> /* remove any OFD locks */
>> >> >> >> locks_remove_posix(filp, filp);
>> >> >> >>
>> >> >> >> /* remove flock locks */
>> >> >> >> - locks_remove_flock(filp);
>> >> >> >> + locks_remove_flock(filp, ctx);
>> >> >> >>
>> >> >> >> /* remove any leases */
>> >> >> >> - locks_remove_lease(filp);
>> >> >> >> + locks_remove_lease(filp, ctx);
>> >> >> >> }
>> >> >> >>
>> >> >> >> /**
>> >> >> >> @@ -2616,7 +2625,7 @@ void show_fd_locks(struct seq_file *f,
>> >> >> >> struct file_lock_context *ctx;
>> >> >> >> int id = 0;
>> >> >> >>
>> >> >> >> - ctx = inode->i_flctx;
>> >> >> >> + ctx = smp_load_acquire(&inode->i_flctx);
>> >> >> >> if (!ctx)
>> >> >> >> return;
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> > --
>> >> >> > Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> > --
>> >> > Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
>> >>
>> >>
>> >>
>> >
>> >
>> > --
>> > Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
>>
>>
>>
>
>
> --
> Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
--
Dmitry Vyukov, Software Engineer, dvyukov@xxxxxxxxxx
Google Germany GmbH, DienerstraÃe 12, 80331, MÃnchen
GeschÃftsfÃhrer: Graham Law, Christine Elizabeth Flores
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Diese E-Mail ist vertraulich. Wenn Sie nicht der richtige Adressat
sind, leiten Sie diese bitte nicht weiter, informieren Sie den
Absender und lÃschen Sie die E-Mail und alle AnhÃnge. Vielen Dank.
This e-mail is confidential. If you are not the right addressee please
do not forward it, please inform the sender, and please erase this
e-mail including any attachments. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/