Re: pmem and i_dio_count overhead

From: Jens Axboe
Date: Wed Apr 15 2015 - 14:27:44 EST


On 04/03/2015 03:35 PM, Elliott, Robert (Server Storage) wrote:

Jens, one of your patches from October 2013 never made it
to the kernel, but would be beneficial for pmem. It helps
IOPS about 15%.

Original patch: https://lkml.org/lkml/2013/10/24/130

From Jens Axboe
Subject [PATCH 05/11] direct-io: only inc/dec inode->i_dio_count for file systems
Date Thu, 24 Oct 2013 10:25:58 +0100

We don't need truncate protection for block devices, so add a flag
bypassing this cache line dirtying twice for every IO. This easily
contributes to 5-10% of the CPU time on high IOPS O_DIRECT testing.
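
The idea in the patch description above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel code: `struct sketch_inode`, `sketch_dio_begin()`, `sketch_dio_end()` and the flag name `SKETCH_DIO_SKIP_DIO_COUNT` are hypothetical names made up for illustration, and the value 0x08 is simply the bit mentioned later in this thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag name; 0x08 is the bit value mentioned in this thread. */
#define SKETCH_DIO_SKIP_DIO_COUNT 0x08

/* Simplified stand-in for struct inode: only the in-flight direct I/O counter. */
struct sketch_inode {
    atomic_int i_dio_count;
};

/* Submission path: take the truncate-protection reference unless the
 * caller (e.g. a block device) asked to skip it. */
static void sketch_dio_begin(struct sketch_inode *inode, int flags)
{
    if (!(flags & SKETCH_DIO_SKIP_DIO_COUNT))
        atomic_fetch_add(&inode->i_dio_count, 1);
}

/* Completion path: drop the reference. Returns true when this call
 * released the last in-flight reference, false otherwise. */
static bool sketch_dio_end(struct sketch_inode *inode, int flags)
{
    if (flags & SKETCH_DIO_SKIP_DIO_COUNT)
        return false; /* nothing was counted, so nothing to release */
    return atomic_fetch_sub(&inode->i_dio_count, 1) == 1;
}
```

With the flag set, neither path touches the shared counter, so the cache line holding it is never written from the I/O fast path. Without the flag, every I/O performs one atomic increment and one atomic decrement on it.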

Here are perf top results while running fio to pmem devices
using memcpy with non-temporal load and store instructions:

  20.54%  [pmem]    [k] pmem_do_bvec.isra.6     <the memcpy function>
  10.13%  [kernel]  [k] do_blockdev_direct_IO
   5.93%  [kernel]  [k] inode_dio_done
   4.46%  [kernel]  [k] bio_endio
   3.07%  fio       [.] get_io_u
   2.08%  fio       [.] do_io

Inside do_blockdev_direct_IO (10%), 60% of the time is spent
atomically incrementing i_dio_count:

       │     static inline void atomic_inc(atomic_t *v)
       │     {
       │             asm volatile(LOCK_PREFIX "incl %0"
  0.06 │225:   lock   incl   0x134(%r14)
       │             atomic_inc(&inode->i_dio_count);
       │
       │             retval = 0;
       │             sdio.blkbits = blkbits;
       │             sdio.blkfactor = i_blkbits - blkbits;
       │             sdio.block_in_file = offset >> blkbits;
 60.31 │       mov    -0x1d0(%rbp),%rdx
  0.16 │       mov    %r12d,%ecx
       │              */
       │             atomic_inc(&inode->i_dio_count);
       │
       │             retval = 0;
       │             sdio.blkbits = blkbits;
       │             sdio.blkfactor = i_blkbits - blkbits;
  0.00 │       sub    %r12d,%ebx
       │              * Will be decremented at I/O completion time.
       │              */
       │             atomic_inc(&inode->i_dio_count);

inode_dio_done is taking all of its 5.8% time doing the
corresponding atomic_dec.

So, they're combining for 11.8% of the overall CPU time.
The problem is more atomic contention than cache line dirtying.
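
For reference, the 11.8% figure follows from the rounded numbers quoted above: 60% of the 10% spent in do_blockdev_direct_IO, plus the 5.8% in inode_dio_done. A minimal sketch of that arithmetic, where the helper name `nested_share()` is made up for illustration:

```c
/* Share of total CPU time attributable to one hot spot, given the
 * enclosing function's share of total CPU (func_pct) and the hot spot's
 * share within that function (within_pct). Both are percentages. */
static double nested_share(double func_pct, double within_pct)
{
    return func_pct * within_pct / 100.0;
}

/* Combined cost of the increment (nested inside its function) and the
 * decrement (reported as a function-level share of its own). */
static double combined_share(double func_pct, double within_pct,
                             double dec_pct)
{
    return nested_share(func_pct, within_pct) + dec_pct;
}
```

`combined_share(10.0, 60.0, 5.8)` gives 11.8. Feeding in the unrounded perf figures instead (10.13, 60.31 and 5.93) gives about 12.0, so the estimate is not sensitive to the rounding.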

Applying your patch (changing the bitmask from 0x04 to
0x08, since 0x04 is taken now) eliminates those
instructions from perf top and improves the high IOPS
results by 5 to 15%.

Attr  Copy              Read IOPS  Write IOPS
====  ================  =========  ==========
UC    NT rd,wr          513 K      326 K
      with the patch:   510 K      325 K

WB    NT rd,wr          3.3 M      3.5 M
      with the patch:   3.8 M      3.9 M

WC    NT rd,wr          3.0 M      3.9 M
      with the patch:   3.1 M      4.1 M

WT    NT rd,wr          3.3 M      2.1 M
      with the patch:   3.7 M      3.7 M

(there is some other test environment inconsistency
with WT writes - I don't think this change really
helped by 76%)
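
The percentages quoted for the table can be reproduced with a one-line relative-change calculation. This is a sketch only, and the helper name `pct_change()` is made up for illustration:

```c
/* Relative change from `before` to `after`, in percent.
 * Both arguments must use the same unit (e.g. millions of IOPS). */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

For the WB read column, `pct_change(3.3, 3.8)` is roughly 15.2, which matches the upper end of the "5 to 15%" range. For the WT write column, `pct_change(2.1, 3.7)` is roughly 76.2, which is the 76% outlier noted above.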

Just re-posted a cleaned up variant, forgot to CC you... You've got it in private email as well.

Yes, let's finally get this in! Andrew, we ended up bikeshedding on this patch a lot this time, which is ultimately why it got dropped on the floor. I CC'ed you on the new submission as well.

--
Jens Axboe
