Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

From: John Hubbard
Date: Tue Mar 12 2019 - 20:39:30 EST


On 3/12/19 8:30 AM, Ira Weiny wrote:
On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubbard@xxxxxxxxx wrote:
From: John Hubbard <jhubbard@xxxxxxxxxx>

Introduces put_user_page(), which simply calls put_page().
This provides a way to update all get_user_pages*() callers,
so that they call put_user_page(), instead of put_page().

So I've been running with these patches for a while but today while ramping up
my testing I hit the following:

[ 1355.557819] ------------[ cut here ]------------
[ 1355.563436] get_user_pages pin count overflowed

Hi Ira,

Thanks for reporting this. That overflow, at face value, means that we've
used more than the 22 bits worth of gup pin counts, so about 4 million pins
of the same page...

[ 1355.563446] WARNING: CPU: 1 PID: 1740 at mm/gup.c:73 get_gup_pin_page+0xa5/0xb0
[ 1355.577391] Modules linked in: ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transpo
rt_srp ext4 mbcache jbd2 mlx4_ib opa_vnic rpcrdma sunrpc rdma_ucm ib_iser rdma_cm ib_umad iw_cm libiscs
i ib_ipoib scsi_transport_iscsi ib_cm sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbyp
ass snd_hda_codec_realtek ib_uverbs snd_hda_codec_generic crct10dif_pclmul ledtrig_audio snd_hda_intel
crc32_pclmul snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel crypto_simd s
nd_timer ib_core cryptd snd glue_helper dax_pmem soundcore nd_pmem ipmi_si device_dax nd_btt ioatdma nd
_e820 ipmi_devintf ipmi_msghandler iTCO_wdt i2c_i801 iTCO_vendor_support libnvdimm pcspkr lpc_ich mei_m
e mei mfd_core wmi pcc_cpufreq acpi_cpufreq sch_fq_codel xfs libcrc32c mlx4_en sr_mod cdrom sd_mod mgag
200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core ttm crc32c_intel igb isci ah
ci dca libsas firewire_ohci drm i2c_algo_bit libahci scsi_transport_sas
[ 1355.577429] firewire_core crc_itu_t i2c_core libata dm_mod [last unloaded: rdmavt]
[ 1355.686703] CPU: 1 PID: 1740 Comm: reg-mr Not tainted 5.0.0+ #10
[ 1355.693851] Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.02.04.0003.1023201411
38 10/23/2014
[ 1355.705750] RIP: 0010:get_gup_pin_page+0xa5/0xb0
[ 1355.711348] Code: e8 40 02 ff ff 80 3d ba a2 fb 00 00 b8 b5 ff ff ff 75 bb 48 c7 c7 48 0a e9 81 89 4
4 24 04 c6 05 a1 a2 fb 00 01 e8 35 63 e8 ff <0f> 0b 8b 44 24 04 eb 9c 0f 1f 00 66 66 66 66 90 41 57 49
bf 00 00
[ 1355.733244] RSP: 0018:ffffc90005a23b30 EFLAGS: 00010286
[ 1355.739536] RAX: 0000000000000000 RBX: ffffea0014220000 RCX: 0000000000000000
[ 1355.748005] RDX: 0000000000000003 RSI: ffffffff827d94a3 RDI: 0000000000000246
[ 1355.756453] RBP: ffffea0014220000 R08: 0000000000000002 R09: 0000000000022400
[ 1355.764907] R10: 0009ccf0ad0c4203 R11: 0000000000000001 R12: 0000000000010207
[ 1355.773369] R13: ffff8884130b7040 R14: fff0000000000fff R15: 000fffffffe00000
[ 1355.781836] FS: 00007f2680d0d740(0000) GS:ffff88842e840000(0000) knlGS:0000000000000000
[ 1355.791384] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1355.798319] CR2: 0000000000589000 CR3: 000000040b05e004 CR4: 00000000000606e0
[ 1355.806809] Call Trace:
[ 1355.810078] follow_page_pte+0x4f3/0x5c0
[ 1355.814987] __get_user_pages+0x1eb/0x730
[ 1355.820020] get_user_pages+0x3e/0x50
[ 1355.824657] ib_umem_get+0x283/0x500 [ib_uverbs]
[ 1355.830340] ? _cond_resched+0x15/0x30
[ 1355.835065] mlx4_ib_reg_user_mr+0x75/0x1e0 [mlx4_ib]
[ 1355.841235] ib_uverbs_reg_mr+0x10c/0x220 [ib_uverbs]
[ 1355.847400] ib_uverbs_write+0x2f9/0x4d0 [ib_uverbs]
[ 1355.853473] __vfs_write+0x36/0x1b0
[ 1355.857904] ? selinux_file_permission+0xf0/0x130
[ 1355.863702] ? security_file_permission+0x2e/0xe0
[ 1355.869503] vfs_write+0xa5/0x1a0
[ 1355.873751] ksys_write+0x4f/0xb0
[ 1355.878009] do_syscall_64+0x5b/0x180
[ 1355.882656] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1355.888862] RIP: 0033:0x7f2680ec3ed8
[ 1355.893420] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 45 78 0
d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49
89 d4 55
[ 1355.915573] RSP: 002b:00007ffe65d50bc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1355.924621] RAX: ffffffffffffffda RBX: 00007ffe65d50c74 RCX: 00007f2680ec3ed8
[ 1355.933195] RDX: 0000000000000030 RSI: 00007ffe65d50c80 RDI: 0000000000000003
[ 1355.941760] RBP: 0000000000000030 R08: 0000000000000007 R09: 0000000000581260
[ 1355.950326] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000581930
[ 1355.958885] R13: 000000000000000c R14: 0000000000581260 R15: 0000000000000000
[ 1355.967430] ---[ end trace bc771ac6189977a2 ]---


I'm not sure what I did to do this and I'm going to work on a reproducer. At
the time of the Warning I only had 1 GUP user?!?!?!?!

If there is a get_user_pages() call that lacks a corresponding put_user_pages()
call, then the count could start working its way up, and up. Either that, or a
bug in my patches here, could cause this. The basic counting works correctly
in fio runs on an NVMe driver with Direct IO, when I dump out
`cat /proc/vmstat | grep gup`: the counts match up, but that is a simple test.

One way to force a faster repro is to increase the GUP_PIN_COUNTING_BIAS, so
that the gup pin count runs into the max much sooner.

I'd really love to create a test setup that would generate this failure, so
anything you discover on how to repro (including what hardware is required--I'm
sure I can scrounge up some IB gear in a pinch) is of great interest.

Also, I'm just now starting on the DEBUG_USER_PAGE_REFERENCES idea that Jerome,
Jan, and Dan floated some months ago. It's clearly a prerequisite to converting
the call sites properly--just our relatively small IB driver is showing that.
This feature will provide a different mapping of the struct pages, if get
them via get_user_pages(). That will allow easily asserting that put_user_page()
and put_page() are not swapped, in either direction.



I'm not using ODP, so I don't think the changes we have discussed there are a
problem.

Ira



thanks,
--
John Hubbard
NVIDIA