Re: i915 regression in kernel 4.10

From: Juergen Gross
Date: Mon Dec 19 2016 - 09:17:06 EST


On 19/12/16 13:29, Chris Wilson wrote:
> On Mon, Dec 19, 2016 at 12:39:16PM +0100, Juergen Gross wrote:
>> With recent 4.10 kernel the graphics isn't coming up under Xen. First
>> failure message is:
>>
>> [ 46.656649] i915 0000:00:02.0: swiotlb buffer is full (sz: 1630208 bytes)
>
> Do we get a silent failure? i915_gem_gtt_prepare_pages() is where we
> call dma_map_sg() and pass the sg to swiotlb (in this case) for
> remapping, and we do check for an error value of 0. After that error,
> SWIOTLB_MAP_ERROR is propagated back and converted to 0 for
> dma_map_sg(). That looks valid, and we should report ENOMEM back to the
> caller.
>
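
For reference, the error convention as I understand it (just a rough
sketch of the pattern, not the actual i915 code, so take it with a
grain of salt):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Rough sketch of the error convention only, not the real i915 code:
 * dma_map_sg() returns the number of mapped entries and 0 on failure,
 * so a full swiotlb has to be turned into -ENOMEM by the caller.
 */
static int prepare_pages_sketch(struct device *dev, struct sg_table *st)
{
	if (dma_map_sg(dev, st->sgl, st->nents, DMA_BIDIRECTIONAL) == 0)
		return -ENOMEM;		/* e.g. "swiotlb buffer is full" */

	return 0;
}

So if I read that correctly, a full swiotlb should indeed come back to
the caller as -ENOMEM rather than fail silently.
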
>> Later I see splats like:
>>
>> [ 49.393583] general protection fault: 0000 [#1] SMP
>
> What was the faulting address? RAX is particularly non-pointer-like so I
> wonder if we walked onto an uninitialised portion of the sgtable. We may
> have tripped over a bug in our sg_page iterator.

During the bisect process I saw either general protection faults, NULL
pointer dereferences or other page faults. Typical faulting addresses were:

xen_swiotlb_unmap_sg_attrs+0x1f/0x50: access to 0000000000000018
xen_swiotlb_unmap_sg_attrs+0x1f/0x50: access to 0000000003020118
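
For what it's worth, my understanding is that the unmap side simply
walks all entries of the sg table, which would explain faults at small
offsets like 0x18 if the table was never (fully) mapped. A very rough
sketch of that pattern (unmap_one_entry() is just a stand-in, not a
real kernel function):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Stand-in for the per-entry swiotlb unmap, only here to keep the
 * sketch self-contained; not a real kernel function. */
static void unmap_one_entry(struct device *hwdev, dma_addr_t dev_addr,
			    size_t size, enum dma_data_direction dir)
{
}

/*
 * Sketch of an unmap_sg()-style walk: every entry of the list is
 * touched, so if the table was never (fully) mapped, sg and its
 * dma_address/length fields are garbage and the walk faults at small
 * offsets like the 0x18 above.
 */
static void unmap_sg_sketch(struct device *hwdev, struct scatterlist *sgl,
			    int nelems, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nelems, i)
		unmap_one_entry(hwdev, sg_dma_address(sg), sg_dma_len(sg), dir);
}

That would fit your suspicion of walking onto an uninitialised portion
of the sgtable.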

>
> The attached patch should prevent an early ENOMEM following the swiotlb
> allocation failure. But I suspect that we will still be tripping up the
> failure in the sg walker when binding to the GPU.
> -Chris
>

The patch is working not too badly. :-)

I still see several "swiotlb buffer is full" messages (some with sz:,
most without), but no more faults (neither GP nor NULL pointer
dereference). Graphical login is working now.

What I do see, however, is (no idea whether this is related):

[ 735.826492] INFO: task systemd-udevd:484 blocked for more than 120 seconds.
[ 735.826497] Tainted: G W 4.9.0-pv+ #767
[ 735.826499] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 735.826501] systemd-udevd D 0 484 443 0x00000000
[ 735.826507] Call Trace:
[ 735.826522] ? __schedule+0x192/0x640
[ 735.826530] ? kmem_cache_free+0x45/0x150
[ 735.826535] ? schedule+0x2d/0x80
[ 735.826539] ? schedule_timeout+0x1f3/0x380
[ 735.826545] ? error_exit+0x9/0x20
[ 735.826555] ? sg_pool_index.part.0+0x2/0x2
[ 735.826561] ? wait_for_completion+0xa4/0x110
[ 735.826569] ? wake_up_q+0x70/0x70
[ 735.826577] ? cpufreq_boost_online+0x10/0x10 [acpi_cpufreq]
[ 735.826585] ? cpuhp_issue_call+0x9c/0xe0
[ 735.826590] ? __cpuhp_setup_state+0xd5/0x1d0
[ 735.826599] ? acpi_cpufreq_init+0x1cd/0x1000 [acpi_cpufreq]
[ 735.826601] ? 0xffffffffa00b1000
[ 735.826607] ? do_one_initcall+0x38/0x180
[ 735.826611] ? kmem_cache_alloc_trace+0x98/0x1e0
[ 735.826620] ? do_init_module+0x55/0x1e5
[ 735.826629] ? load_module+0x2088/0x26b0
[ 735.826633] ? __symbol_put+0x30/0x30
[ 735.826639] ? SYSC_finit_module+0x80/0xb0
[ 735.826644] ? entry_SYSCALL_64_fastpath+0x1e/0xad

I guess it is _not_ related; OTOH there is sg_pool_index() involved...


Juergen