RE: [PATCH][drm-next] drm/i915/gvt: ensure -ve return value is handled correctly

From: Wang, Zhi A
Date: Fri Sep 22 2017 - 13:50:56 EST


Thanks for the reply. Learned a lot. :)

GEM_BUG_ON is new to me since it wasn't there at the beginning of GVT-g upstream. It showed up later. So I left a lot of WARN_ON in the code and some of them should be GEM_BUG_ON now.

Now I can figure out those differences. We can discuss with our QA to see if they would like to enable I915_GEM_DEBUG and then we can move to GEM_BUG_ON also, or maybe we can have a dedicated GVT_BUG_ON. :) Thank you so much. Have a great weekend.

Thanks,
Zhi.

-----Original Message-----
From: Joonas Lahtinen [mailto:joonas.lahtinen@xxxxxxxxxxxxxxx]
Sent: Friday, September 22, 2017 2:11 PM
To: Wang, Zhi A <zhi.a.wang@xxxxxxxxx>; Zhenyu Wang <zhenyuw@xxxxxxxxxxxxxxx>; Joe Perches <joe@xxxxxxxxxxx>
Cc: Gao, Fred <fred.gao@xxxxxxxxx>; David Airlie <airlied@xxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; kernel-janitors@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Jani Nikula <jani.nikula@xxxxxxxxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Vivi, Rodrigo <rodrigo.vivi@xxxxxxxxx>; Colin King <colin.king@xxxxxxxxxxxxx>; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH][drm-next] drm/i915/gvt: ensure -ve return value is handled correctly

On Thu, 2017-09-21 at 16:17 +0000, Wang, Zhi A wrote:
> Hi Joonas:
>
> Thanks for the introduction. I have been thinking about the
> possibility of introducing GEM_BUG_ON into GVT-g recently and
> investigating on it. I'm just a bit confused about the usage between
> GEM_BUG_ON and WARN_ON.

GEM_BUG_ON is basically there to catch things that we do not expect ever to happen within the driver. So we often list the function preconditions as GEM_BUG_ON. It's there for the same reason as the lockdep_assert_held and KASAN. It's sometimes heavy checks that we really want to run when functionally validating kernel.

GEM_BUG_ON became to existence because adding checks for obvious conditions at the critical command submission path GEM is not sustainable for performance in production.

The expectation is that each GEM_BUG_ON has a testcase in I-G-T that has the potential to hit it if driver was modified not to respect those preconditions. So once our testest passes, we can disable the GEM_BUG_ONs and be confident of the internal driver quality and get the release performance.

WARN_ON is mostly used for the cases when the hardware is behaving differently than we expect. We can't remove them as we don't have all the hardware in the world to test, but we try to exercise them too through I-G-Ts. The test will often be the subtest that was written to reproduce the problem with our expectations of hardware in case of hangs and other bugs. After we've corrected the driver behaviour, or got a hardware W/A assigned, we keep the test and add a WARN_ON to make sure there will be no regression back to the same situation.

This is at least what should happen, given time constraints, there may be variations.

User behaving unexpectedly should never result in WARN_ON (or even worse, BUG_ON), should always just be debug messages displayed (not to trigger the CI) and errors propagated back to user:

https://01.org/linuxgraphics/gfx-docs/drm/gpu/drm-uapi.html#recommended
-ioctl-return-values

Bare BUG_ON should only be used when there's the danger of corrupting system memory or filesystems, so from graphics driver, that's not very often. Controlled propagation of errors and maybe WARN_ON is always preferred if possible.


> GEM_BUG_ON is only enabled when kernel debug is enabled, which mostly
> is disabled in a production kernel. In the case of i915, I'm sure it
> will be enabled in CI test so that it can catch broken code path.
> Looking into GVT-g, the similar scenario is we enable it in QA test.
>
> Let's say GEM_BUG_ON can do its work very well in QA test but QA test
> is not fully covered all the condition, then something might be still
> broken when it comes to the production kernel for user and GEM_BUG_ON
> will be disabled and will not catch that, I guess.
>
> That's my confusion which scratched my mind during the investigation:
> If GEM_BUG_ON is not always working, then it looks WARN_ON should
> always be used.... Expected to learn more about the story behind. :)

So if the saying is some object is "never going to be bigger than 2G", there should be either:

1. GEM_BUG_ON like assertion for it and a test that tries to hit it, by trying to allocate a huge object for example, and should get rejection as -EINVAL

2. Test to see if the object is bigger, and propagate back the error if it is. Either resulting in user reported error if the origin of the object is outside of kernel <-> hardware. Or a WARN_ON if it's strange hardware or kernel driver behavior.

You should choose depending on how often your function gets called, and how critical the execution time is.

Hopefully this clarified things.

Regards, Joonas
--
Joonas Lahtinen
Open Source Technology Center
Intel Corporation