On 03/26/2018 11:18 AM, Daniel Vetter wrote:I have implemented hotunplug and it works with zombie DRM devices as we discussed.
On Fri, Mar 23, 2018 at 05:54:49PM +0200, Oleksandr Andrushchenko wrote:Yes, this is what I implement now, e.g. I do not create
No, you must make sure that no dumb buffer can be seen by anyone elseMy apologies, but I found a few more things that look strange and shouldThank you for reviewing!
be cleaned up. Sorry for this iterative review approach, but I think we're
slowly getting there.
Cheers, DanielWill fix, thank you
---You can't drop the reference while you keep using the object, someone else
+static int xen_drm_drv_dumb_create(struct drm_file *filp,
+ÂÂÂÂÂÂÂ struct drm_device *dev, struct drm_mode_create_dumb *args)
+{
+ÂÂÂ struct xen_drm_front_drm_info *drm_info = dev->dev_private;
+ÂÂÂ struct drm_gem_object *obj;
+ÂÂÂ int ret;
+
+ÂÂÂ ret = xen_drm_front_gem_dumb_create(filp, dev, args);
+ÂÂÂ if (ret)
+ÂÂÂÂÂÂÂ goto fail;
+
+ÂÂÂ obj = drm_gem_object_lookup(filp, args->handle);
+ÂÂÂ if (!obj) {
+ÂÂÂÂÂÂÂ ret = -ENOENT;
+ÂÂÂÂÂÂÂ goto fail_destroy;
+ÂÂÂ }
+
+ÂÂÂ drm_gem_object_unreference_unlocked(obj);
might sneak in and destroy your object. The unreference always must be
last.
You are correct, I need to rework this code+The above also has another race: If you construct an object, then it must
+ÂÂÂ /*
+ÂÂÂÂ * In case of CONFIG_DRM_XEN_FRONTEND_CMA gem_obj is constructed
+ÂÂÂÂ * via DRM CMA helpers and doesn't have ->pages allocated
+ÂÂÂÂ * (xendrm_gem_get_pages will return NULL), but instead can provide
+ÂÂÂÂ * sg table
+ÂÂÂÂ */
+ÂÂÂ if (xen_drm_front_gem_get_pages(obj))
+ÂÂÂÂÂÂÂ ret = xen_drm_front_dbuf_create_from_pages(
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ drm_info->front_info,
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ xen_drm_front_dbuf_to_cookie(obj),
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ args->width, args->height, args->bpp,
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ args->size,
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ xen_drm_front_gem_get_pages(obj));
+ÂÂÂ else
+ÂÂÂÂÂÂÂ ret = xen_drm_front_dbuf_create_from_sgt(
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ drm_info->front_info,
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ xen_drm_front_dbuf_to_cookie(obj),
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ args->width, args->height, args->bpp,
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ args->size,
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ xen_drm_front_gem_get_sg_table(obj));
+ÂÂÂ if (ret)
+ÂÂÂÂÂÂÂ goto fail_destroy;
+
be fully constructed by the time you publish it to the wider world. In gem
this is done by calling drm_gem_handle_create() - after that userspace can
get at your object and do nasty things with it in a separate thread,
forcing your driver to Oops if the object isn't fully constructed yet.
That means you need to redo this code here to make sure that the gem
object is fully set up (including pages and sg tables) _before_ anything
calls drm_gem_handle_create().
This probably means you also need to open-code the cma side, by firstAlthough I tend to avoid open-coding, but this seems the necessary measure
calling drm_gem_cma_create(), then doing any additional setup, and finally
doing the registration to userspace with drm_gem_handle_create as the very
last thing.
here
Alternativet is to do the pages/sg setup only when you create an fb (andNot sure this will work: nothing prevents you from attaching multiple FBs to
drop the pages again when the fb is destroyed), but that requires some
refcounting/locking in the driver.
a single dumb handle
So, not only ref-counting should be done here, but I also need to check if
the dumb buffer,
we are attaching to, has been created already
before it's fully created. If you don't register it in the file_priv idr
using drm_gem_handle_create, no one else can get at your buffer. Trying to
paper over this race from all the other places breaks the gem core code
design, and is also much more fragile.
any dumb handle until GEM is fully created. I was just
saying that alternative way when we do pages/sgt on FB
attach will not work in my case
Ok, probably this is because I'm looking at the driverSo, I will rework with open-coding some stuff from CMA helpersI think it's the same indirection we talked about last time, it still
Aside: There's still a lot of indirection and jumping around which makesProbably I am not sure of which indirection we are talking about, could you
the code a bit hard to follow.
please
specifically mark those annoying you?
annoys me. But it's still ok if you prefer this way I think :-)
from an editor, but you are from your mail client ;)
+This needs to be in the unplug code. Yes that means you'll have multiple
+static void xen_drm_drv_release(struct drm_device *dev)
+{
+ÂÂÂ struct xen_drm_front_drm_info *drm_info = dev->dev_private;
+ÂÂÂ struct xen_drm_front_info *front_info = drm_info->front_info;
+
+ÂÂÂ drm_atomic_helper_shutdown(dev);
+ÂÂÂ drm_mode_config_cleanup(dev);
+
+ÂÂÂ xen_drm_front_evtchnl_free_all(front_info);
+ÂÂÂ dbuf_free_all(&front_info->dbuf_list);
+
+ÂÂÂ drm_dev_fini(dev);
+ÂÂÂ kfree(dev);
+
+ÂÂÂ /*
+ÂÂÂÂ * Free now, as this release could be not due to rmmod, but
+ÂÂÂÂ * due to the backend disconnect, making drm_info hang in
+ÂÂÂÂ * memory until rmmod
+ÂÂÂÂ */
+ÂÂÂ devm_kfree(&front_info->xb_dev->dev, front_info->drm_info);
+ÂÂÂ front_info->drm_info = NULL;
+
+ÂÂÂ /* Tell the backend we are ready to (re)initialize */
+ÂÂÂ xenbus_switch_state(front_info->xb_dev, XenbusStateInitialising);
drm_devices floating around, but that's how hotplug works. That would also
mean that you need to drop the front_info pointer from the backend at
unplug time.
GreatI think so, yes.destroy the drm_device, but only mark the drm_connector as disconnectedI'll try to play with this:
when the xenbus backend is gone. But this half-half solution here where
you hotunplug the drm_device but want to keep it around still doesn't work
from a livetime pov.
on backend disconnect I will do the following:
ÂÂÂÂ drm_dev_unplug(dev)
ÂÂÂÂ xen_drm_front_evtchnl_free_all(front_info);
ÂÂÂÂ dbuf_free_all(&front_info->dbuf_list);
ÂÂÂÂ devm_kfree(&front_info->xb_dev->dev, front_info->drm_info);
ÂÂÂÂ front_info->drm_info = NULL;
ÂÂÂÂ xenbus_switch_state(front_info->xb_dev, XenbusStateInitialising);
on drm_driver.release callback:
ÂÂÂÂ drm_atomic_helper_shutdown(dev);
ÂÂÂÂ drm_mode_config_cleanup(dev);
ÂÂÂÂ drm_dev_fini(dev);
ÂÂÂÂ kfree(dev);
Does the above make sense?
 One nit: Since you need to call devm_kfree either pick aSure, I just copy-pasted from the existing patch with devm_
different struct device that has the correct lifetime, or switch to the
normal kmalloc/kfree versions.
so we can discuss
I have hotplug in the driver for connectors now, so no new UAPIThen you need to fix your userspace. You can't invent new uapi which willProbably misunderstanding comes from the fact that it is possible if backend+static struct xenbus_driver xen_driver = {I still don't understand why you have both the remove and fini versions of
+ÂÂÂ .ids = xen_driver_ids,
+ÂÂÂ .probe = xen_drv_probe,
+ÂÂÂ .remove = xen_drv_remove,
this. See other comments, I think the xenbus vs. drm_device lifetime stuff
still needs to be cleaned up some more. This shouldn't be that hard
really.
Or maybe I'm just totally misunderstanding this frontend vs. backend split
in xen, so if you have a nice gentle intro text for why that exists, it
might help.
dies it may still have its XenBus state set to connected, thus
displback_disconnect callback will never be called. For that reason on rmmod
I call fini for the DRM driver to destroy it.
That was the idea to allow dummy user-space to get an error in+ÂÂÂ /*I'm pretty sure this doesn't work. Especially the check in display_check
+ÂÂÂÂ * pflip_timeout is set to current jiffies once we send a page flip and
+ÂÂÂÂ * reset to 0 when we receive frame done event from the backed.
+ÂÂÂÂ * It is checked during drm_connector_helper_funcs.detect_ctx to detect
+ÂÂÂÂ * time-outs for frame done event, e.g. due to backend errors.
+ÂÂÂÂ *
+ÂÂÂÂ * This must be protected with front_info->io_lock, so races between
+ÂÂÂÂ * interrupt handler and rest of the code are properly handled.
+ÂÂÂÂ */
+ÂÂÂ unsigned long pflip_timeout;
+
+ÂÂÂ bool conn_connected;
confuses me, if there's ever an error then you'll never ever be able to
display anything again, except when someone disables the display.
display_check and close, going through display_disable.
Yes, compositors will die in this case.
If you want to signal errors with the output then this must be doneUnfortunatelly, there is little software available which will benefit
through the new link-status property and
drm_mode_connector_set_link_status_property. Rejecting kms updates in
display_check with -EINVAL because the hw has a temporary issue is kinda
not cool (because many compositors just die when this happens). I thought
we agreed already to remove that, sorry for not spotting that in the
previous version.
from this out of the box. I am specifically interested in embedded
use-cases, e.g. Android (DRM HWC2 - doesn't support hotplug, HWC1.4 doesn't
support link status), Weston (no device hotplug, but connectors and
outputs).
Other software, like kmscube, modetest will not handle that as well.
So, such software will hang forever until killed.
break existing compositors like this.
Also I thought you've fixed theI do, I was just saying that modetest/kmscube doesn't
"hangs forever" by sending out the uevent in case the backend disappears
or has an error. That's definitely something that should be fixed, current
userspace doesn't expect that events never get delivered.
handle hotplug events, so they can't understand that the
connector is gone
Good, during the development I am probably seeing same
Yes. Well, as soon as Noralf's work has landed they'll switch to aSome of the conn_connected checks also look a bit like they should beI believe you are talking about drm_simple_display_pipe_funcs?
replaced by drm_dev_is_unplugged instead, but I'm not sure.
Do you mean I have to put drm_dev_is_unplugged in display_enable,
display_disable and display_update callbacks?
drm_dev_enter/exit pair, but same idea.
races because of this, e.g. I only have drm_dev_is_unplugged
as my tool which is not enough
Will doYes, please create your own timer/worker for this, stuffing random otherOk, will have a dedicated work for that. The reasons why I put this into the+static int connector_detect(struct drm_connector *connector,If you want to check for timeouts please use a worker, don't piggy-pack on
+ÂÂÂÂÂÂÂ struct drm_modeset_acquire_ctx *ctx,
+ÂÂÂÂÂÂÂ bool force)
+{
+ÂÂÂ struct xen_drm_front_drm_pipeline *pipeline =
+ÂÂÂÂÂÂÂÂÂÂÂ to_xen_drm_pipeline(connector);
+ÂÂÂ struct xen_drm_front_info *front_info = pipeline->drm_info->front_info;
+ÂÂÂ unsigned long flags;
+
+ÂÂÂ /* check if there is a frame done event time-out */
+ÂÂÂ spin_lock_irqsave(&front_info->io_lock, flags);
+ÂÂÂ if (pipeline->pflip_timeout &&
+ÂÂÂÂÂÂÂÂÂÂÂ time_after_eq(jiffies, pipeline->pflip_timeout)) {
+ÂÂÂÂÂÂÂ DRM_ERROR("Frame done event timed-out\n");
+
+ÂÂÂÂÂÂÂ pipeline->pflip_timeout = 0;
+ÂÂÂÂÂÂÂ pipeline->conn_connected = false;
+ÂÂÂÂÂÂÂ xen_drm_front_kms_send_pending_event(pipeline);
+ÂÂÂ }
+ÂÂÂ spin_unlock_irqrestore(&front_info->io_lock, flags);
top of the detect callback.
detect callback were:
- the periodic worker is already there, and I do nothing heavy
ÂÂ in this callback
- if frame done has timed out it most probably means that
ÂÂ backend has gone, so 10 sec period of detect timeout is also ok: thus I
don't
ÂÂ need to schedule a work each page flip which could be a bit costly
So, probably I will also need a periodic work (or kthread/timer) for frame
done time-outs
things into existing workers makes the locking hierarchy more complicated
for everyone. And it's confusing for core devs trying to understand what
your driver does :-)
Just to make sure we are on the same page: I just move connector_mode_valid
Most drivers have piles of timers/workers doing various stuff, they're
real cheap.
No, you still need your mode_valid check. Userspace can ignore your modeAgree, I will remove this callback completely: I have+static int connector_mode_valid(struct drm_connector *connector,mode_valid on the connector only checks probe modes. Since that is
+ÂÂÂÂÂÂÂ struct drm_display_mode *mode)
+{
+ÂÂÂ struct xen_drm_front_drm_pipeline *pipeline =
+ÂÂÂÂÂÂÂÂÂÂÂ to_xen_drm_pipeline(connector);
+
+ÂÂÂ if (mode->hdisplay != pipeline->width)
+ÂÂÂÂÂÂÂ return MODE_ERROR;
+
+ÂÂÂ if (mode->vdisplay != pipeline->height)
+ÂÂÂÂÂÂÂ return MODE_ERROR;
+
+ÂÂÂ return MODE_OK;
+}
hardcoded this doesn't do much, which means userspace can give you a wrong
mode, and you fall over.
drm_connector_funcs.fill_modes == drm_helper_probe_single_connector_modes,
so it will only pick my single hardcoded mode from
drm_connector_helper_funcs.get_modes
callback (connector_get_modes).
list and give you something totally different. But it needs to be moved to
the drm_simple_display_pipe_funcs vtable.
as is to drm_simple_display_pipe_funcs, right?
It is all clearPlease read the kerneldoc again, userspace can give you modes that are notYou need to use one of the other mode_valid callbacks instead,Not sure I understand why do I need to provide a callback here?
drm_simple_display_pipe_funcs has the one you should use.
For simple KMS the drm_simple_kms_crtc_mode_valid callback is used,
which always returns MODE_OK if there is no .mode_valid set for the pipe.
As per my understanding drm_simple_kms_crtc_mode_valid is only called for
modes, which were collected by drm_helper_probe_single_connector_modes,
so I assume each time .validate_mode is called it can only have my hardcoded
mode to validate?
coming from drm_helper_probe_single_connector_modes. If the kerneldoc
isn't clear, then please submit a patch to make it clearer.
In ideal world - yes, we have to fix existing software ;)Fix your userspace. Again, you can't invent new uapi like this which endsNot sure how mode_valid is relevant to this code [1]: This function is+As mentioned, this -EINVAL here needs to go. Since you already have a
+static int display_check(struct drm_simple_display_pipe *pipe,
+ÂÂÂÂÂÂÂ struct drm_plane_state *plane_state,
+ÂÂÂÂÂÂÂ struct drm_crtc_state *crtc_state)
+{
+ÂÂÂ struct xen_drm_front_drm_pipeline *pipeline =
+ÂÂÂÂÂÂÂÂÂÂÂ to_xen_drm_pipeline(pipe);
+
+ÂÂÂ return pipeline->conn_connected ? 0 : -EINVAL;
mode_valid callback you can (should) drop this one here entirely.
called
in the check phase of an atomic update, specifically when the underlying
plane is checked. But, anyways: the reason for this callback and it
returning
-EINVAL is primarialy for a dumb user-space which cannot handle hotplug
events.
up being inconsistent with other existing userspace.
Thank you,
But, as you mentioned before, it will make most compositors die, so I willYup, sounds good.
remove this
Cheers, Daniel
Oleksandr