Re: [PATCH 01/16] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()

From: John Hubbard
Date: Mon May 03 2021 - 14:17:38 EST


On 5/3/21 8:57 AM, Logan Gunthorpe wrote:


On 2021-05-01 9:58 p.m., John Hubbard wrote:
Another odd thing: this used to check for memory failure and just give
up, and now it doesn't. Yes, I realize that it all still works at the
moment, but this is quirky and we shouldn't stop here.

Instead, a cleaner approach would be to push the memory allocation
slightly higher up the call stack, out to
pci_p2pdma_distance_many(). So pci_p2pdma_distance_many() should make
the kmalloc() call, and fail out if it can't get a page for the seq_buf
buffer. Then you don't have to do all this odd stuff.

I don't really agree with this assessment. If kmalloc() fails when
initializing the seq_buf (which should be very rare), the only thing
that is lost is the one warning print that tells the user the command
line parameter needed to disable ACS. Everything else works fine,
nothing else can fail. I don't see the need to add extra complexity just
so the code errors out in no-mem instead of just skipping the one,
slightly more informative, warning line.

That's the thing: memory failure should be exceedingly rare for this.
Therefore, just fail out entirely (which I don't expect we'll likely
ever see), instead of doing all this weird stuff to try to continue
on if you cannot allocate a single page. If you are in that case, the
system is not in a state that is going to run your dma p2p setup well
anyway.

I think it's *less* complexity to allocate up front, fail early if
allocation fails, and then not have to deal with these really odd
quirks at the lower levels.
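
To illustrate, here is a minimal userspace sketch of the "allocate up
front, fail early" shape. It is not the actual p2pdma code: struct
msg_buf, note_acs_device(), distance_many() and the alloc callback are
hypothetical names, malloc() stands in for kmalloc(), and the callback
exists only so that an allocation failure can be simulated. The point is
that the lower-level helper never allocates, so it has no memory-failure
path to handle.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE 4096	/* stand-in for a single page */

/* Hypothetical stand-in for the kernel's seq_buf. */
struct msg_buf {
	char *data;
	size_t size;
	size_t len;
};

/*
 * Lower-level helper: appends a device name to a buffer that the caller
 * already owns. It never allocates, so it cannot run out of memory.
 * Output that does not fit is silently truncated.
 */
static void note_acs_device(struct msg_buf *buf, const char *name)
{
	size_t room = buf->size - buf->len - 1;	/* keep space for the NUL */
	int n = snprintf(buf->data + buf->len, buf->size - buf->len,
			 "%s;", name);

	if (n > 0)
		buf->len += (size_t)n < room ? (size_t)n : room;
}

/*
 * Top-level entry point: allocate once, up front, and fail out with
 * -ENOMEM before doing anything else. The alloc callback must return
 * memory that can be released with free(), or NULL on failure.
 */
static int distance_many(const char *const *names, size_t count,
			 void *(*alloc)(size_t))
{
	struct msg_buf buf;
	size_t i;

	buf.data = alloc(BUF_SIZE);
	if (!buf.data)
		return -ENOMEM;

	buf.data[0] = '\0';
	buf.size = BUF_SIZE;
	buf.len = 0;

	for (i = 0; i < count; i++)
		note_acs_device(&buf, names[i]);

	if (buf.len)
		fprintf(stderr, "ACS redirect set on: %s\n", buf.data);

	free(buf.data);
	return 0;
}

/* Simulates an allocation failure for testing the early-exit path. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

With this shape, the only place that has to think about -ENOMEM is the
top-level function, and in the kernel the caller would also be the one
to choose the allocation flags.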


Also, keep in mind the results of all these functions are cached so it
only ever happens once. So for this to matter, the user would have to do
their first transaction between two devices exactly at the time memory
allocations would fail.


Furthermore, the call sites can then decide for themselves which GFP
flags, GFP_ATOMIC or GFP_KERNEL or whatever they want for kmalloc().

A related thing: this whole exercise would go better if there were a
preparatory patch or two that changed the return codes in this file to
something less crazy. There are too many functions that can fail, but
are treated as if they sort-of-mostly-would-never-fail, in the hopes of
using the return value directly for counting and such. This is badly
mistaken, and it leads developers to try to avoid returning -ENOMEM
(which is what we need here).

Hmm? Which functions can fail? And how?


Let's defer that to the other patches, I was sort of looking ahead to
those, sorry.

thanks,
--
John Hubbard
NVIDIA