Re: [PATCH 4/7] gpu: nova-core: falcon: use dma::Coherent

From: Alexandre Courbot

Date: Sat Mar 28 2026 - 09:03:33 EST

On Fri Mar 27, 2026 at 12:35 AM JST, Gary Guo wrote:
> On Thu Mar 26, 2026 at 3:04 PM GMT, Alexandre Courbot wrote:
>> On Wed Mar 25, 2026 at 11:14 AM JST, Eliot Courtney wrote:
>>> On Sat Mar 21, 2026 at 10:36 PM JST, Alexandre Courbot wrote:
>>>> Replace the nova-core local `DmaObject` with a `Coherent` that can
>>>> fulfill the same role.
>>>>
>>>> Signed-off-by: Alexandre Courbot <acourbot@xxxxxxxxxx>
>>>> ---
>>>> drivers/gpu/nova-core/falcon.rs | 6 +++---
>>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/nova-core/falcon.rs b/drivers/gpu/nova-core/falcon.rs
>>>> index 5bf8da8760bf..f6239c44dd80 100644
>>>> --- a/drivers/gpu/nova-core/falcon.rs
>>>> +++ b/drivers/gpu/nova-core/falcon.rs
>>>> @@ -10,6 +10,7 @@
>>>> Device, //
>>>> },
>>>> dma::{
>>>> + Coherent,
>>>> DmaAddress,
>>>> DmaMask, //
>>>> },
>>>> @@ -20,7 +21,6 @@
>>>> };
>>>>
>>>> use crate::{
>>>> - dma::DmaObject,
>>>> driver::Bar0,
>>>> falcon::hal::LoadMethod,
>>>> gpu::Chipset,
>>>> @@ -636,7 +636,7 @@ pub(crate) fn pio_load<F: FalconFirmware<Target = E> + FalconPioLoadable>(
>>>> fn dma_wr(
>>>> &self,
>>>> bar: &Bar0,
>>>> - dma_obj: &DmaObject,
>>>> + dma_obj: &Coherent<[u8]>,
>>>> target_mem: FalconMem,
>>>> load_offsets: FalconDmaLoadTarget,
>>>> ) -> Result {
>>>> @@ -740,7 +740,7 @@ fn dma_load<F: FalconFirmware<Target = E> + FalconDmaLoadable>(
>>>> fw: &F,
>>>> ) -> Result {
>>>> // Create DMA object with firmware content as the source of the DMA engine.
>>>> - let dma_obj = DmaObject::from_data(dev, fw.as_slice())?;
>>>> + let dma_obj = Coherent::from_slice(dev, fw.as_slice(), GFP_KERNEL)?;
>>>
>>> Is it guaranteed that fw.as_slice() is a multiple of 256 in size?
>>> In `dma_wr` it breaks this up into 256 byte transfers. Since this
>>> no longer pads out to a page boundary, it means that it could now error
>>> (around "DMA transfer goes beyond range of DMA object") if the Dmem
>>> section's size is not divisible by 256. But tbh, I find it odd that
>>> `dma_wr` doesn't check that FalconDmaLoadTarget's length is a
>>> multiple of 256 anyway, because it looks like it'll write a bunch of
>>> unrelated bytes (since it rounds up to the nearest 256 to copy).
>>>
>>> Maybe we should enforce that `FalconDmaLoadTarget` length is divisible
>>> by 256?
>>>
>>> For this series if for all firmwares it's divisible by 256 then I think
>>> it's fine to leave this as is for now, but I do find the lack of
>>> checking in `dma_wr` (or anywhere else for FalconDmaLoadTarget) a bit
>>> odd.
>>
>> All coherent allocations are page-aligned (and use full pages), so we
>> are safe in terms of overflows.
>
> Let's not rely on this behaviour. There is no guarantee on what's at the end
> of allocation whatsoever. There's no guarantee that it will be initialized.
> Even with __GFP_ZERO only the size provided will be zeroed.
>
> If the GPU is going to read beyond ranges covered by `Coherent` (not just rely
> on the alignment), let's align up the allocation.
>
>>
>> Also `dma_wr` uses `div_ceil(256)` which will skip the last data block
>> entirely if it is not a multiple of 256. It might be a bit more robust
>> to explicitly check that the size is a multiple of 256 and return an
>> error if that is not the case indeed.
>
> div_ceil will not skip the last block, it will over-read beyond the end.
> div_floor would have skipped the block.

Ooopsie yes, of course. Making `dma_wr` check that the data is a
multiple of 256 is the simplest, I'll send a patch for that (with maybe
some padding code as I think I remember Turing at least did not always
follow the 256-alignment requirement).