RE: [PATCH net-next] net: mana: Add support for variable page sizes of ARM64

From: Haiyang Zhang
Date: Tue Jun 11 2024 - 13:44:00 EST


(resending in plain text)

> -----Original Message-----
> From: Michael Kelley <mailto:mhklinux@xxxxxxxxxxx>
> Sent: Tuesday, June 11, 2024 12:35 PM
> To: Haiyang Zhang <mailto:haiyangz@xxxxxxxxxxxxx>; mailto:linux-hyperv@xxxxxxxxxxxxxxx;
> mailto:netdev@xxxxxxxxxxxxxxx
> Cc: Dexuan Cui <mailto:decui@xxxxxxxxxxxxx>; mailto:stephen@xxxxxxxxxxxxxxxxxx; KY
> Srinivasan <mailto:kys@xxxxxxxxxxxxx>; Paul Rosswurm <mailto:paulros@xxxxxxxxxxxxx>;
> mailto:olaf@xxxxxxxxx; vkuznets <mailto:vkuznets@xxxxxxxxxx>; mailto:davem@xxxxxxxxxxxxx;
> mailto:wei.liu@xxxxxxxxxx; mailto:edumazet@xxxxxxxxxx; mailto:kuba@xxxxxxxxxx;
> mailto:pabeni@xxxxxxxxxx; mailto:leon@xxxxxxxxxx; Long Li <mailto:longli@xxxxxxxxxxxxx>;
> mailto:ssengar@xxxxxxxxxxxxxxxxxxx; mailto:linux-rdma@xxxxxxxxxxxxxxx;
> mailto:daniel@xxxxxxxxxxxxx; mailto:john.fastabend@xxxxxxxxx; mailto:bpf@xxxxxxxxxxxxxxx;
> mailto:ast@xxxxxxxxxx; mailto:hawk@xxxxxxxxxx; mailto:tglx@xxxxxxxxxxxxx;
> mailto:shradhagupta@xxxxxxxxxxxxxxxxxxx; mailto:linux-kernel@xxxxxxxxxxxxxxx
> Subject: RE: [PATCH net-next] net: mana: Add support for variable page
> sizes of ARM64
>
> From: LKML haiyangz <mailto:lkmlhyz@xxxxxxxxxxxxx> On Behalf Of Haiyang Zhang
> Sent: Monday, June 10, 2024 2:23 PM
> >
> > As defined by the MANA Hardware spec, the queue size for DMA is 4KB
> > minimal, and power of 2.
>
> You say the hardware requires 4K "minimal". But the definitions in this
> patch hardcode to 4K, as if that's the only choice. Is the hardcoding to
> 4K a design decision made to simplify the MANA driver?

The HWC q size has to be exactly 4k, which is by HW design.
Other "regular" queues can be 2^n >= 4k.

>
> > To support variable page sizes (4KB, 16KB, 64KB) of ARM64, define
>
> A minor nit, but "variable" page size doesn't seem like quite the right
> description -- both here and in the Subject line.  On ARM64, the page
> size
> is a choice among a few fixed options.  Perhaps call it support for "page
> sizes
> other than 4K"?

"page sizes other than 4K" sounds good.

>
> > the minimal queue size as a macro separate from the PAGE_SIZE, which
> > we always assumed it to be 4KB before supporting ARM64.
> > Also, update the relevant code related to size alignment, DMA region
> > calculations, etc.
> >
> > Signed-off-by: Haiyang Zhang <mailto:haiyangz@xxxxxxxxxxxxx>
> > ---
> >  drivers/net/ethernet/microsoft/Kconfig        |  2 +-
> >  .../net/ethernet/microsoft/mana/gdma_main.c   |  8 +++----
> >  .../net/ethernet/microsoft/mana/hw_channel.c  | 22 +++++++++----------
> >  drivers/net/ethernet/microsoft/mana/mana_en.c |  8 +++----
> >  .../net/ethernet/microsoft/mana/shm_channel.c |  9 ++++----
> >  include/net/mana/gdma.h                       |  7 +++++-
> >  include/net/mana/mana.h                       |  3 ++-
> >  7 files changed, 33 insertions(+), 26 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/Kconfig
> > b/drivers/net/ethernet/microsoft/Kconfig
> > index 286f0d5697a1..901fbffbf718 100644
> > --- a/drivers/net/ethernet/microsoft/Kconfig
> > +++ b/drivers/net/ethernet/microsoft/Kconfig
> > @@ -18,7 +18,7 @@ if NET_VENDOR_MICROSOFT
> >  config MICROSOFT_MANA
> >  tristate "Microsoft Azure Network Adapter (MANA) support"
> >  depends on PCI_MSI
> > - depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN && ARM64_4K_PAGES)
> > + depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
> >  depends on PCI_HYPERV
> >  select AUXILIARY_BUS
> >  select PAGE_POOL
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 1332db9a08eb..c9df942d0d02 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -182,7 +182,7 @@ int mana_gd_alloc_memory(struct gdma_context *gc,
> > unsigned int length,
> >  dma_addr_t dma_handle;
> >  void *buf;
> >
> > - if (length < PAGE_SIZE || !is_power_of_2(length))
> > + if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
> >         return -EINVAL;
> >
> >  gmi->dev = gc->dev;
> > @@ -717,7 +717,7 @@ EXPORT_SYMBOL_NS(mana_gd_destroy_dma_region,
> > NET_MANA);
> >  static int mana_gd_create_dma_region(struct gdma_dev *gd,
> >                          struct gdma_mem_info *gmi)
> >  {
> > - unsigned int num_page = gmi->length / PAGE_SIZE;
> > + unsigned int num_page = gmi->length / MANA_MIN_QSIZE;
>
> This calculation seems a bit weird when using MANA_MIN_QSIZE. The
> number of pages, and the construction of the page_addr_list array
> a few lines later, seem unrelated to the concept of a minimum queue
> size. Is the right concept really a "mapping chunk", and num_page
> would conceptually be "num_chunks", or something like that?  Then
> a queue must be at least one chunk in size, but that's derived from the
> chunk size, and is not the core concept.

I think calling it "num_chunks" is fine.
May I use "num_chunks" in next version?

>
> Another approach might be to just call it "MANA_PAGE_SIZE", like
> has been done with HV_HYP_PAGE_SIZE.  HV_HYP_PAGE_SIZE exists to
> handle exactly the same issue of the guest PAGE_SIZE potentially
> being different from the fixed 4K size that must be used in host-guest
> communication on Hyper-V.  Same thing here with MANA.

I actually called it "MANA_PAGE_SIZE" in my previous internal patch.
But Paul from Hostnet team opposed using that name, because
4kB is the min q size. MANA doesn't have "page" at HW level.


> >  struct gdma_create_dma_region_req *req = NULL;
> >  struct gdma_create_dma_region_resp resp = {};
> >  struct gdma_context *gc = gd->gdma_context;
> > @@ -727,7 +727,7 @@ static int mana_gd_create_dma_region(struct
> gdma_dev *gd,
> >  int err;
> >  int i;
> >
> > - if (length < PAGE_SIZE || !is_power_of_2(length))
> > + if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
> >         return -EINVAL;
> >
> >  if (offset_in_page(gmi->virt_addr) != 0)
> > @@ -751,7 +751,7 @@ static int mana_gd_create_dma_region(struct
> gdma_dev *gd,
> >  req->page_addr_list_len = num_page;
> >
> >  for (i = 0; i < num_page; i++)
> > -       req->page_addr_list[i] = gmi->dma_handle +  i * PAGE_SIZE;
> > +       req->page_addr_list[i] = gmi->dma_handle +  i *
> MANA_MIN_QSIZE;
> >
> >  err = mana_gd_send_request(gc, req_msg_size, req, sizeof(resp),
> &resp);
> >  if (err)
> > diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > index bbc4f9e16c98..038dc31e09cd 100644
> > --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > @@ -362,12 +362,12 @@ static int mana_hwc_create_cq(struct
> hw_channel_context
> > *hwc, u16 q_depth,
> >  int err;
> >
> >  eq_size = roundup_pow_of_two(GDMA_EQE_SIZE * q_depth);
> > - if (eq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > -       eq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > + if (eq_size < MANA_MIN_QSIZE)
> > +       eq_size = MANA_MIN_QSIZE;
> >
> >  cq_size = roundup_pow_of_two(GDMA_CQE_SIZE * q_depth);
> > - if (cq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > -       cq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > + if (cq_size < MANA_MIN_QSIZE)
> > +       cq_size = MANA_MIN_QSIZE;
> >
> >  hwc_cq = kzalloc(sizeof(*hwc_cq), GFP_KERNEL);
> >  if (!hwc_cq)
> > @@ -429,7 +429,7 @@ static int mana_hwc_alloc_dma_buf(struct
> > hw_channel_context *hwc, u16 q_depth,
> >
> >  dma_buf->num_reqs = q_depth;
> >
> > - buf_size = PAGE_ALIGN(q_depth * max_msg_size);
> > + buf_size = MANA_MIN_QALIGN(q_depth * max_msg_size);
> >
> >  gmi = &dma_buf->mem_info;
> >  err = mana_gd_alloc_memory(gc, buf_size, gmi);
> > @@ -497,8 +497,8 @@ static int mana_hwc_create_wq(struct
> hw_channel_context
> > *hwc,
> >  else
> >         queue_size = roundup_pow_of_two(GDMA_MAX_SQE_SIZE *
> > q_depth);
> >
> > - if (queue_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > -       queue_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > + if (queue_size < MANA_MIN_QSIZE)
> > +       queue_size = MANA_MIN_QSIZE;
> >
> >  hwc_wq = kzalloc(sizeof(*hwc_wq), GFP_KERNEL);
> >  if (!hwc_wq)
> > @@ -628,10 +628,10 @@ static int mana_hwc_establish_channel(struct
> > gdma_context *gc, u16 *q_depth,
> >  init_completion(&hwc->hwc_init_eqe_comp);
> >
> >  err = mana_smc_setup_hwc(&gc->shm_channel, false,
> > -                   eq->mem_info.dma_handle,
> > -                   cq->mem_info.dma_handle,
> > -                   rq->mem_info.dma_handle,
> > -                   sq->mem_info.dma_handle,
> > +                   virt_to_phys(eq->mem_info.virt_addr),
> > +                   virt_to_phys(cq->mem_info.virt_addr),
> > +                   virt_to_phys(rq->mem_info.virt_addr),
> > +                   virt_to_phys(sq->mem_info.virt_addr),
>
> This change seems unrelated to handling guest PAGE_SIZE values
> other than 4K.  Does it belong in a separate patch?  Or maybe it just
> needs an explanation in the commit message of this patch?

I know dma_handle is usually just the phys adr. But this is not always
True if IOMMU is used...
I have no problem to put it to a separate patch if desired.

>
> >                      eq->eq.msix_index);
> >  if (err)
> >         return err;
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index d087cf954f75..6a891dbce686 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -1889,10 +1889,10 @@ static int mana_create_txq(struct
> mana_port_context
> > *apc,
> >   *  to prevent overflow.
> >   */
> >  txq_size = MAX_SEND_BUFFERS_PER_QUEUE * 32;
> > - BUILD_BUG_ON(!PAGE_ALIGNED(txq_size));
> > + BUILD_BUG_ON(!MANA_MIN_QALIGNED(txq_size));
> >
> >  cq_size = MAX_SEND_BUFFERS_PER_QUEUE * COMP_ENTRY_SIZE;
> > - cq_size = PAGE_ALIGN(cq_size);
> > + cq_size = MANA_MIN_QALIGN(cq_size);
> >
> >  gc = gd->gdma_context;
> >
> > @@ -2189,8 +2189,8 @@ static struct mana_rxq *mana_create_rxq(struct
> > mana_port_context *apc,
> >  if (err)
> >         goto out;
> >
> > - rq_size = PAGE_ALIGN(rq_size);
> > - cq_size = PAGE_ALIGN(cq_size);
> > + rq_size = MANA_MIN_QALIGN(rq_size);
> > + cq_size = MANA_MIN_QALIGN(cq_size);
> >
> >  /* Create RQ */
> >  memset(&spec, 0, sizeof(spec));
> > diff --git a/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > b/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > index 5553af9c8085..9a54a163d8d1 100644
> > --- a/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > +++ b/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/io.h>
> >  #include <linux/mm.h>
> >
> > +#include <net/mana/gdma.h>
> >  #include <net/mana/shm_channel.h>
> >
> >  #define PAGE_FRAME_L48_WIDTH_BYTES 6
> > @@ -183,7 +184,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > reset_vf, u64 eq_addr,
> >
> >  /* EQ addr: low 48 bits of frame address */
> >  shmem = (u64 *)ptr;
> > - frame_addr = PHYS_PFN(eq_addr);
> > + frame_addr = MANA_PFN(eq_addr);
> >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
>
> In mana_smc_setup_hwc() a few lines above this change, code using
> PAGE_ALIGNED() is unchanged.  Is it correct that the eq/cq/rq/sq
> addresses
> must be aligned to 64K if PAGE_SIZE is 64K?

Since we still using PHYS_PFN on them, if not aligned to PAGE_SIZE,
the lower bits may be lost. (You said the same below.)

>
> Related, I wonder about how MANA_PFN() is defined. If PAGE_SIZE is 64K,
> MANA_PFN() will first right-shift 16, then left shift 4. The net is
> right-shift 12,
> corresponding to the 4K chunks that MANA expects. But that approach
> guarantees
> that the rightmost 4 bits of the MANA PFN will always be zero. That's
> consistent
> with requiring the addresses to be PAGE_ALIGNED() to 64K, but I'm unclear
> whether
> that is really the requirement. You might compare with the definition of
> HVPFN_DOWN(), which has a similar goal for Linux guests communicating
> with
> Hyper-V.

@Paul Rosswurm You said MANA HW has "no page concept". So the "frame_addr"
In the mana_smc_setup_hwc() is NOT related to physical page number, correct?
Can we just use phys_adr >> 12 like below?

#define MANA_MIN_QSHIFT 12
#define MANA_PFN(a) ((a) >> MANA_MIN_QSHIFT)

      /* EQ addr: low 48 bits of frame address */
     shmem = (u64 *)ptr;
-     frame_addr = PHYS_PFN(eq_addr);
+     frame_addr = MANA_PFN(eq_addr);

>
> > @@ -191,7 +192,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > reset_vf, u64 eq_addr,
> >
> >  /* CQ addr: low 48 bits of frame address */
> >  shmem = (u64 *)ptr;
> > - frame_addr = PHYS_PFN(cq_addr);
> > + frame_addr = MANA_PFN(cq_addr);
> >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > @@ -199,7 +200,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > reset_vf, u64 eq_addr,
> >
> >  /* RQ addr: low 48 bits of frame address */
> >  shmem = (u64 *)ptr;
> > - frame_addr = PHYS_PFN(rq_addr);
> > + frame_addr = MANA_PFN(rq_addr);
> >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > @@ -207,7 +208,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > reset_vf, u64 eq_addr,
> >
> >  /* SQ addr: low 48 bits of frame address */
> >  shmem = (u64 *)ptr;
> > - frame_addr = PHYS_PFN(sq_addr);
> > + frame_addr = MANA_PFN(sq_addr);
> >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
> > index 27684135bb4d..b392559c33e9 100644
> > --- a/include/net/mana/gdma.h
> > +++ b/include/net/mana/gdma.h
> > @@ -224,7 +224,12 @@ struct gdma_dev {
> >  struct auxiliary_device *adev;
> >  };
> >
> > -#define MINIMUM_SUPPORTED_PAGE_SIZE PAGE_SIZE
> > +/* These are defined by HW */
> > +#define MANA_MIN_QSHIFT 12
> > +#define MANA_MIN_QSIZE (1 << MANA_MIN_QSHIFT)
> > +#define MANA_MIN_QALIGN(x) ALIGN((x), MANA_MIN_QSIZE)
> > +#define MANA_MIN_QALIGNED(addr) IS_ALIGNED((unsigned long)(addr),
> MANA_MIN_QSIZE)
> > +#define MANA_PFN(a) (PHYS_PFN(a) << (PAGE_SHIFT - MANA_MIN_QSHIFT))
>
> See comments above about how this is defined.

Replied above.
Thank you for all the detailed comments!

- Haiyang