Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

From: Tariq Toukan
Date: Wed May 16 2018 - 02:32:14 EST

Next message: Paul Mackerras: "Re: [PATCH v2] powerpc: kvm: Change return type to vm_fault_t"
Previous message: George Cherian: "[PATCH 1/4] i2c: xlp9xx: Add support for SMBAlert"
Next in thread: Gi-Oh Kim: "Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 15/05/2018 9:53 PM, Qing Huang wrote:

On 5/15/2018 2:19 AM, Tariq Toukan wrote:

On 14/05/2018 7:41 PM, Qing Huang wrote:

On 5/13/2018 2:00 AM, Tariq Toukan wrote:

On 11/05/2018 10:23 PM, Qing Huang wrote:

When a system is under memory presure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang <qing.huang@xxxxxxxxxx>
Acked-by: Daniel Jurgens <danielj@xxxxxxxxxxxx>
Reviewed-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
---
v2 -> v1: adjusted chunk size to reflect different architectures.

Â drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
Â 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
Â #include "fw.h"
Â Â /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
ÂÂ */
Â enum {
-ÂÂÂ MLX4_ICM_ALLOC_SIZEÂÂÂ = 1 << 18,
-ÂÂÂ MLX4_TABLE_CHUNK_SIZEÂÂÂ = 1 << 18
+ÂÂÂ MLX4_ICM_ALLOC_SIZEÂÂÂ = 1 << PAGE_SHIFT,
+ÂÂÂ MLX4_TABLE_CHUNK_SIZEÂÂÂ = 1 << PAGE_SHIFT

Which is actually PAGE_SIZE.

Yes, we wanted to avoid high order memory allocations.

Then please use PAGE_SIZE instead.

PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFT is actually more appropriate here.

Definition of PAGE_SIZE varies among different archs.
It is not always as simple as 1 << PAGE_SHIFT.
It might be:
PAGE_SIZE (1UL << PAGE_SHIFT)
PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
etc...

Please replace 1 << PAGE_SHIFT with PAGE_SIZE.

Also, please add a comma at the end of the last entry.

Hmm..., followed the existing code style and checkpatch.pl didn't complain about the comma.

I am in favor of having a comma also after the last element, so that when another enum element is added we do not modify this line again, which would falsely affect git blame.

I know it didn't exist before your patch, but once we're here, let's do it.

I'm okay either way. If adding an extra comma is preferred by many people, someone should update checkpatch.pl to enforce it. :)

I agree.
Until then, please use an extra comma in this patch.

Â };
Â Â static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
ÂÂÂÂÂ obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
ÂÂÂÂÂ num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
Â -ÂÂÂ table->icmÂÂÂÂÂ = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+ÂÂÂ table->icmÂÂÂÂÂ = vzalloc(num_icm * sizeof(*table->icm));

Why not kvzalloc ?

I think table->icm pointer array doesn't really need physically contiguous memory. Sometimes high order
memory allocation by kmalloc variants may trigger slow path and cause tasks to be blocked.

This is control path so it is less latency-sensitive.
Let's not produce unnecessary degradation here, please call kvzalloc so we maintain a similar behavior when contiguous memory is available, and a fallback for resiliency.

No sure what exactly degradation is caused by vzalloc here. I think it's better to keep physically contiguous pages
to other requests which really need them. Besides slow path/mem compacting can be really expensive.

Degradation is expected when you replace a contig memory with non-contig memory, without any perf test.
We agree that when contig memory is not available, we should use non-contig instead of simply failing, and for this you can call kvzalloc.

Thanks,
Qing

ÂÂÂÂÂ if (!table->icm)
ÂÂÂÂÂÂÂÂÂ return -ENOMEM;
ÂÂÂÂÂ table->virtÂÂÂÂ = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
ÂÂÂÂÂÂÂÂÂÂÂÂÂ mlx4_free_icm(dev, table->icm[i], use_coherent);
ÂÂÂÂÂÂÂÂÂ }
Â -ÂÂÂ kfree(table->icm);
+ÂÂÂ vfree(table->icm);
Â ÂÂÂÂÂ return -ENOMEM;
Â }
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
ÂÂÂÂÂÂÂÂÂÂÂÂÂ mlx4_free_icm(dev, table->icm[i], table->coherent);
ÂÂÂÂÂÂÂÂÂ }
Â -ÂÂÂ kfree(table->icm);
+ÂÂÂ vfree(table->icm);
Â }

Thanks for your patch.

I need to verify there is no dramatic performance degradation here.
You can prepare and send a v3 in the meanwhile.

Thanks,
Tariq
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html

Next message: Paul Mackerras: "Re: [PATCH v2] powerpc: kvm: Change return type to vm_fault_t"
Previous message: George Cherian: "[PATCH 1/4] i2c: xlp9xx: Add support for SMBAlert"
Next in thread: Gi-Oh Kim: "Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]