Re: [PATCH v3] lightnvm: physical block device (pblk) target

From: Bart Van Assche
Date: Sat Apr 08 2017 - 16:56:48 EST


On 04/07/17 11:50, Javier González wrote:
> Documentation/lightnvm/pblk.txt | 21 +
> drivers/lightnvm/Kconfig | 19 +
> drivers/lightnvm/Makefile | 5 +
> drivers/lightnvm/pblk-cache.c | 112 +++
> drivers/lightnvm/pblk-core.c | 1641 ++++++++++++++++++++++++++++++++++++++
> drivers/lightnvm/pblk-gc.c | 542 +++++++++++++
> drivers/lightnvm/pblk-init.c | 942 ++++++++++++++++++++++
> drivers/lightnvm/pblk-map.c | 135 ++++
> drivers/lightnvm/pblk-rb.c | 859 ++++++++++++++++++++
> drivers/lightnvm/pblk-read.c | 513 ++++++++++++
> drivers/lightnvm/pblk-recovery.c | 1007 +++++++++++++++++++++++
> drivers/lightnvm/pblk-rl.c | 182 +++++
> drivers/lightnvm/pblk-sysfs.c | 500 ++++++++++++
> drivers/lightnvm/pblk-write.c | 408 ++++++++++
> drivers/lightnvm/pblk.h | 1127 ++++++++++++++++++++++++++
> include/linux/lightnvm.h | 57 +-
> pblk-sysfs.c | 581 ++++++++++++++

This patch introduces two slightly different versions of pblk-sysfs.c -
one at the top level and one in drivers/lightnvm. Please remove the file
at the top level.

> +config NVM_PBLK_L2P_OPT
> + bool "PBLK with optimized L2P table for devices up to 8TB"
> + depends on NVM_PBLK
> + ---help---
> + Physical device addresses are typical 64-bit integers. Since we store
> + the logical to physical (L2P) table on the host, this takes 1/500 of
> + host memory (e.g., 2GB per 1TB of storage media). On drives under 8TB,
> + it is possible to reduce this to 1/1000 (e.g., 1GB per 1TB). This
> + option allows to do this optimization on the host L2P table.

Why is NVM_PBLK_L2P_OPT a compile-time option instead of a run-time
option? Since this define does not affect the definition of the ppa_addr
I don't see why this has to be a compile-time option. For e.g. Linux
distributors the only choice would be to disable NVM_PBLK_L2P_OPT. I
think it would be unfortunate if no Linux distribution ever would be
able to benefit from this optimization.

> +#ifdef CONFIG_NVM_DEBUG
> + atomic_add(nr_entries, &pblk->inflight_writes);
> + atomic_add(nr_entries, &pblk->req_writes);
> +#endif

Has it been considered to use the "static key" feature such that
consistency checks can be enabled at run-time instead of having to
rebuild the kernel to enable CONFIG_NVM_DEBUG?

> +#ifdef CONFIG_NVM_DEBUG
> + BUG_ON(nr_rec_entries != valid_entries);
> + atomic_add(valid_entries, &pblk->inflight_writes);
> + atomic_add(valid_entries, &pblk->recov_gc_writes);
> +#endif

Are you aware that Linus is strongly opposed against using BUG_ON()?

> +#ifdef CONFIG_NVM_DEBUG
> + lockdep_assert_held(&l_mg->free_lock);
> +#endif

Why is lockdep_assert_held() surrounded with #ifdef CONFIG_NVM_DEBUG /
#endif? Are you aware that lockdep_assert_held() do not generate any
code with CONFIG_PROVE_LOCKING=n?

> +static const struct block_device_operations pblk_fops = {
> + .owner = THIS_MODULE,
> +};

Is this data structure useful? If so, where is pblk_fops used?

> +static void pblk_l2p_free(struct pblk *pblk)
> +{
> + vfree(pblk->trans_map);
> +}
> +
> +static int pblk_l2p_init(struct pblk *pblk)
> +{
> + sector_t i;
> +
> + pblk->trans_map = vmalloc(sizeof(pblk_addr) * pblk->rl.nr_secs);
> + if (!pblk->trans_map)
> + return -ENOMEM;
> +
> + for (i = 0; i < pblk->rl.nr_secs; i++)
> + pblk_ppa_set_empty(&pblk->trans_map[i]);
> +
> + return 0;
> +}

Has it been considered to add support for keeping a subset of the L2P
translation table in memory instead of keeping it in memory in its entirety?

> + sprintf(cache_name, "pblk_line_m_%s", pblk->disk->disk_name);

Please use snprintf() or kasprintf() instead of printf(). That makes it
easier for humans to verify that no buffer overflow is triggered.

> +/* physical block device target */
> +static struct nvm_tgt_type tt_pblk = {
> + .name = "pblk",
> + .version = {1, 0, 0},

Are version numbers useful inside the kernel tree?

> +void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
> + unsigned long *lun_bitmap, unsigned int valid_secs,
> + unsigned int off)
> +{
> + struct pblk_sec_meta *meta_list = rqd->meta_list;
> + unsigned int map_secs;
> + int min = pblk->min_write_pgs;
> + int i;
> +
> + for (i = off; i < rqd->nr_ppas; i += min) {
> + map_secs = (i + min > valid_secs) ? (valid_secs % min) : min;
> + pblk_map_page_data(pblk, sentry + i, &rqd->ppa_list[i],
> + lun_bitmap, &meta_list[i], map_secs);
> + }
> +}

Has it been considered to implement the above code such that no modulo
(%) computation is needed, which is a relatively expensive operation? I
think for this loop that can be done easily.

> +static DECLARE_RWSEM(pblk_rb_lock);
> +
> +void pblk_rb_data_free(struct pblk_rb *rb)
> +{
> + struct pblk_rb_pages *p, *t;
> +
> + down_write(&pblk_rb_lock);
> + list_for_each_entry_safe(p, t, &rb->pages, list) {
> + free_pages((unsigned long)page_address(p->pages), p->order);
> + list_del(&p->list);
> + kfree(p);
> + }
> + up_write(&pblk_rb_lock);
> +}

Global locks like pblk_rb_lock are a performance bottleneck on
multi-socket systems. Why is that lock global?

Bart.