Re: [PATCH v3] lightnvm: physical block device (pblk) target
From: Javier GonzÃlez
Date: Sun Apr 09 2017 - 05:15:37 EST
Hi Bart,
Thanks for reviewing the code.
> On 8 Apr 2017, at 22.56, Bart Van Assche <bart.vanassche@xxxxxxxxxxx> wrote:
>
> On 04/07/17 11:50, Javier GonzÃlez wrote:
>> Documentation/lightnvm/pblk.txt | 21 +
>> drivers/lightnvm/Kconfig | 19 +
>> drivers/lightnvm/Makefile | 5 +
>> drivers/lightnvm/pblk-cache.c | 112 +++
>> drivers/lightnvm/pblk-core.c | 1641 ++++++++++++++++++++++++++++++++++++++
>> drivers/lightnvm/pblk-gc.c | 542 +++++++++++++
>> drivers/lightnvm/pblk-init.c | 942 ++++++++++++++++++++++
>> drivers/lightnvm/pblk-map.c | 135 ++++
>> drivers/lightnvm/pblk-rb.c | 859 ++++++++++++++++++++
>> drivers/lightnvm/pblk-read.c | 513 ++++++++++++
>> drivers/lightnvm/pblk-recovery.c | 1007 +++++++++++++++++++++++
>> drivers/lightnvm/pblk-rl.c | 182 +++++
>> drivers/lightnvm/pblk-sysfs.c | 500 ++++++++++++
>> drivers/lightnvm/pblk-write.c | 408 ++++++++++
>> drivers/lightnvm/pblk.h | 1127 ++++++++++++++++++++++++++
>> include/linux/lightnvm.h | 57 +-
>> pblk-sysfs.c | 581 ++++++++++++++
>
> This patch introduces two slightly different versions of pblk-sysfs.c -
> one at the top level and one in drivers/lightnvm. Please remove the file
> at the top level.
The top level file is a mistake; I'll remove it.
>
>> +config NVM_PBLK_L2P_OPT
>> + bool "PBLK with optimized L2P table for devices up to 8TB"
>> + depends on NVM_PBLK
>> + ---help---
>> + Physical device addresses are typical 64-bit integers. Since we store
>> + the logical to physical (L2P) table on the host, this takes 1/500 of
>> + host memory (e.g., 2GB per 1TB of storage media). On drives under 8TB,
>> + it is possible to reduce this to 1/1000 (e.g., 1GB per 1TB). This
>> + option allows to do this optimization on the host L2P table.
>
> Why is NVM_PBLK_L2P_OPT a compile-time option instead of a run-time
> option? Since this define does not affect the definition of the ppa_addr
> I don't see why this has to be a compile-time option. For e.g. Linux
> distributors the only choice would be to disable NVM_PBLK_L2P_OPT. I
> think it would be unfortunate if no Linux distribution ever would be
> able to benefit from this optimization.
struct ppa_addr, which is the physical address format is not affected,
but pblk's internal L2P address representation (pblk_addr) is. You can
see that the type either represents struct ppa_addr or ppa_addr_32. How
would you define a type that can either be u64 or u32 with different bit
offsets at run-time? Note that address conversions to this type is in
the fast path and this format allows us to only use bit shifts.
>
>> +#ifdef CONFIG_NVM_DEBUG
>> + atomic_add(nr_entries, &pblk->inflight_writes);
>> + atomic_add(nr_entries, &pblk->req_writes);
>> +#endif
>
> Has it been considered to use the "static key" feature such that
> consistency checks can be enabled at run-time instead of having to
> rebuild the kernel to enable CONFIG_NVM_DEBUG?
I haven't considered it. I'll look into it. I would like to have this
counters and the corresponding sysfs entry only available on debug mode
since it allows us to have a good picture of the FTL state.
>
>> +#ifdef CONFIG_NVM_DEBUG
>> + BUG_ON(nr_rec_entries != valid_entries);
>> + atomic_add(valid_entries, &pblk->inflight_writes);
>> + atomic_add(valid_entries, &pblk->recov_gc_writes);
>> +#endif
>
> Are you aware that Linus is strongly opposed against using BUG_ON()?
>
Yes, I am aware of the discussions around BUG_ON. The rationale on the
cases we have it is that they represent a pblk internal state error.
This will most probably result on a wild memory access or an
out-of-bound, which will eventually cause the kernel to crash either
way. You can see that most of them are under CONFIG_NVM_DEBUG. Would it
make sense to maintain CONFIG_NVM_DEBUG so that all BUG_ON checks are
contained within them? As far as I can read, this is not possible with
"static key"
>> +#ifdef CONFIG_NVM_DEBUG
>> + lockdep_assert_held(&l_mg->free_lock);
>> +#endif
>
> Why is lockdep_assert_held() surrounded with #ifdef CONFIG_NVM_DEBUG /
> #endif? Are you aware that lockdep_assert_held() do not generate any
> code with CONFIG_PROVE_LOCKING=n?
I did not know about CONFIG_PROVE_LOCKING, thanks for pointing it out.
>
>> +static const struct block_device_operations pblk_fops = {
>> + .owner = THIS_MODULE,
>> +};
>
> Is this data structure useful? If so, where is pblk_fops used?
It is not useful anymore. I'll remove it.
>
>> +static void pblk_l2p_free(struct pblk *pblk)
>> +{
>> + vfree(pblk->trans_map);
>> +}
>> +
>> +static int pblk_l2p_init(struct pblk *pblk)
>> +{
>> + sector_t i;
>> +
>> + pblk->trans_map = vmalloc(sizeof(pblk_addr) * pblk->rl.nr_secs);
>> + if (!pblk->trans_map)
>> + return -ENOMEM;
>> +
>> + for (i = 0; i < pblk->rl.nr_secs; i++)
>> + pblk_ppa_set_empty(&pblk->trans_map[i]);
>> +
>> + return 0;
>> +}
>
> Has it been considered to add support for keeping a subset of the L2P
> translation table in memory instead of keeping it in memory in its entirety?
Yes. L2P caching is on our roadmap and will be included in the future.
>
>> + sprintf(cache_name, "pblk_line_m_%s", pblk->disk->disk_name);
>
> Please use snprintf() or kasprintf() instead of printf(). That makes it
> easier for humans to verify that no buffer overflow is triggered.
>
Ok.
>> +/* physical block device target */
>> +static struct nvm_tgt_type tt_pblk = {
>> + .name = "pblk",
>> + .version = {1, 0, 0},
>
> Are version numbers useful inside the kernel tree?
It allows us to relate in to the disk format version, but most probably
the version on the actual disk format structures is enough.
>
>> +void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
>> + unsigned long *lun_bitmap, unsigned int valid_secs,
>> + unsigned int off)
>> +{
>> + struct pblk_sec_meta *meta_list = rqd->meta_list;
>> + unsigned int map_secs;
>> + int min = pblk->min_write_pgs;
>> + int i;
>> +
>> + for (i = off; i < rqd->nr_ppas; i += min) {
>> + map_secs = (i + min > valid_secs) ? (valid_secs % min) : min;
>> + pblk_map_page_data(pblk, sentry + i, &rqd->ppa_list[i],
>> + lun_bitmap, &meta_list[i], map_secs);
>> + }
>> +}
>
> Has it been considered to implement the above code such that no modulo
> (%) computation is needed, which is a relatively expensive operation? I
> think for this loop that can be done easily.
I'll look into it.
>
>> +static DECLARE_RWSEM(pblk_rb_lock);
>> +
>> +void pblk_rb_data_free(struct pblk_rb *rb)
>> +{
>> + struct pblk_rb_pages *p, *t;
>> +
>> + down_write(&pblk_rb_lock);
>> + list_for_each_entry_safe(p, t, &rb->pages, list) {
>> + free_pages((unsigned long)page_address(p->pages), p->order);
>> + list_del(&p->list);
>> + kfree(p);
>> + }
>> + up_write(&pblk_rb_lock);
>> +}
>
> Global locks like pblk_rb_lock are a performance bottleneck on
> multi-socket systems. Why is that lock global?
This global lock is only used for the write buffer init and exit; it
does not touch the fast path. On tear down it guarantees that we flush
all the data present on the buffer before freeing the buffer itself.
>
> Bart.
Javier
Attachment:
signature.asc
Description: Message signed with OpenPGP