[SUGGESTION]: drop virtual merge accounting in I/O requests
From: Mikulas Patocka
Date: Thu Jul 10 2008 - 17:57:23 EST
Hi
I'm getting crashes with InitIO A100u2w controller on Sparc64 (I had to
fix the endianity issues in the driver, but that's unrelated).
When I examined the crashes, it turned out that SCSI layer passed requests
with too many segments. The controller has at most 32 SG entries per
request. It sets shost->sg_tablesize to 32, but despite this, larger
requests were submitted to it --- this resulted in overwriting random
memory and crashes.
A typical scenario, in the beginning of inia100_queue, looked like this:
A number of segments returned by scsi_sg_count(cmd) == 33
cmd->request->nr_hw_segments == 32
cmd->request->nr_phys_segments == 37
cmd->sdb.table.nents == 33
cmd->sdb.table.orig_nents == 37
The SCSI layer submitted a request with 33 scatter-gather segments.
When I was searching for a reson for this, I found that virtual merging
thing. The idea behind virtual merging is this: on architectures with
IOMMU, the kernel can program IOMMU in such a way that several
discontinuous pages in memory appear as continuous space for the PCI
device. Thus, entries in device's SG table are saved. The virtual merging
alone is good and harmless idea --- it just improves performance a bit by
reducing the number of regions visible to the device.
But now, look at the block layer:
The block layer, when merging requests, must make sure that it won't
overflow the device's SG table. So it accounts the number of physical
segments and it tries to account the number of segments after virtual
merging by the IOMMU is performed (see function blk_recalc_rq_segments and
similar). The block layer calls BIOVEC_VIRT_MERGEABLE and
BIOVEC_VIRT_OVERSIZE (these macros use architecture-specific constants
BIO_VMERGE_BOUNDARY and BIO_VMERGE_MAX_SIZE) to check if two segments can
be merged by the IOMMU and sets rq->nr_hw_segments appropriatelly.
What was causing the bug in my case - IO layer expected that two segments
could be virtually merged and so merged the I/O requests. However, the
IOMMU function dma_4u_map_sg didn't merge these segments (the specific
check that was failing was "outs->dma_length + s->length > max_seg_size"
--- but any of the three checks in that function could trigger that bug).
So it produced one more SG entry than the IO layer accounted and crashed
the SCSI host driver.
--
When I thought about it more, I realized that this accounting of virtual
segments in I/O layer can't work correctly at all. If an architecture
defines symbols BIOVEC_VIRT_MERGEABLE and BIOVEC_VIRT_OVERSIZE, it
declares that it's IOMMU must merge any two regions satisfying these
conditions. But in an IOMMU, it is impossible to guarantee, because:
* the bus address is allocated basiclly randomly, so we can hit
dev->dma_parms->segment_boundary_mask any time. This will prevent virtual
merging from happenning. I/O layer doesn't know the bus address at the
time it merges requests, so it can't predict when this happens.
* the IOMMU isn't guaranteed to find a continuous space in it's bus
address space. If it skips over already mapped regions, it can't perform
virtual merging.
* when creating the mapping, we can hit per-device limit
"dev->dma_parms->max_segment_size" --- but the I/O layer checks only
against global limit BIOVEC_VIRT_OVERSIZE. (this last issue is fixable,
the previous two are not).
Basically, the IOMMU treats virtual merging as an option that it can do to
increase performance, but doesn't have to do it in some boundary
conditions. And the I/O layer treats virtual merging as a mandatory
feature and expects that any two regions satisfying the alignment and size
can be virtually merged.
I would suggest to drop the virtual segment accounting from the I/O layer
at all. (I am not suggesting to drop the virtual merging itself). For SCSI
drivers with large SGLIST, dropping the virtual segment accounting will
make no difference (the merge will happen anyway, just the I/O layer won't
know about it in advance). For SCSI drivers with small SGLIST, like
A100u2w, dropping virtual merge accounting may slightly decrease
performance (the I/O layer won't use maximum request size considering
virtual merging) --- but it will make these drivers work. They don't work
now, because these off-by-one overshoots in number of segments are
inevitable.
Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/