[PATCH] final PCI patches for 2.6.26

From: Jesse Barnes
Date: Thu Jun 12 2008 - 16:34:19 EST


Some of these may not have seen review on linux-pci, so I'm sending them out
for comment before I send Linus the pull request.

Any comments/objections?

The biggest patch by far is the resource_wc stuff; it should have been pushed
earlier, but for some reason I thought Ingo & co. were going to do that...

Thanks,
Jesse

commit a123c3d9643464c56417680536ed9515f15cffa1
Author: Yinghai Lu <Yinghai.Lu@xxxxxxx>
Date: Mon May 12 21:21:05 2008 +0200

PCI: use dev_to_node in pci_call_probe

to make sure get one online node.

Signed-off-by: Yinghai Lu <yinghai.lu@xxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Signed-off-by: Jesse Barnes <jbarnes@xxxxxxxxxxxxxxxx>

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 72cf61e..e1637bd 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -181,7 +181,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
any need to change it. */
struct mempolicy *oldpol;
cpumask_t oldmask = current->cpus_allowed;
- int node = pcibus_to_node(dev->bus);
+ int node = dev_to_node(&dev->dev);

if (node >= 0) {
node_to_cpumask_ptr(nodecpumask, node);

commit d99f75679d17ada3e90a04c50605f90a6a590e45
Author: Miquel van Smoorenburg <mikevs@xxxxxxxxxx>
Date: Thu Jun 5 18:14:44 2008 +0200

x86/PCI: pci-dma.c: don't always add __GFP_NORETRY to gfp

Currently arch/x86/kernel/pci-dma.c always adds __GFP_NORETRY
to the allocation flags, because it wants to be reasonably
sure not to deadlock when calling alloc_pages().

But really that should only be done in two cases:
- when allocating memory in the lower 16 MB DMA zone.
If there's no free memory there, waiting or OOM killing is of no use
- when optimistically trying an allocation in the DMA32 zone
when dma_mask < DMA_32BIT_MASK hoping that the allocation
happens to fall within the limits of the dma_mask

Also blindly adding __GFP_NORETRY to the the gfp variable might
not be a good idea since we then also use it when calling
dma_ops->alloc_coherent(). Clearing it might also not be a
good idea, dma_alloc_coherent()'s caller might have set it
on purpose. The gfp variable should not be clobbered.

[ mingo@xxxxxxx: converted to delta patch ontop of previous version. ]

Signed-off-by: Miquel van Smoorenburg <miquels@xxxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
Signed-off-by: Jesse Barnes <jbarnes@xxxxxxxxxxxxxxxx>

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 069e843..dc00a13 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -378,6 +378,7 @@ dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
struct page *page;
unsigned long dma_mask = 0;
dma_addr_t bus;
+ int noretry = 0;

/* ignore region specifiers */
gfp &= ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32);
@@ -397,19 +398,25 @@ dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
if (dev->dma_mask == NULL)
return NULL;

+ /* Don't invoke OOM killer or retry in lower 16MB DMA zone */
+ if (gfp & __GFP_DMA)
+ noretry = 1;
+
#ifdef CONFIG_X86_64
/* Why <=? Even when the mask is smaller than 4GB it is often
larger than 16MB and in this case we have a chance of
finding fitting memory in the next higher zone first. If
not retry with true GFP_DMA. -AK */
- if (dma_mask <= DMA_32BIT_MASK && !(gfp & GFP_DMA))
+ if (dma_mask <= DMA_32BIT_MASK && !(gfp & GFP_DMA)) {
gfp |= GFP_DMA32;
+ if (dma_mask < DMA_32BIT_MASK)
+ noretry = 1;
+ }
#endif

again:
- /* Don't invoke OOM killer or retry in lower 16MB DMA zone */
page = dma_alloc_pages(dev,
- (gfp & GFP_DMA) ? gfp | __GFP_NORETRY : gfp, get_order(size));
+ noretry ? gfp | __GFP_NORETRY : gfp, get_order(size));
if (page == NULL)
return NULL;


commit 7a7d16722262b50fb8433d1340153bceabf84af2
Author: Miquel van Smoorenburg <mikevs@xxxxxxxxxx>
Date: Wed May 28 10:31:25 2008 +0200

x86/PCI: pci-dma.c: use __GFP_NO_OOM instead of __GFP_NORETRY

On Wed, 2008-05-28 at 04:47 +0200, Andi Kleen wrote:
> > So... why not just remove the setting of __GFP_NORETRY? Why is it
> > wrong to oom-kill things in this case?
>
> When the 16MB zone overflows (which can be common in some workloads)
> calling the OOM killer is pretty useless because it has barely any
> real user data [only exception would be the "only 16MB" case Alan
> mentioned]. Killing random processes in this case is bad.
>
> I think for 16MB __GFP_NORETRY is ok because there should be
> nothing freeable in there so looping is useless. Only exception would be the
> "only 16MB total" case again but I'm not sure 2.6 supports that at all
> on x86.
>
> On the other hand d_a_c() does more allocations than just 16MB, especially
> on 64bit and the other zones need different strategies.

Okay, so how about this then ?

Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
Signed-off-by: Jesse Barnes <jbarnes@xxxxxxxxxxxxxxxx>

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index c5ef1af..069e843 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -397,9 +397,6 @@ dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
if (dev->dma_mask == NULL)
return NULL;

- /* Don't invoke OOM killer */
- gfp |= __GFP_NORETRY;
-
#ifdef CONFIG_X86_64
/* Why <=? Even when the mask is smaller than 4GB it is often
larger than 16MB and in this case we have a chance of
@@ -410,7 +407,9 @@ dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
#endif

again:
- page = dma_alloc_pages(dev, gfp, get_order(size));
+ /* Don't invoke OOM killer or retry in lower 16MB DMA zone */
+ page = dma_alloc_pages(dev,
+ (gfp & GFP_DMA) ? gfp | __GFP_NORETRY : gfp, get_order(size));
if (page == NULL)
return NULL;


commit db80e22298e1c774d7d59ee7ba622abd6248f8a5
Author: Jesse Barnes <jbarnes@xxxxxxxxxx>
Date: Thu Jun 12 12:39:32 2008 -0700

PCI: fixup write combine comment in pci_mmap_resource

Now that we can actually do write combining properly, there's no need to have
the FIXME.

Signed-off-by: Jesse Barnes <jbarnes@xxxxxxxxxxxxxxxx>

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 9ec7d39..6f3c744 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -492,7 +492,6 @@ pci_mmap_legacy_mem(struct kobject *kobj, struct bin_attribute *attr,
* @write_combine: 1 for write_combine mapping
*
* Use the regular PCI mapping routines to map a PCI resource into userspace.
- * FIXME: write combining? maybe automatic for prefetchable regions?
*/
static int
pci_mmap_resource(struct kobject *kobj, struct bin_attribute *attr,

commit 3361cb322a7aee901b7cb478c3bec65b8f2e7695
Author: venkatesh.pallipadi@xxxxxxxxx <venkatesh.pallipadi@xxxxxxxxx>
Date: Tue Mar 18 17:00:22 2008 -0700

x86: PAT export resource_wc in pci sysfs

For the ranges with IORESOURCE_PREFETCH, export a new resource_wc interface in
pci /sysfs along with resource (which is uncached).

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@xxxxxxxxx>
Signed-off-by: Suresh Siddha <suresh.b.siddha@xxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxx>
Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Signed-off-by: Jesse Barnes <jbarnes@xxxxxxxxxxxxxxxx>

diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt
index 5daa2aa..68ef488 100644
--- a/Documentation/filesystems/sysfs-pci.txt
+++ b/Documentation/filesystems/sysfs-pci.txt
@@ -36,6 +36,7 @@ files, each with their own function.
local_cpus nearby CPU mask (cpumask, ro)
resource PCI resource host addresses (ascii, ro)
resource0..N PCI resource N, if present (binary, mmap)
+ resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap)
rom PCI ROM resource, if present (binary, ro)
subsystem_device PCI subsystem device (ascii, ro)
subsystem_vendor PCI subsystem vendor (ascii, ro)
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 271d41c..9ec7d39 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -489,13 +489,14 @@ pci_mmap_legacy_mem(struct kobject *kobj, struct bin_attribute *attr,
* @kobj: kobject for mapping
* @attr: struct bin_attribute for the file being mapped
* @vma: struct vm_area_struct passed into the mmap
+ * @write_combine: 1 for write_combine mapping
*
* Use the regular PCI mapping routines to map a PCI resource into userspace.
* FIXME: write combining? maybe automatic for prefetchable regions?
*/
static int
pci_mmap_resource(struct kobject *kobj, struct bin_attribute *attr,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma, int write_combine)
{
struct pci_dev *pdev = to_pci_dev(container_of(kobj,
struct device, kobj));
@@ -518,7 +519,21 @@ pci_mmap_resource(struct kobject *kobj, struct bin_attribute *attr,
vma->vm_pgoff += start >> PAGE_SHIFT;
mmap_type = res->flags & IORESOURCE_MEM ? pci_mmap_mem : pci_mmap_io;

- return pci_mmap_page_range(pdev, vma, mmap_type, 0);
+ return pci_mmap_page_range(pdev, vma, mmap_type, write_combine);
+}
+
+static int
+pci_mmap_resource_uc(struct kobject *kobj, struct bin_attribute *attr,
+ struct vm_area_struct *vma)
+{
+ return pci_mmap_resource(kobj, attr, vma, 0);
+}
+
+static int
+pci_mmap_resource_wc(struct kobject *kobj, struct bin_attribute *attr,
+ struct vm_area_struct *vma)
+{
+ return pci_mmap_resource(kobj, attr, vma, 1);
}

/**
@@ -541,9 +556,46 @@ pci_remove_resource_files(struct pci_dev *pdev)
sysfs_remove_bin_file(&pdev->dev.kobj, res_attr);
kfree(res_attr);
}
+
+ res_attr = pdev->res_attr_wc[i];
+ if (res_attr) {
+ sysfs_remove_bin_file(&pdev->dev.kobj, res_attr);
+ kfree(res_attr);
+ }
}
}

+static int pci_create_attr(struct pci_dev *pdev, int num, int write_combine)
+{
+ /* allocate attribute structure, piggyback attribute name */
+ int name_len = write_combine ? 13 : 10;
+ struct bin_attribute *res_attr;
+ int retval;
+
+ res_attr = kzalloc(sizeof(*res_attr) + name_len, GFP_ATOMIC);
+ if (res_attr) {
+ char *res_attr_name = (char *)(res_attr + 1);
+
+ if (write_combine) {
+ pdev->res_attr_wc[num] = res_attr;
+ sprintf(res_attr_name, "resource%d_wc", num);
+ res_attr->mmap = pci_mmap_resource_wc;
+ } else {
+ pdev->res_attr[num] = res_attr;
+ sprintf(res_attr_name, "resource%d", num);
+ res_attr->mmap = pci_mmap_resource_uc;
+ }
+ res_attr->attr.name = res_attr_name;
+ res_attr->attr.mode = S_IRUSR | S_IWUSR;
+ res_attr->size = pci_resource_len(pdev, num);
+ res_attr->private = &pdev->resource[num];
+ retval = sysfs_create_bin_file(&pdev->dev.kobj, res_attr);
+ } else
+ retval = -ENOMEM;
+
+ return retval;
+}
+
/**
* pci_create_resource_files - create resource files in sysfs for @dev
* @dev: dev in question
@@ -557,31 +609,19 @@ static int pci_create_resource_files(struct pci_dev *pdev)

/* Expose the PCI resources from this device as files */
for (i = 0; i < PCI_ROM_RESOURCE; i++) {
- struct bin_attribute *res_attr;

/* skip empty resources */
if (!pci_resource_len(pdev, i))
continue;

- /* allocate attribute structure, piggyback attribute name */
- res_attr = kzalloc(sizeof(*res_attr) + 10, GFP_ATOMIC);
- if (res_attr) {
- char *res_attr_name = (char *)(res_attr + 1);
-
- pdev->res_attr[i] = res_attr;
- sprintf(res_attr_name, "resource%d", i);
- res_attr->attr.name = res_attr_name;
- res_attr->attr.mode = S_IRUSR | S_IWUSR;
- res_attr->size = pci_resource_len(pdev, i);
- res_attr->mmap = pci_mmap_resource;
- res_attr->private = &pdev->resource[i];
- retval = sysfs_create_bin_file(&pdev->dev.kobj, res_attr);
- if (retval) {
- pci_remove_resource_files(pdev);
- return retval;
- }
- } else {
- return -ENOMEM;
+ retval = pci_create_attr(pdev, i, 0);
+ /* for prefetchable resources, create a WC mappable file */
+ if (!retval && pdev->resource[i].flags & IORESOURCE_PREFETCH)
+ retval = pci_create_attr(pdev, i, 1);
+
+ if (retval) {
+ pci_remove_resource_files(pdev);
+ return retval;
}
}
return 0;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 509159b..d18b1dd 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -206,6 +206,7 @@ struct pci_dev {
struct bin_attribute *rom_attr; /* attribute descriptor for sysfs ROM entry */
int rom_attr_enabled; /* has display of the rom attribute been enabled? */
struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
+ struct bin_attribute *res_attr_wc[DEVICE_COUNT_RESOURCE]; /* sysfs file for WC mapping of
resources */
#ifdef CONFIG_PCI_MSI
struct list_head msi_list;
#endif
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/