Python Scrapy: Extract links and text

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/27753232/


Scrapy: Extract links and text

python, web-scraping, scrapy, scrapy-spider

Asked by Prakhar Mohan Srivastava

I am new to Scrapy and I am trying to scrape a page on the Ikea website: the basic page with the list of locations, as given here.

My items.py file is given below:

import scrapy


class IkeaItem(scrapy.Item):

    name = scrapy.Field()
    link = scrapy.Field()

And the spider is given below:

import scrapy
from ikea.items import IkeaItem
class IkeaSpider(scrapy.Spider):
    name = 'ikea'

    allowed_domains = ['http://www.ikea.com/']

    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()

            yield item

On running the spider I am not getting any output. The JSON file output is something like:

[[{"link": [], "name": []}

The output that I am looking for is the name of the location and the link. I am getting nothing. Where am I going wrong?

Accepted answer by alecxe

There is a simple mistake inside the xpath expressions for the item fields. The loop is already going over the a tags; you don't need to specify a in the inner xpath expressions. In other words, currently you are searching for a tags inside the a tags inside the td inside tr, which obviously results in nothing.

Replace a/text() with text(), and a/@href with @href.

(tested - works for me)

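For reference, a minimal sketch of the spider with that fix applied (same project layout and item definition as above):

import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ['www.ikea.com']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        # the loop already iterates over the <a> nodes, so the inner
        # expressions are relative to each matched <a>
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('text()').extract()  # link text
            item['link'] = sel.xpath('@href').extract()   # link target
            yield item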

Answered by Ganesh

use this....


    item['name'] = sel.xpath('//a/text()').extract()
    item['link'] = sel.xpath('//a/@href').extract()
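
One thing to be aware of with this version: XPath expressions that start with // search the whole document rather than the current selector, so inside the loop each item would collect every link on the page. The relative text() and @href expressions from the accepted answer keep the extraction scoped to the single <a> element matched in that iteration.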