Python Scrapy: Extract links and text

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not the translator). Original: http://stackoverflow.com/questions/27753232/


Scrapy: Extract links and text

python, web-scraping, scrapy, scrapy-spider

Asked by Prakhar Mohan Srivastava

I am new to Scrapy and I am trying to scrape a page on the Ikea website: the basic page with the list of locations, as given here.

My items.py file is given below:

import scrapy


class IkeaItem(scrapy.Item):

    name = scrapy.Field()
    link = scrapy.Field()

And the spider is given below:

import scrapy
from ikea.items import IkeaItem
class IkeaSpider(scrapy.Spider):
    name = 'ikea'

    allowed_domains = ['http://www.ikea.com/']

    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()

            yield item

On running the spider I am not getting any output. The JSON file output is something like:

[[{"link": [], "name": []}

The output that I am looking for is the name of the location and the link. I am getting nothing. Where am I going wrong?

Accepted answer by alecxe

There is a simple mistake inside the xpath expressions for the item fields. The loop is already going over the a tags; you don't need to specify a in the inner xpath expressions. In other words, currently you are searching for a tags inside the a tags inside the td inside tr, which obviously results in nothing.

Replace a/text() with text(), and a/@href with @href.

(tested - works for me)

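For reference, a minimal sketch of the spider with that fix applied (same project layout and item definition as above):

import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ['www.ikea.com']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        # the loop already iterates over the <a> nodes, so the inner
        # expressions are relative to each matched <a>
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('text()').extract()  # link text
            item['link'] = sel.xpath('@href').extract()   # link target
            yield item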

Answered by Ganesh

use this....


    item['name'] = sel.xpath('//a/text()').extract()
    item['link'] = sel.xpath('//a/@href').extract()
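
One thing to be aware of with this version: XPath expressions that start with // search the whole document rather than the current selector, so inside the loop each item would collect every link on the page. The relative text() and @href expressions from the accepted answer keep the extraction scoped to the single <a> element matched in that iteration.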