Python Scrapy: Extract links and text
Note: this page is a side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/27753232/
Asked by Prakhar Mohan Srivastava
I am new to Scrapy and I am trying to scrape the Ikea website. The basic page with the list of locations is given here.
My items.py file is given below:
import scrapy

class IkeaItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
And the spider is given below:
import scrapy
from ikea.items import IkeaItem

class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['http://www.ikea.com/']
    start_urls = ['http://www.ikea.com/']

    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
When I run the spider I do not get any useful output. The JSON output file looks something like:
[[{"link": [], "name": []}
The output that I am looking for is the name of the location and the link. I am getting nothing. Where am I going wrong?
Accepted answer by alecxe
There is a simple mistake inside the XPath expressions for the item fields. The loop is already going over the "a" tags, so you don't need to specify "a" again in the inner XPath expressions. In other words, currently you are searching for "a" tags inside the "a" tags inside the "td" inside the "tr", which obviously results in nothing.
Replace a/text() with text() and a/@href with @href.
(tested - works for me)
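The effect of a relative path can be reproduced without Scrapy. The following is only a minimal sketch, assuming a made-up table snippet (the real page markup is not shown in the question) and using the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. It shows that asking an "a" element for a nested "a" yields nothing, while reading from the element itself yields the name and the link.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the actual Ikea page is not reproduced in the question.
HTML = """
<table>
  <tr><td><a href="/us/en/">United States</a></td></tr>
  <tr><td><a href="/se/sv/">Sweden</a></td></tr>
</table>
""".strip()

root = ET.fromstring(HTML)

# Rough equivalent of the outer loop: every <a> inside tr/td.
links = root.findall(".//tr/td/a")

# Buggy lookup: searches for an <a> child *inside* each <a>, so it finds none.
buggy = [[child.text for child in a.findall("a")] for a in links]

# Fixed lookup: reads the text and href from the <a> element itself.
fixed = [{"name": a.text, "link": a.get("href")} for a in links]

print(buggy)
print(fixed)
```

In the spider itself the same idea corresponds to the replacement described in this answer: sel.xpath('text()') and sel.xpath('@href') evaluated on each selected "a".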
Answer by Ganesh
Use this:
item['name'] = sel.xpath('//a/text()').extract()
item['link'] = sel.xpath('//a/@href').extract()