Python: selenium with scrapy for dynamic pages
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, credit the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/17975471/
selenium with scrapy for dynamic page
Asked by Z. Lin
I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this:
- starts with a product_list page with 10 products
- a click on "next" button loads the next 10 products (url doesn't change between the two pages)
- I use LinkExtractor to follow each product link into the product page and get all the information I need
I tried to replicate the next-button AJAX call but couldn't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Where should I put the Selenium part in my Scrapy spider?
My spider is pretty standard, like the following:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.log import INFO
from scrapy.selector import HtmlXPathSelector


class ProductSpider(CrawlSpider):
    name = "product_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/shanghai']
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'),
             callback='parse_product'),
    ]

    def parse_product(self, response):
        self.log("parsing product %s" % response.url, level=INFO)
        hxs = HtmlXPathSelector(response)
        # actual data follows
Any ideas are appreciated. Thank you!
Accepted answer by alecxe
It really depends on how you need to scrape the site and what data you want to get, and how.
Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:
import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # locate the browser-rendered "next page" link
                next_button = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next_button.click()

                # get the data and write it to scrapy items
            except NoSuchElementException:
                # no "next" link left -- we are on the last page
                break

        self.driver.close()
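At the "# get the data" comment you still need to turn the browser-rendered HTML into items. A minimal sketch of one way to do that (the XPaths and field names are hypothetical, not from the answer, and it assumes a Scrapy version that provides Selector(text=...) and extract_first()) is to feed the page source back into a Scrapy selector:

from scrapy.selector import Selector


def items_from_page_source(page_source):
    """Turn Selenium-rendered HTML into plain item dicts."""
    sel = Selector(text=page_source)
    # hypothetical XPaths for an eBay-like result list
    for product in sel.xpath('//li[contains(@class, "sresult")]'):
        yield {
            'title': product.xpath('.//h3/a/text()').extract_first(),
            'price': product.xpath('.//span[contains(@class, "prc")]/text()').extract_first(),
        }

Inside the while loop the spider could then do something like: for item in items_from_page_source(self.driver.page_source): yield item.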
Here are some examples of "selenium spiders":
- Executing Javascript Submit form functions using scrapy in python
- https://gist.github.com/cheekybastard/4944914
- https://gist.github.com/irfani/1045108
- http://snipplr.com/view/66998/
There is also an alternative to having to use Selenium with Scrapy. In some cases, the ScrapyJS middleware is enough to handle the dynamic parts of a page. Sample real-world usage:
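As a rough sketch of what that setup might look like (assuming the scrapyjs package and a Splash service listening on localhost:8050; check the project's README for the exact setting names and values), the middleware is enabled in settings.py and requests then carry a 'splash' key in their meta to get a rendered page back:

# settings.py -- a sketch, assuming scrapyjs + a local Splash instance
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'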
Answered by Expired Brain
If the url doesn't change between the two pages, then you should add dont_filter=True to your scrapy.Request(), or Scrapy will treat this url as a duplicate after processing the first page.
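For example, a minimal sketch (the spider name and URL are placeholders, not from the question):

import scrapy


class NextPageSpider(scrapy.Spider):
    name = 'next_page'
    start_urls = ['http://example.com/shanghai']

    def parse(self, response):
        # ... extract the products currently shown ...

        # the "next page" request hits the same URL, so tell Scrapy's
        # duplicate filter not to drop it (stop once no new products appear)
        yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)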
If you need to render pages with javascript you should use scrapy-splash; you can also check this scrapy middleware, which can handle javascript pages using Selenium, or you can do that by launching any headless browser.
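With scrapy-splash, for instance, the spider asks Splash to render the javascript before the response reaches the callback. A minimal sketch (assuming a Splash instance is running and the scrapy_splash middlewares are enabled in settings.py; the URL is a placeholder):

import scrapy
from scrapy_splash import SplashRequest


class JsPageSpider(scrapy.Spider):
    name = 'js_page'

    def start_requests(self):
        # 'wait' gives the page's javascript a moment to run before the HTML is returned
        yield SplashRequest('http://example.com/shanghai', self.parse, args={'wait': 2})

    def parse(self, response):
        # response.body now contains the rendered HTML
        self.logger.info('rendered page length: %d', len(response.body))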
But a more effective and faster solution is to inspect your browser and see what requests are made while submitting a form or triggering a certain event. Try to simulate the same requests your browser sends. If you can replicate the request(s) correctly, you will get the data you need.
Here is an example:
import json

from scrapy import Request, Spider

from myproject.items import QuoteItem  # hypothetical items module; a sketch follows below


class ScrollScraper(Spider):
    name = "scrollingscraper"
    quote_url = "http://quotes.toscrape.com/api/quotes?page="
    start_urls = [quote_url + "1"]

    def parse(self, response):
        print response.body  # debug: dump the raw JSON payload
        data = json.loads(response.body)

        for item in data.get('quotes', []):
            # build a fresh item for every quote instead of mutating one shared instance
            quote_item = QuoteItem()
            quote_item['author'] = item.get('author', {}).get('name')
            quote_item['quote'] = item.get('text')
            quote_item['tags'] = item.get('tags')
            yield quote_item

        # the JSON payload tells us whether another page of quotes exists
        if data['has_next']:
            next_page = data['page'] + 1
            yield Request(self.quote_url + str(next_page))
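The spider above assumes a QuoteItem defined elsewhere in the project; a minimal sketch of what that item could look like (field names taken from the parse() method above):

import scrapy


class QuoteItem(scrapy.Item):
    # fields referenced by ScrollScraper
    author = scrapy.Field()
    quote = scrapy.Field()
    tags = scrapy.Field()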
When the pagination url is the same for every page and uses a POST request, you can use scrapy.FormRequest() instead of scrapy.Request(); both are the same, but FormRequest adds a new argument (formdata=) to the constructor.
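A minimal sketch of the difference (hypothetical URL and form fields):

import scrapy

# plain GET request
req = scrapy.Request('http://example.com/ajax?page=2')

# POST request: FormRequest url-encodes the formdata dict into the request body
form_req = scrapy.FormRequest('http://example.com/ajax',
                              formdata={'page': '2', 'sorter': 'recent'})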
Here is another spider example from this post:
import json

import scrapy
from scrapy.http import FormRequest
from scrapy.selector import Selector


class SpiderClass(scrapy.Spider):
    # spider name and all
    name = 'ajax'
    page_incr = 1
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)
        if self.page_incr > 1:
            # responses after the first page are JSON; the rendered HTML sits in 'content'
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        # your code here

        # pagination code starts here
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return