Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/16390257/

Time: 2020-08-18 22:28:09  Source: igfitidea

Scraping ajax pages using python

Tags: python, ajax, web-scraping, screen-scraping, scrapy

Asked by Lynob

I've already seen this question about scraping ajax, but Python isn't mentioned there. I considered using Scrapy; I believe they have some docs on that subject, but as you can see the website is down, so I don't know what to do. I want to do the following:

I only have one URL, example.com. You go from page to page by clicking submit, and the URL doesn't change because the site uses ajax to display the content. I want to scrape the content of each page. How can I do it?

Let's say that I want to scrape only the numbers. Is there anything other than Scrapy that would do it? If not, could you give me a snippet showing how to do it? Their website is down, so I can't reach the docs.

Accepted answer by alecxe

First of all, scrapy docs are available at https://scrapy.readthedocs.org/en/latest/.

As for handling ajax while web scraping, the idea is rather simple:

  • open the browser developer tools and switch to the network tab
  • go to the target site
  • click the submit button and see what XHR request is sent to the server
  • simulate this XHR request in your spider
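
The steps above could be sketched roughly as follows. This is only an illustration that uses the standard library instead of Scrapy: the endpoint `AJAX_URL` and the form field name `PAGE_FIELD` are made-up placeholders, and the real values have to be copied from the XHR entry shown in the network tab. It also assumes the server replies with text in which the wanted numbers appear as plain runs of digits.

```python
import re
import urllib.parse
import urllib.request

# Hypothetical values: copy the real ones from the XHR entry in the network tab.
AJAX_URL = "http://example.com/ajax/results"  # assumed endpoint, not from the question
PAGE_FIELD = "page"                           # assumed name of the form field


def build_xhr_request(page):
    """Build the POST request that clicking submit would send for `page`."""
    body = urllib.parse.urlencode({PAGE_FIELD: page}).encode("ascii")
    headers = {
        # Some servers use this header to tell ajax calls from normal page loads.
        "X-Requested-With": "XMLHttpRequest",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(AJAX_URL, data=body, headers=headers, method="POST")


def extract_numbers(text):
    """Return every run of digits found in the response body, as strings."""
    return re.findall(r"\d+", text)


def scrape_pages(first, last):
    """Fetch pages first..last (inclusive) and yield (page, numbers) pairs."""
    for page in range(first, last + 1):
        with urllib.request.urlopen(build_xhr_request(page), timeout=10) as resp:
            text = resp.read().decode("utf-8", errors="replace")
        yield page, extract_numbers(text)
```

Calling `list(scrape_pages(1, 3))` would send three POST requests, one per page. If the real XHR turns out to be a GET request or returns JSON, the request builder and the extraction step would need to change accordingly.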

Hope that helps.

Answer by Malak

I found the answer very useful, but I would like to make it simpler.

response = requests.post(request_url, data=payload, headers=request_headers)

requests.post is called here with three arguments: url, data and headers. The values for all three can be found in the XHR request.

Copy the whole request header and the form data from the XHR request into the variables above, and you are good to go.

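
One possible way to load the copied text into those variables is sketched below. The helper names `parse_copied_headers`, `parse_copied_form` and `post_copied_xhr` are invented for this example. It assumes the headers were copied as `Name: value` lines and the form data as a URL-encoded string such as `page=2&q=a+b`; other copy formats would need a different parser. `Content-Length` is dropped on the assumption that requests should compute it from the payload itself.

```python
import urllib.parse


def parse_copied_headers(raw):
    """Turn 'Name: value' lines copied from the network tab into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        name, sep, value = line.partition(":")
        name = name.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not sep or not name:
            continue
        # Leave Content-Length out so it is recomputed for the new payload.
        if name.lower() == "content-length":
            continue
        headers[name] = value.strip()
    return headers


def parse_copied_form(raw):
    """Turn a URL-encoded form body ('a=1&b=2') into a dict for `data=`."""
    return dict(urllib.parse.parse_qsl(raw, keep_blank_values=True))


def post_copied_xhr(request_url, raw_headers, raw_form):
    """Replay the XHR with requests.post, as in the one-liner above."""
    import requests  # third-party; imported here so the parsers work without it

    payload = parse_copied_form(raw_form)
    request_headers = parse_copied_headers(raw_headers)
    return requests.post(request_url, data=payload, headers=request_headers)
```

The two parsers do not touch the network, so they can be tried on the copied text first, before `post_copied_xhr` sends anything.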