Scraping AJAX pages using Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/16390257/
Scraping ajax pages using python
Asked by Lynob
I've already seen this question about scraping AJAX, but Python isn't mentioned there. I considered using Scrapy; I believe they have some docs on that subject, but as you can see, the website is down. So I don't know what to do. I want to do the following:
I only have one URL, example.com. You go from page to page by clicking submit; the URL doesn't change since they're using AJAX to display the content. I want to scrape the content of each page. How do I do it?
Let's say that I want to scrape only the numbers. Is there anything other than Scrapy that would do it? If not, could you give me a snippet on how to do it, since their website is down and I can't reach the docs?
Accepted answer by alecxe
First of all, the Scrapy docs are available at https://scrapy.readthedocs.org/en/latest/.
As for handling AJAX while web scraping, the idea is rather simple (a rough sketch follows the list below):
- open browser developer tools, Network tab
- go to the target site
- click the submit button and see what XHR request is going to the server
- simulate this XHR request in your spider
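A minimal sketch of such a spider, assuming the endpoint returns JSON; the endpoint URL, form field, and key names below are placeholders, so replace them with the values shown for the XHR request captured in the developer tools:

import json

import scrapy


class AjaxSpider(scrapy.Spider):
    # Sketch of a spider that replays the XHR request seen in the Network tab.
    name = "ajax_example"

    def start_requests(self):
        # Placeholder URL and form data; copy the real ones from the captured XHR request.
        yield scrapy.FormRequest(
            "http://example.com/ajax/endpoint",
            formdata={"page": "1"},
            callback=self.parse_page,
        )

    def parse_page(self, response):
        # Many AJAX endpoints return JSON rather than HTML.
        data = json.loads(response.text)
        yield {"numbers": data.get("numbers")}

Once the placeholders are filled in, it can be run with, for example, scrapy runspider spider.py -o items.json.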
Also see:
- Can scrapy be used to scrape dynamic content from websites that are using AJAX?
- Pagination using scrapy
Hope that helps.
Answer by Malak
I found the answer very useful, but I would like to make it simpler.
import requests

response = requests.post(request_url, data=payload, headers=request_headers)
requests.post takes three arguments: url, data, and headers. The values for these three can be found in the XHR request.
Copy the whole request header and form data into the above variables and you are good to go.
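A self-contained sketch of filling in those variables; the URL, form fields, and headers below are invented placeholders, so substitute the values shown for the captured XHR request:

import requests

# Placeholder values; replace them with what the Network tab shows for the XHR request.
request_url = "http://example.com/ajax/endpoint"
payload = {"page": "2"}
request_headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # many AJAX endpoints expect this header
}

response = requests.post(request_url, data=payload, headers=request_headers)
print(response.status_code)
print(response.text)  # or response.json() if the endpoint returns JSON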

