Scraping HTML and JavaScript
Disclaimer: this page is a bilingual translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, note the original URL, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/22764322/
Asked by user1934948
I am working on a project in which I need to crawl several websites and gather different kinds of information from them. Information like text, links, images, etc.
I am using Python for this. I have tried BeautifulSoup on the HTML pages and it works, but I am stuck when parsing sites that contain a lot of JavaScript, since most of the information on these pages is stored inside <script> tags.
Any ideas how to do this?
Accepted answer by bosnjak
First of all, scraping and parsing JS from pages is not trivial. It can, however, be vastly simplified if you use a headless web client instead, which will parse everything for you just like a regular browser would.
The only difference is that its main interface is not a GUI/HMI but an API.
For example, you can use PhantomJS, or Chrome or Firefox, both of which support headless mode.
For a more complete list of headless browsers, check here.
Answer by alecxe
If there is a lot of dynamic JavaScript loading involved in the page load, things get more complicated.
Basically, you have 3 ways to crawl the data from the website:
- using browser developer tools, see what AJAX requests are made during page load, then simulate these requests in your crawler. You will probably need the help of the json and requests modules.
- use tools that drive real browsers, like selenium. In this case you don't care how the page is loaded - you'll get what a real user sees. Note: you can use a headless browser too.
- see if the website provides an API (e.g. the walmart API)
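The first of these options can be sketched as follows. Everything specific here is hypothetical: the endpoint URL, the query parameters, and the JSON shape are placeholders for whatever the browser's Network tab actually shows for the site you are crawling:

```python
import json

import requests  # third-party: pip install requests

# Hypothetical AJAX endpoint spotted in the browser's Network tab;
# a real site will have its own URL and query parameters.
API_URL = "https://example.com/api/search"

def fetch_results(query):
    """Replay the AJAX request the page itself would make."""
    resp = requests.get(
        API_URL,
        params={"q": query},
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    resp.raise_for_status()
    return resp.json()  # AJAX responses are usually JSON

def extract_titles(payload):
    """Pull the interesting fields out of the parsed JSON payload."""
    return [item["title"] for item in payload.get("results", [])]

# The parsing step works the same whether the JSON came over the wire or not:
sample = json.loads('{"results": [{"title": "a"}, {"title": "b"}]}')
print(extract_titles(sample))
```

The payload structure varies per site, so inspect a real response before writing the extraction code.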
Also take a look at the Scrapy web-scraping framework - it doesn't handle AJAX calls either, but it is really the best tool in the web-scraping world I've ever worked with.
Also see these resources:
- Web-scraping JavaScript page with Python
- Scraping javascript-generated data using Python
- web scraping dynamic content with python
- How to use Selenium with Python?
- Headless Selenium Testing with Python and PhantomJS
- selenium with scrapy for dynamic page
Hope that helps.
Answer by Ehvince
To get you started with selenium and BeautifulSoup:
Install phantomjs with npm (Node Package Manager):
apt-get install nodejs
npm install phantomjs
install selenium:
pip install selenium
and get the resulting page like this, then parse it with BeautifulSoup as usual:
from bs4 import BeautifulSoup as bs  # the package is bs4, not BeautifulSoup4
from selenium import webdriver
client = webdriver.PhantomJS()
client.get("http://foo")
soup = bs(client.page_source, "html.parser")
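Once you have client.page_source, the BeautifulSoup side is ordinary HTML parsing. A minimal sketch, where the HTML string below is just a stand-in for the rendered page source:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Stand-in for client.page_source from the snippet above
html = '<html><body><a href="/a">A</a><a href="/b">B</a><p>some text</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

links = [a["href"] for a in soup.find_all("a")]  # all link targets
text = soup.get_text(" ", strip=True)            # visible text only
print(links)
```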
Answer by Eduard Florinescu
A very fast way would be to iterate through all the tags and get their textContent.
This is the JS snippet:
var page = ""; var all = document.getElementsByTagName("*"); for (var tag of all) page = page + tag.textContent;
or in selenium/python:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://ranprieur.com")
pagetext = driver.execute_script('var page = ""; var all = document.getElementsByTagName("*"); for (var tag of all) page = page + tag.textContent; return page;')