Scraping HTML and JavaScript
Disclaimer: this page is a bilingual translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, note the original URL, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/22764322/
Asked by user1934948
I am working on a project in which I need to crawl several websites and gather different kinds of information from them. Information like text, links, images, etc.
I am using Python for this. I have tried BeautifulSoup on the HTML pages and it works, but I am stuck when parsing sites that contain a lot of JavaScript, since most of the information on these pages is stored inside <script> tags.
Any ideas how to do this?
Accepted answer by bosnjak
First of all, scraping and parsing JS from pages is not trivial. It can, however, be vastly simplified if you use a headless web client instead, which will parse everything for you just like a regular browser would.
The only difference is that its main interface is not a GUI/HMI but an API.
For example, you can use PhantomJS, or Chrome or Firefox, both of which support headless mode.
For a more complete list of headless browsers, check here.
Answer by alecxe
If there is a lot of dynamic JavaScript loading involved in the page load, things get more complicated.
Basically, you have 3 ways to crawl the data from the website:
- using browser developer tools, see what AJAX requests are made during page load, then simulate these requests in your crawler. You will probably need the help of the json and requests modules.
- use tools that drive real browsers, like selenium. In this case you don't care how the page is loaded - you'll get what a real user sees. Note: you can use a headless browser too.
- see if the website provides an API (e.g. the walmart API)
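The first of these options can be sketched as follows. Everything specific here is hypothetical: the endpoint URL, the query parameters, and the JSON shape are placeholders for whatever the browser's Network tab actually shows for the site you are crawling:

```python
import json

import requests  # third-party: pip install requests

# Hypothetical AJAX endpoint spotted in the browser's Network tab;
# a real site will have its own URL and query parameters.
API_URL = "https://example.com/api/search"

def fetch_results(query):
    """Replay the AJAX request the page itself would make."""
    resp = requests.get(
        API_URL,
        params={"q": query},
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    resp.raise_for_status()
    return resp.json()  # AJAX responses are usually JSON

def extract_titles(payload):
    """Pull the interesting fields out of the parsed JSON payload."""
    return [item["title"] for item in payload.get("results", [])]

# The parsing step works the same whether the JSON came over the wire or not:
sample = json.loads('{"results": [{"title": "a"}, {"title": "b"}]}')
print(extract_titles(sample))
```

The payload structure varies per site, so inspect a real response before writing the extraction code.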
Also take a look at the Scrapy web-scraping framework - it doesn't handle AJAX calls either, but it is really the best tool in the web-scraping world I've ever worked with.
Also see these resources:
- Web-scraping JavaScript page with Python
- Scraping javascript-generated data using Python
- web scraping dynamic content with python
- How to use Selenium with Python?
- Headless Selenium Testing with Python and PhantomJS
- selenium with scrapy for dynamic page
Hope that helps.
Answer by Ehvince
To get you started with selenium and BeautifulSoup:
Install phantomjs with npm (Node Package Manager):
apt-get install nodejs
npm install phantomjs
install selenium:
pip install selenium
and get the resulting page like this, then parse it with BeautifulSoup as usual:
from bs4 import BeautifulSoup as bs  # the package is bs4, not BeautifulSoup4
from selenium import webdriver
client = webdriver.PhantomJS()
client.get("http://foo")
soup = bs(client.page_source, "html.parser")
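Once you have client.page_source, the BeautifulSoup side is ordinary HTML parsing. A minimal sketch, where the HTML string below is just a stand-in for the rendered page source:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Stand-in for client.page_source from the snippet above
html = '<html><body><a href="/a">A</a><a href="/b">B</a><p>some text</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

links = [a["href"] for a in soup.find_all("a")]  # all link targets
text = soup.get_text(" ", strip=True)            # visible text only
print(links)
```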
Answer by Eduard Florinescu
A very fast way would be to iterate through all the tags and get their textContent.
This is the JS snippet:
var page = ""; var all = document.getElementsByTagName("*"); for (var tag of all) page = page + tag.textContent;
or in selenium/python:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://ranprieur.com")
pagetext = driver.execute_script('var page = ""; var all = document.getElementsByTagName("*"); for (var tag of all) page = page + tag.textContent; return page;')