javascript: How to parse DOM (REACT)
Original URL: http://stackoverflow.com/questions/29972996/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to parse DOM (REACT)
Asked by quantum285
I am trying to scrape data from a website. The website uses Facebook's React. As such, the source code that I can parse using Jaunt is completely different from the code I see when inspecting the elements using Chrome's inspector.
I know very little about all of this, but having done some research I think this is something to do with DOM rather than the source code. I need a way to be able to get my hands on this DOM code as the original source contains nothing I want, but I don't have the foggiest idea where to begin (even having read many answers on here).
Here is an example of one of the pages I want to scrape. For example, to scrape the description I'd want to grab what is inside this tag:
<span class="light-font extended-card-description list-group-item">Example description....</span>
But as you can see this element only appears when you "Inspect Element", and not when I just view the page's source.
My question to you geniuses on here is, how can I grab this DOM Code and start scraping the elements I actually want to?
Forgive me if my terminology is completely off but as I say this is a completely new area for me, and I've done the research that I can.
Thank you very much in advance!
Answered by Tobia
ReactJS, like many other Javascript libraries / frameworks, uses client-side code (Javascript) to render the final HTML. This means that when you, Jaunt, or your browser fetch the HTML source code from the server, it doesn't yet contain the final code the user will see. The browser needs to run the Javascript program(s) contained in the page, in order to generate the final content you wish to scrape.
My favorite tool for this kind of job is CasperJS.
It (or rather the PhantomJS tool that CasperJS uses) is a headless browser, meaning it's a version of Webkit (like Chrome or Safari) that has been stripped of all the GUI (windows, buttons, menus.) What's left is a tool that you can run from a terminal or from your Java program. It won't show any window on the screen, but it will fetch the webpages you ask it to; run any Javascript they contain; and then respond to your commands, such as "click on this link", "give me that text", "capture a screenshot", and so on.
Let's start with a simple ReactJS example:
We want to scrape the "Hello John" text, but if you look at the plain HTML source (Ctrl+U or Alt+Ctrl+U) you won't see it. On the other hand, if you open the console in your browser and use the following selector, you will get the text:
> document.querySelector('#helloExample .playgroundPreview').textContent
"Hello John"
Here is a simple CasperJS script to do the same thing:
var casper = require("casper").create();

// Load the page, then print the text content of the rendered element.
casper.start("http://facebook.github.io/react/index.html", function() {
    this.echo(this.fetchText("#helloExample .playgroundPreview"));
});

casper.run();
You can save it as hello.js and execute it with casperjs hello.js from a terminal, or use the equivalent Java code Runtime.getRuntime().exec(...)
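The answer mentions launching the script from Java with Runtime.getRuntime().exec(...). As a rough sketch of the same idea in Node.js (a swapped-in host language, not part of the original answer), the function below runs an external command and returns its trimmed standard output. The name runAndCapture is hypothetical, and the commented usage assumes casperjs is on your PATH and hello.js is in the current directory.

```javascript
// Hypothetical helper: run an external command and return what it printed.
const { execFileSync } = require("child_process");

function runAndCapture(command, args) {
    // execFileSync blocks until the process exits and throws
    // if the process ends with a non-zero exit code.
    const stdout = execFileSync(command, args, { encoding: "utf8" });
    return stdout.trim();
}

// Assumed usage (requires casperjs on the PATH):
// var text = runAndCapture("casperjs", ["hello.js"]);
// console.log(text);
```

Passing the arguments as an array, rather than building one command string, avoids shell quoting problems if the script name or URL contains spaces.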
Here is a better script that avoids loading images and third-party resources (such as the Facebook button, Twitter button, Google Analytics, and so on), cutting the loading time by half. It also adds a waitForSelector step, so that we don't risk trying to fetch the text before ReactJS has had a chance to create it.
var casper = require("casper").create({
    pageSettings: {
        loadImages: false  // do not download images
    }
});

// Abort every request whose URL does not start with the page's own host.
casper.on('resource.requested', function(requestData, request) {
    if (requestData.url.indexOf("http://facebook.github.io/") !== 0) {
        request.abort();
    }
});

casper.start("http://facebook.github.io/react/index.html", function() {
    // Wait until ReactJS has rendered the element before reading it.
    this.waitForSelector("#helloExample .playgroundPreview", function() {
        this.echo(this.fetchText("#helloExample .playgroundPreview"));
    });
});

casper.run();
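To see which requests the resource.requested handler above would abort, the prefix test can be pulled out into a small function and tried on a few URLs. This is only an illustrative sketch: shouldAbort and the sample URLs are made up for the example, and the function runs in any JavaScript engine without CasperJS.

```javascript
// Hypothetical helper mirroring the check in the handler above:
// a request is aborted unless its URL starts with the allowed prefix.
function shouldAbort(url, allowedPrefix) {
    return url.indexOf(allowedPrefix) !== 0;
}

var prefix = "http://facebook.github.io/";

// Same scheme and host: allowed through.
console.log(shouldAbort("http://facebook.github.io/react/js/react.js", prefix));
// Third-party script: aborted.
console.log(shouldAbort("http://www.google-analytics.com/analytics.js", prefix));
// Same host over https: also aborted, because the prefix includes the scheme.
console.log(shouldAbort("https://facebook.github.io/react/index.html", prefix));
```

Because the comparison includes the scheme, a site that serves its own resources over https would have them blocked by an http prefix, so the prefix has to match the URLs the page actually requests.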
How to install CasperJS
I have had some trouble scraping ReactJS and other modern Javascript pages with the older versions of PhantomJS and CasperJS, so I recommend installing PhantomJS 2.0 and the latest CasperJS from GitHub.
For PhantomJS you can just download the official 2.0 package.
For CasperJS, since it's a Python script, you should be able to check out the latest commit from GitHub and link bin/casperjs into a directory on your PATH. Here's a script for Linux or Mac OS X:
> git clone https://github.com/n1k0/casperjs.git
> cd casperjs
> ln -sf `pwd`/bin/casperjs /usr/local/bin/casperjs
You may also want to comment out the line printing Warning PhantomJS v2.0 ... from your bin/bootstrap.js file.