JavaScript: Is it possible to write a web crawler in JavaScript?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow post: http://stackoverflow.com/questions/11083522/


Is it possible to write a web crawler in JavaScript?

Tags: javascript, web-crawler

Asked by Ashwin Mendon

I want to crawl a page, check for the hyperlinks on that page, follow those hyperlinks, and capture data from the pages they lead to.


Answered by apsillers

Generally, browser JavaScript can only crawl within the domain of its origin, because fetching pages would be done via Ajax, which is restricted by the Same-Origin Policy.


If the page running the crawler script is on www.example.com, then that script can crawl all the pages on www.example.com, but not the pages of any other origin (unless some edge case applies, e.g., the Access-Control-Allow-Origin header is set for pages on the other server).

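To make that concrete, here is a minimal sketch (not from the original answer) of a same-origin crawl in browser JavaScript, assuming the pages are served as plain HTML from the same origin as the script:

const seen = new Set();

async function crawl(url) {
  if (seen.has(url)) return;
  seen.add(url);

  const response = await fetch(url);          // same-origin Ajax request
  const html = await response.text();
  const doc = new DOMParser().parseFromString(html, 'text/html');

  // ...capture whatever data you need from doc here...

  for (const a of doc.querySelectorAll('a[href]')) {
    const next = new URL(a.getAttribute('href'), url);
    if (next.origin === location.origin) {    // follow only same-origin links
      await crawl(next.href);
    }
  }
}

crawl(location.href);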

If you really want to write a fully-featured crawler in browser JS, you could write a browser extension: for example, Chrome extensions are packaged Web applications that run with special permissions, including cross-origin Ajax. The difficulty with this approach is that you'll have to write multiple versions of the crawler if you want to support multiple browsers. (If the crawler is just for personal use, that's probably not an issue.)

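As a rough sketch (not part of the original answer), the background script of such an extension could fetch cross-origin pages directly, assuming the extension's manifest requests the relevant host permissions (for example "<all_urls>"):

// background.js of a hypothetical extension. This assumes the manifest
// declares host permissions such as "<all_urls>", which is what lifts the
// usual cross-origin restriction for the extension's own requests.
async function fetchPage(url) {
  const response = await fetch(url);   // cross-origin fetch allowed by the extension's permissions
  return response.text();
}

fetchPage('https://www.example.org/').then(html => {
  // parse the HTML and queue further links here
  console.log(html.length);
});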

Answered by Bogdan Emil Mariesan

If you use server-side JavaScript, it is possible. You should take a look at Node.js.


An example of a crawler can be found in the link below:


http://www.colourcoding.net/blog/archive/2010/11/20/a-node.js-web-spider.aspx

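To give a flavour of what that looks like, here is a minimal Node.js crawler sketch (not the code from the linked post) using only the built-in https module and a crude regular expression for link extraction:

const https = require('https');

// Fetch a page body as a string (no redirect or error handling beyond the basics).
function fetchPage(url) {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      let body = '';
      res.on('data', chunk => (body += chunk));
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

async function crawl(startUrl, maxPages = 10) {
  const queue = [startUrl];
  const visited = new Set();

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    const html = await fetchPage(url);
    // A real crawler should use an HTML parser instead of a regex.
    const links = [...html.matchAll(/href="(https?:\/\/[^"]+)"/g)].map(m => m[1]);
    queue.push(...links);
    console.log(url, '->', links.length, 'links found');
  }
}

crawl('https://www.example.com/');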

Answered by Arun

We can crawl pages using JavaScript on the server side with the help of a headless WebKit. For crawling, there are a few libraries such as PhantomJS and CasperJS; there is also a newer wrapper around PhantomJS called Nightmare JS which makes the job easier.

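As a quick sketch of the PhantomJS approach (run with the phantomjs binary; this is not code from the original answer), loading a page and listing its links looks roughly like this:

// Run with: phantomjs crawl.js
var page = require('webpage').create();

page.open('https://www.example.com/', function (status) {
  if (status !== 'success') {
    console.log('Failed to load the page');
    phantom.exit(1);
  }

  // evaluate() runs inside the page context, so the DOM is available there.
  var links = page.evaluate(function () {
    return Array.prototype.map.call(document.querySelectorAll('a[href]'), function (a) {
      return a.href;
    });
  });

  console.log(links.join('\n'));
  phantom.exit();
});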

Answered by Tom

There are ways to circumvent the same-origin policy with JS. I wrote a crawler for Facebook that gathered information from the profiles of my friends and my friends' friends and allowed filtering the results by gender, current location, age, marital status (you catch my drift). It was simple. I just ran it from the console. That way your script gets the privilege to make requests on the current domain. You can also make a bookmarklet to run the script from your bookmarks.

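For illustration (a sketch, not Tom's actual script), collecting every link on the currently open page is about this much code:

// Paste into the browser console on the target page (or prefix with "javascript:" to save it as a bookmarklet).
var links = [].map.call(document.querySelectorAll('a[href]'), function (a) { return a.href; });
console.log(links);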

Another way is to provide a PHP proxy. Your script accesses the proxy on the current domain, and the proxy requests files from the other domain with PHP. Just be careful with those: they might get hijacked and used as a public proxy by a third party if you are not careful.

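On the JavaScript side, the idea is simply to ask a same-origin endpoint to do the cross-origin fetching. A sketch, where proxy.php and its url parameter are hypothetical names of my own, not something from the answer:

// Ask a same-origin PHP proxy to fetch a cross-origin page on our behalf.
// The endpoint name and its parameter are placeholders.
async function fetchViaProxy(targetUrl) {
  const response = await fetch('/proxy.php?url=' + encodeURIComponent(targetUrl));
  return response.text();
}

fetchViaProxy('https://www.example.org/').then(html => console.log(html.length));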

Good luck, and maybe you'll make a friend or two in the process like I did :-)


Answered by user3801836

This is what you need: http://zugravu.com/products/web-crawler-spider-scraping-javascript-regular-expression-nodejs-mongodb They use NodeJS, MongoDB and ExtJS as the GUI.


Answered by Natan Streppel

Google's Chrome team released Puppeteer in August 2017, a Node library which provides a high-level API for both headless and non-headless Chrome (headless Chrome has been available since version 59).


It uses an embedded version of Chromium, so it is guaranteed to work out of the box. If you want to use a specific Chrome version, you can do so by launching Puppeteer with an executable path as a parameter, such as:


const browser = await puppeteer.launch({executablePath: '/path/to/Chrome'});

An example of navigating to a webpage and taking a screenshot of it shows how simple it is (taken from the GitHub page):


const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});

  await browser.close();
})();
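
Since the question is about following hyperlinks, a small variation on that example (a sketch, not from the original answer) could collect every link on the page before closing the browser:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Collect the href of every anchor on the page; these URLs could then be
  // queued and visited in turn to build a simple crawler.
  const links = await page.$$eval('a[href]', anchors => anchors.map(a => a.href));
  console.log(links);

  await browser.close();
})();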

Answered by Maciej Jankowski

My typical setup is to use a browser extension with cross-origin privileges set, which injects both the crawler code and jQuery.

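For example (a sketch, not the author's actual setup), once jQuery has been injected, the link-gathering part of the crawler can be a one-liner:

// With jQuery available in the page, collect every link's href.
var links = $('a[href]').map(function () { return this.href; }).get();
console.log(links);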

Another take on JavaScript crawlers is to use a headless browser like PhantomJS or CasperJS (which boosts Phantom's powers).


Answered by hfarazm

Yes, it is possible:


  1. Use Node.js (it's server-side JS)
  2. There is NPM (a package manager that handles 3rd-party modules) in Node.js
  3. Use PhantomJS in Node.js (PhantomJS is a third-party module that can crawl through websites; see the sketch below)
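
A minimal sketch of that third step, assuming the phantom npm bridge module (version 4 or later) and an installed PhantomJS binary (these specifics are my own choice, not part of the answer):

const phantom = require('phantom');

(async () => {
  const instance = await phantom.create();
  const page = await instance.createPage();

  const status = await page.open('https://www.example.com/');
  console.log('Page load status:', status);

  // Grab the rendered HTML; links could be extracted from it and queued next.
  const content = await page.property('content');
  console.log('Fetched', content.length, 'characters');

  await instance.exit();
})();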

Answered by farhang67

There is a client-side approach for this, using the Firefox Greasemonkey extension. With Greasemonkey you can create scripts to be executed each time you open specified URLs.


Here is an example:


If you have URLs like these:


http://www.example.com/products/pages/1


http://www.example.com/products/pages/2


then you can use something like this to open all pages containing the product list (execute this manually):


// open each product-list page in a new tab, 15 seconds apart
var j = 0;
for (var i = 1; i < 5; i++) {
  setTimeout(function () {
    j = j + 1;
    window.open('http://www.example.com/products/pages/' + j, '_blank');
  }, 15000 * i);
}

Then you can create a script to open all products in a new window for each product-list page, and include this URL pattern in Greasemonkey for that:


http://www.example.com/products/pages/*


And then a script for each product page to extract the data, call a web service passing that data, close the window, and so on.

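A minimal userscript along those lines might look like this; it is only a sketch, the selector and the web-service endpoint are hypothetical, and GM_xmlhttpRequest needs the matching @grant line:

// ==UserScript==
// @name     Product page scraper (sketch)
// @include  http://www.example.com/products/*
// @grant    GM_xmlhttpRequest
// ==/UserScript==

// Pull some data out of the product page; the selector is a placeholder.
var titleElement = document.querySelector('h1');
var title = titleElement ? titleElement.textContent : '';

// Post it to a hypothetical web service, then close the window.
GM_xmlhttpRequest({
  method: 'POST',
  url: 'http://www.example.com/collect',
  data: 'title=' + encodeURIComponent(title),
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
  onload: function () {
    window.close();
  }
});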

Answered by Fan Jin

I made an example JavaScript crawler on GitHub.


It's event-driven and uses an in-memory queue to store all the resources (i.e. URLs).


How to use it in your Node environment:


var Crawler = require('../lib/crawler')
var crawler = new Crawler('http://www.someUrl.com');

// crawler.maxDepth = 4;
// crawler.crawlInterval = 10;
// crawler.maxListenerCurrency = 10;
// crawler.redisQueue = true;
crawler.start();

Here I'm just showing you 2 core methods of a JavaScript crawler.


Crawler.prototype.run = function() {
  var crawler = this;
  process.nextTick(() => {
    //the run loop
    crawler.crawlerIntervalId = setInterval(() => {

      crawler.crawl();

    }, crawler.crawlInterval);
    //kick off first one
    crawler.crawl();
  });

  crawler.running = true;
  crawler.emit('start');
}


Crawler.prototype.crawl = function() {
  var crawler = this;

  if (crawler._openRequests >= crawler.maxListenerCurrency) return;


  //go get the item
  crawler.queue.oldestUnfetchedItem((err, queueItem, index) => {
    if (queueItem) {
      //got the item start the fetch
      crawler.fetchQueueItem(queueItem, index);
    } else if (crawler._openRequests === 0) {
      crawler.queue.complete((err, completeCount) => {
        if (err)
          throw err;
        crawler.queue.getLength((err, length) => {
          if (err)
            throw err;
          if (length === completeCount) {
            //no open requests and no unfetched items, so stop the crawler
            crawler.emit("complete", completeCount);
            clearInterval(crawler.crawlerIntervalId);
            crawler.running = false;
          }
        });
      });
    }

  });
};

Here is the GitHub link: https://github.com/bfwg/node-tinycrawler. It is a JavaScript web crawler written in under 1,000 lines of code. This should put you on the right track.
