JavaScript: Simple screen scraping using jQuery

Question

Asked by Rion Williams

I have been playing with the idea of building a simple screen scraper with jQuery, and I am wondering if the following is possible.

I have a simple HTML page and am attempting (if this is possible) to grab the contents of all of the list items from another page, like so:

Main Page:

<!-- jQuery -->
<script type='text/javascript'>
$(document).ready(function(){
$.getJSON("[URL to other page]",
  function(data){

    //Iterate through the <li> inside of the URL's data
    $.each(data.items, function(item){
      $("<li/>").value().appendTo("#data");
    });

  });
});
</script>

<!-- HTML -->
<html>
    <body>
       <div id='data'></div>
    </body>
</html>

Other Page:

//Html
<body>
    <p><b>Items to Scrape</b></p>   
    <ul>
        <li>I want to scrape what is here</li>
        <li>and what is here</li>
        <li>and here as well</li>
        <li>and append it in the main page</li>
    </ul>
</body>

So, is it possible using jQuery to pull all of the list item contents from an external page and append them inside of a div?

Answer 1

Accepted answer by Ry-

Use $.ajax to load the other page into a variable, then create a temporary element and use .html() to set its contents to the value returned. Loop through the element's children of nodeType 1 and keep their first children's nodeValues. If the external page is not on your web server, you will need to proxy the file through your own web server.

Something like this:

$.ajax({
     url: "/thePageToScrape.html",
     dataType: 'text',
     success: function(data) {
          var elements = $("<div>").html(data)[0].getElementsByTagName("ul")[0].getElementsByTagName("li");
          for(var i = 0; i < elements.length; i++) {
               var theText = elements[i].firstChild.nodeValue;
               // Do something here
          }
     }
});
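
The snippet above reaches the items with getElementsByTagName, while the prose describes looping over children of nodeType 1. A minimal sketch of that literal loop might look like the following, assuming `listElement` is the `<ul>` DOM node; the function name `collectItemText` is hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: collect the text of each element child of a list node.
function collectItemText(listElement) {
  var texts = [];
  var children = listElement.childNodes;
  for (var i = 0; i < children.length; i++) {
    var child = children[i];
    // nodeType 1 is an element; this skips whitespace text nodes and comments
    if (child.nodeType !== 1) continue;
    // An empty <li></li> has no firstChild, so guard against null
    var first = child.firstChild;
    texts.push(first ? first.nodeValue : "");
  }
  return texts;
}
```

Note that nodeValue is null when the first child is itself an element (for example `<li><b>text</b></li>`), so nested markup would need textContent or jQuery's .text() instead.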

Answer 2

Answer by Fareesh Vijayarangam

$.get("/path/to/other/page", function(data) {
  // Find every <li> in the returned HTML and append it to #data
  $('#data').append($('li', data));
});

Answer 3

Answer by hoju

If this is for the same domain then no problem - the jQuery solution is good.

But otherwise you can't access content from an arbitrary website because this is considered a security risk. See same origin policy.
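
To make "same domain" concrete: browsers compare the origin, meaning scheme, host, and port together. A rough sketch of that comparison, using the standard URL API, could look like this. The helper name `isSameOrigin` and the example URLs are assumptions for illustration only.

```javascript
// Hypothetical helper: true when both URLs share scheme, host, and port.
function isSameOrigin(pageUrl, targetUrl) {
  var page = new URL(pageUrl);
  // Resolve targetUrl against pageUrl so relative paths are handled too
  var target = new URL(targetUrl, pageUrl);
  return page.origin === target.origin;
}
```

Under this check, a relative path such as "/thePageToScrape.html" is same-origin with the page that requests it, while the same host on a different scheme or port is not.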

There are of course server-side workarounds, such as a web proxy or CORS headers. Or, if you're lucky, they will support JSONP.
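
For the proxy route, the client side usually only needs to pass the target address to your own server as an encoded query parameter. A minimal sketch, assuming a hypothetical "/proxy" endpoint on your server that fetches the `url` parameter and returns the HTML:

```javascript
// Hypothetical: proxyPath is an endpoint you would implement on your own server.
function buildProxyUrl(proxyPath, targetUrl) {
  // Encode the whole target URL so its own "?" and "&" stay inside the parameter
  return proxyPath + "?url=" + encodeURIComponent(targetUrl);
}
```

The result can be passed to $.get like any same-origin URL, for example `$.get(buildProxyUrl("/proxy", "http://example.com/"), callback)`.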

But if you want a client-side solution that works with an arbitrary website and web browser, then you are out of luck. There is a proposal to relax this policy, but this won't affect current web browsers.

Answer 4

Answer by Camilo Martin

You may want to consider pjscrape:

http://nrabinowitz.github.io/pjscrape/

It allows you to do this from the command line, using JavaScript and jQuery. It does this by using PhantomJS, which is a headless WebKit browser (it has no window and exists only for your script's use, so you can load complex websites that use AJAX and it will work just as if it were a real browser).

The examples are self-explanatory and I believe this works on all platforms (including Windows).

Answer 5

Answer by shramee

Simple scraping with jQuery...

// Get HTML from page
$.get( 'http://example.com/', function( html ) {

    // Loop through elements you want to scrape content from
    $(html).find("ul").find("li").each( function(){

        var text = $(this).text();
        // Do something with content

    } )

} );

Answer 6

Answer by Kurkula

I am sure you will hit a CORS issue with these requests in many cases. From here, try to resolve the CORS issue.

var name = "kk";
var url = "http://anyorigin.com/go?url=" + encodeURIComponent("https://www.yoursite.xyz/") + name + "&callback=?";
$.get(url, function(response) {
  console.log(response);
});

Answer 7

Answer by Skizz

Use YQL or Yahoo Pipes to make the cross-domain request for the raw page HTML content. The Yahoo Pipe or YQL query will return it as JSON, which jQuery can process to extract and display the required data.

On the downside, YQL and Yahoo Pipes obey the robots.txt file for the target domain, and if the page is too long the Yahoo Pipes regex commands will not run.
