
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/2835505/

Date: 2020-08-13 13:25:42  Source: igfitidea

How to "scan" a website (or page) for info, and bring it into my program?

Tags: java, html, web-scraping, jsoup

Asked by James

Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java).


For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description?


What would this process even be called? I have no idea where to even begin researching this.


Edit: Okay, I'm running a test of the Jsoup example (the one posted by BalusC), but I keep getting this error:


Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
at org.jsoup.parser.Parser.parse(Parser.java:76)
at org.jsoup.parser.Parser.parse(Parser.java:51)
at org.jsoup.Jsoup.parse(Jsoup.java:28)
at org.jsoup.Jsoup.parse(Jsoup.java:56)
at test.main(test.java:12)

I do have Apache Commons


Accepted answer by BalusC

Use an HTML parser like Jsoup. It has my preference above the other HTML parsers available in Java since it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable so that you can iterate over it in an enhanced for loop (so there's no need to hassle with verbose Node and NodeList-like classes in the average Java DOM parser).


Here's a basic kick-off example (just put the latest Jsoup JAR file in the classpath):


package com.stackoverflow.q2835505;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

As you might have guessed, this prints your own question and the names of all answerers.

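Applied to the product-page scenario from the original question, the same selector approach looks like this. The CSS selectors and markup below are made up for illustration (a real product page would need its selectors inspected in the browser's dev tools), so the sketch parses an inline string rather than fetching a live page:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ProductScrape {

    public static void main(String[] args) {
        // Stand-in for a downloaded product page; the class names here
        // are hypothetical and must be adapted to the real site's markup
        String html = "<html><body>"
                + "<h1 class=\"product-title\">Blu-ray Player</h1>"
                + "<span class=\"price\">$99.99</span>"
                + "<div class=\"description\">Plays discs.</div>"
                + "</body></html>";

        Document doc = Jsoup.parse(html);

        // One CSS selector per field you want to pull out
        String title = doc.select("h1.product-title").text();
        String price = doc.select("span.price").text();
        String description = doc.select("div.description").text();

        System.out.println(title + " | " + price + " | " + description);
    }
}
```

For a live page you would swap `Jsoup.parse(html)` for `Jsoup.connect(url).get()`, as in the example above.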

Answered by Roman

You may use an html parser (many useful links here: java html parser).


The process is called 'grabbing website content'. Search for 'grab website content java' for further investigation.


Answered by Nelson

Look into the cURL library. I've never used it in Java, but I'm sure there must be bindings for it. Basically, what you'll do is send a cURL request to whatever page you want to 'scrape'. The request will return a string containing the page's source code. From there, you would use regex to parse whatever data you want out of the source. That's generally how you would do it.

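The fetch-then-regex idea can be sketched with just the JDK's java.util.regex; an inline string stands in for the downloaded source, since the HTTP step is whatever client you choose. Note that regex works for simple, flat fields but gets fragile fast on nested HTML, which is why the parser-based answers are usually preferable:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrape {

    public static void main(String[] args) {
        // In practice this string would come back from the HTTP request
        String source = "<html><head><title>Best Buy - Blu-ray Player</title></head>"
                + "<body><span id=\"price\">$99.99</span></body></html>";

        // Non-greedy group, case-insensitive because real-world HTML mixes tag casing
        Pattern titlePattern = Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE);
        Matcher m = titlePattern.matcher(source);
        if (m.find()) {
            System.out.println("Title: " + m.group(1)); // prints "Title: Best Buy - Blu-ray Player"
        }
    }
}
```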

Answered by sblundy

This is referred to as screen scraping; Wikipedia has this article on the more specific web scraping. It can be a major challenge because there's some ugly, messed-up, broken-if-not-for-browser-cleverness HTML out there, so good luck.


Answered by Kurru

You'd probably want to look at the HTML to see if you can find strings that are unique and near your text, then you can use line/char-offsets to get to the data.


Could be awkward in Java, if there aren't any XML classes similar to the ones found in System.Xml.Linq in C#.

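A minimal sketch of that anchor-string idea in plain Java. The markers below are assumptions about a made-up page layout; this approach only works while the surrounding markup stays exactly the same:

```java
public class IndexOfScrape {

    public static void main(String[] args) {
        // Stand-in for a fetched page; the markers below are assumptions
        // about its layout and break as soon as the site changes
        String source = "<div><span class=\"price\">$99.99</span></div>";

        String startMarker = "<span class=\"price\">";
        String endMarker = "</span>";

        // Find the unique string just before the data, then cut out
        // everything up to the closing marker
        int start = source.indexOf(startMarker);
        if (start >= 0) {
            start += startMarker.length();
            int end = source.indexOf(endMarker, start);
            String price = source.substring(start, end);
            System.out.println(price); // prints "$99.99"
        }
    }
}
```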

Answered by mdma

I would use JTidy - it is similar to Jsoup, but I don't know Jsoup well. JTidy handles broken HTML and returns a w3c Document, so you can use this as a source for XSLT to extract the content you are really interested in. If you don't know XSLT, then you might as well go with Jsoup, as its Document model is nicer to work with than w3c's.

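A rough sketch of that w3c Document → XSLT pipeline, using only the JDK. Hand-written well-formed XHTML parsed with DocumentBuilder stands in for JTidy's cleanup step (JTidy would turn messy real-world HTML into a w3c Document much like this one), and a tiny stylesheet pulls out one field:

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.w3c.dom.Document;

public class XsltExtract {

    public static void main(String[] args) throws Exception {
        // Stand-in for JTidy's output: already well-formed markup
        String xhtml = "<html><body><h1>A Page Title</h1>"
                + "<p>Some body text</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));

        // Tiny stylesheet that extracts just the <h1> text
        String xslt = "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:output method=\"text\"/>"
                + "<xsl:template match=\"/\">"
                + "<xsl:value-of select=\"//h1\"/>"
                + "</xsl:template>"
                + "</xsl:stylesheet>";

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));

        System.out.println(out); // the <h1> text
    }
}
```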

EDIT: A quick look at the JSoup website shows that JSoup may indeed be the better choice. It seems to support CSS selectors out of the box for extracting stuff from the document. This may be a lot easier to work with than getting into XSLT.


Answered by Anton

The JSoup solution is great, but if you need to extract just something really simple, it may be easier to use regex or String.indexOf.


As others have already mentioned, the process is called scraping.


Answered by Kalpesh Soni

jsoup supports Java 1.5


https://github.com/tburch/jsoup/commit/d8ea84f46e009a7f144ee414a9fa73ea187019a3


It looks like that stack trace was caused by a bug, which has since been fixed.


Answered by lipido

You could also try jARVEST.


It is based on a JRuby DSL over a pure-Java engine to spider-scrape-transform web sites.


Example:


Find all links inside a web page (wget and xpath are constructs of jARVEST's language):


wget | xpath('//a/@href')

Inside a Java program:


Jarvest jarvest = new Jarvest();
String[] results = jarvest.exec(
    "wget | xpath('//a/@href')", // robot!
    "http://www.google.com"      // inputs
);
for (String s : results) {
    System.out.println(s);
}

Answered by Louis-wht

My answer probably won't be useful to the writer of this question (I am 8 months late, so not the right timing I guess), but I think it will probably be useful for many other developers that might come across this answer.


Today, I just released (in the name of my company) a complete HTML-to-POJO framework that you can use to map HTML to any POJO class with just a few annotations. The library itself is quite handy and features many other things, all the while being very pluggable. You can have a look at it right here: https://github.com/whimtrip/jwht-htmltopojo


How to use: Basics


Imagine we need to parse the following HTML page:


<html>
    <head>
        <title>A Simple HTML Document</title>
    </head>
    <body>
        <div class="restaurant">
            <h1>A la bonne Franquette</h1>
            <p>French cuisine restaurant for gourmet of fellow french people</p>
            <div class="location">
                <p>in <span>London</span></p>
            </div>
            <p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>  
            <div class="meals">
                <div class="meal">
                    <p>Veal Cutlet</p>
                    <p rating-color="green">4.5/5 stars</p>
                    <p>Chef Mr. Frenchie</p>
                </div>

                <div class="meal">
                    <p>Ratatouille</p>
                    <p rating-color="orange">3.6/5 stars</p>
                    <p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
                </div>

            </div> 
        </div>    
    </body>
</html>

Let's create the POJOs we want to map it to:


public class Restaurant {

    @Selector( value = "div.restaurant > h1")
    private String name;

    @Selector( value = "div.restaurant > p:nth-child(2)")
    private String description;

    @Selector( value = "div.restaurant > div:nth-child(3) > p > span")    
    private String location;    

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+)\\. Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        indexForRegexPattern = 1,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Long id;

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+)\\. Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        // This time, we want the second regex group and not the first one anymore
        indexForRegexPattern = 2,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Integer rank;

    @Selector(value = ".meal")    
    private List<Meal> meals;

    // getters and setters

}

And now the Meal class as well:


public class Meal {

    @Selector(value = "p:nth-child(1)")
    private String name;

    @Selector(
        value = "p:nth-child(2)",
        format = "^([0-9.]+)/5 stars$",
        indexForRegexPattern = 1
    )
    private Float stars;

    @Selector(
        value = "p:nth-child(2)",
        // rating-color custom attribute can be used as well
        attr = "rating-color"
    )
    private String ratingColor;

    @Selector(
        value = "p:nth-child(3)"
    )
    private String chefs;

    // getters and setters.
}

We provide some more explanations of the above code on our GitHub page.


For the moment, let's see how to scrape this.


private static final String MY_HTML_FILE = "my-html-file.html";

public static void main(String[] args) throws IOException {


    HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();

    HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);

    // If there were several restaurants on the same page,
    // you would need to create a parent POJO containing
    // a list of Restaurants, as shown with the meals here
    Restaurant restaurant = adapter.fromHtml(getHtmlBody());

    // That's it, do some magic now!

}


private static String getHtmlBody() throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
    return new String(encoded, Charset.forName("UTF-8"));

}

Another short example can be found here


Hope this will help someone out there!
