Java 如何“扫描”网站(或页面)以获取信息,并将其带入我的程序?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/2835505/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to "scan" a website (or page) for info, and bring it into my program?
提问by James
Well, I'm pretty much trying to figure out how to pull information from a webpage, and bring it into my program (in Java).
嗯,我非常想弄清楚如何从网页中提取信息,并将其带入我的程序(在 Java 中)。
For example, if I know the exact page I want info from, for the sake of simplicity a Best Buy item page, how would I get the appropriate info I need off of that page? Like the title, price, description?
例如,如果我知道我想要获取信息的确切页面,简单起见,比如某个 Best Buy 商品页面,我该如何从该页面获取我需要的信息?比如标题、价格、描述?
What would this process even be called? I have no idea where to even begin researching this.
这个过程究竟叫什么?我甚至不知道该从哪里开始研究。
Edit: Okay, I'm running a test for the JSoup(the one posted by BalusC), but I keep getting this error:
编辑:好的,我正在对 JSoup(BalusC 发布的那个)进行测试,但我不断收到此错误:
Exception in thread "main" java.lang.NoSuchMethodError: java.util.LinkedList.peekFirst()Ljava/lang/Object;
at org.jsoup.parser.TokenQueue.consumeWord(TokenQueue.java:209)
at org.jsoup.parser.Parser.parseStartTag(Parser.java:117)
at org.jsoup.parser.Parser.parse(Parser.java:76)
at org.jsoup.parser.Parser.parse(Parser.java:51)
at org.jsoup.Jsoup.parse(Jsoup.java:28)
at org.jsoup.Jsoup.parse(Jsoup.java:56)
at test.main(test.java:12)
I do have Apache Commons
我确实有 Apache Commons
采纳答案by BalusC
Use a HTML parser like Jsoup. This has my preference above the other HTML parsers available in Java since it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable so that you can iterate over it in an enhanced for loop (so there's no need to hassle with verbose Node and NodeList like classes in the average Java DOM parser).
使用像 Jsoup 这样的 HTML 解析器。相比 Java 中其他可用的 HTML 解析器,我更偏爱它,因为它支持类似 jQuery 的 CSS 选择器。此外,它表示节点列表的类 Elements 实现了 Iterable,因此你可以在增强 for 循环中迭代它(无需像一般的 Java DOM 解析器那样,与冗长的 Node 和 NodeList 之类的类打交道)。
Here's a basic kickoff example (just put the latest Jsoup JAR file in the classpath):
这是一个基本的入门示例(只需将最新的 Jsoup JAR 文件放到类路径中):
package com.stackoverflow.q2835505;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}
As you might have guessed, this prints your own question and the names of all answerers.
您可能已经猜到了,这会打印您自己的问题和所有回答者的姓名。
回答by Roman
You may use an html parser (many useful links here: java html parser).
您可以使用 html 解析器(这里有许多有用的链接:java html parser)。
The process is called 'grabbing website content'. Search 'grab website content java' for further investigation.
该过程称为“抓取网站内容”。可以搜索“grab website content java”进一步研究。
回答by Nelson
Look into the cURL library. I've never used it in Java, but I'm sure there must be bindings for it. Basically, what you'll do is send a cURL request to whatever page you want to 'scrape'. The request will return a string with the source code to the page. From there, you will use regex to parse whatever data you want from the source code. That's generally how you are going to do it.
查看 cURL 库。我从未在 Java 中使用过它,但我确信一定有相应的绑定。基本上,您要做的是向您想要“抓取”的页面发送一个 cURL 请求。该请求会返回一个包含页面源代码的字符串。然后,您可以使用正则表达式从源代码中解析出您想要的任何数据。通常就是这样做的。
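Bindings aside, the fetch-then-regex flow described above can be sketched with nothing but the JDK; the class and method names below are my own, and a regex like this is only reliable on very simple, stable markup:
撇开绑定不谈,上面描述的“抓取源代码再用正则解析”的流程可以只用 JDK 来勾勒;下面的类名和方法名是我自己起的,而且这样的正则表达式只对非常简单、稳定的标记可靠:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CurlStyleScraper {

    // Fetch a page's source into a String, roughly what a cURL call returns.
    public static String fetch(String url) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Pull the <title> text out of the source with a regex.
    public static String extractTitle(String html) {
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Some Product - $19.99</title></head><body></body></html>";
        System.out.println("Title: " + extractTitle(html)); // prints: Title: Some Product - $19.99
    }
}
```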
回答by sblundy
This is referred to as screen scraping; Wikipedia has this article on the more specific web scraping. It can be a major challenge because there's some ugly, messed-up, broken-if-not-for-browser-cleverness HTML out there, so good luck.
这被称为屏幕抓取(screen scraping),维基百科上有一篇关于更具体的网页抓取(web scraping)的文章。这可能是个不小的挑战,因为外面有不少丑陋、混乱、全靠浏览器的容错机制才能正常显示的 HTML,祝你好运。
回答by Kurru
You'd probably want to look at the HTML to see if you can find strings that are unique and near your text, then you can use line/char-offsets to get to the data.
您可能想查看 HTML 以查看是否可以找到唯一且靠近文本的字符串,然后您可以使用 line/char-offsets 来获取数据。
Could be awkward in Java, if there aren't any XML classes similar to the ones found in System.XML.Linq in C#.
如果 Java 中没有任何类似于 C# 中 System.XML.Linq 的 XML 类,这可能会比较麻烦。
回答by mdma
I would use JTidy - it is similar to JSoup, but I don't know JSoup well. JTidy handles broken HTML and returns a w3c Document, so you can use this as a source to XSLT to extract the content you are really interested in. If you don't know XSLT, then you might as well go with JSoup, as the Document model is nicer to work with than w3c.
我会使用 JTidy,它与 JSoup 类似,但我不太了解 JSoup。JTidy 能处理损坏的 HTML 并返回一个 w3c Document,因此您可以把它作为 XSLT 的输入来提取您真正感兴趣的内容。如果您不了解 XSLT,那不妨直接用 JSoup,因为它的 Document 模型比 w3c 的更好用。
EDIT: A quick look on the JSoup website shows that JSoup may indeed be the better choice. It seems to support CSS selectors out the box for extracting stuff from the document. This may be a lot easier to work with than getting into XSLT.
编辑:在 JSoup 网站上快速浏览显示 JSoup 可能确实是更好的选择。它似乎支持开箱即用的 CSS 选择器,用于从文档中提取内容。这可能比使用 XSLT 容易得多。
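If you do go the JTidy + XSLT route, the extraction step can be sketched with the JDK's built-in XSLT support; the stylesheet and class below are my own illustration, not part of JTidy:
如果您确实走 JTidy + XSLT 这条路,提取步骤可以用 JDK 自带的 XSLT 支持来勾勒;下面的样式表和类只是我自己的示意,并不属于 JTidy:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltExtract {

    // Apply an XSLT stylesheet to well-formed markup (e.g. JTidy's output)
    // and return the transformed result as a String.
    public static String transform(String xml, String xslt) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    // A tiny stylesheet that keeps only the page title, as a stand-in for
    // "extract the content you are really interested in".
    public static String titleOf(String html) throws Exception {
        String xslt =
              "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:template match=\"/\"><xsl:value-of select=\"//title\"/></xsl:template>"
            + "</xsl:stylesheet>";
        return transform(html, xslt);
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Hello</title></head><body><p>noise</p></body></html>";
        System.out.println(titleOf(html)); // prints: Hello
    }
}
```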
回答by Anton
JSoup solution is great, but if you need to extract just something really simple it may be easier to use regex or String.indexOf
JSoup 解决方案很棒,但是如果您只需要提取一些非常简单的东西,使用 regex 或 String.indexOf 可能会更容易
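For completeness, here is roughly what the String.indexOf variant could look like; the helper below is my own sketch and only works when the surrounding markup is unique and stable:
为完整起见,String.indexOf 的变体大致如下;下面的辅助方法是我自己的示意,只有在周围标记唯一且稳定时才有效:

```java
public class IndexOfScraper {

    // Return the text between two unique markers, or null if either is missing.
    public static String between(String html, String start, String end) {
        int from = html.indexOf(start);
        if (from < 0) return null;
        from += start.length();
        int to = html.indexOf(end, from);
        return to < 0 ? null : html.substring(from, to).trim();
    }

    public static void main(String[] args) {
        String html = "<span class=\"price\">$19.99</span>";
        System.out.println(between(html, "<span class=\"price\">", "</span>")); // prints: $19.99
    }
}
```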
As others have already mentioned the process is called scraping
正如其他人已经提到的,这个过程称为抓取(scraping)
回答by Kalpesh Soni
jsoup supports java 1.5
jsoup 支持 java 1.5
https://github.com/tburch/jsoup/commit/d8ea84f46e009a7f144ee414a9fa73ea187019a3
https://github.com/tburch/jsoup/commit/d8ea84f46e009a7f144ee414a9fa73ea187019a3
looks like that stack trace was caused by a bug, which has been fixed
看起来那个堆栈跟踪是由一个错误引起的,该错误已被修复
回答by lipido
You could also try jARVEST.
你也可以试试jARVEST。
It is based on a JRuby DSL over a pure-Java engine to spider-scrape-transform web sites.
它基于构建在纯 Java 引擎之上的 JRuby DSL,用于对网站进行爬取(spider)、抓取(scrape)和转换(transform)。
Example:
示例:
Find all links inside a web page (wget and xpath are constructs of the jARVEST's language):
查找网页中的所有链接(wget 和 xpath 是 jARVEST 语言中的结构):
wget | xpath('//a/@href')
Inside a Java program:
在 Java 程序中:
Jarvest jarvest = new Jarvest();

String[] results = jarvest.exec(
    "wget | xpath('//a/@href')", //robot!
    "http://www.google.com" //inputs
);

for (String s : results) {
    System.out.println(s);
}
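For comparison, the same //a/@href extraction can be done with the JDK's own XPath API, provided the input is well-formed XML/XHTML (real-world HTML usually needs tidying first); this sketch is mine, not part of jARVEST:
作为对比,同样的 //a/@href 提取也可以用 JDK 自带的 XPath API 完成,前提是输入是格式良好的 XML/XHTML(真实网页的 HTML 通常需要先整理);下面的示例是我自己写的,并不属于 jARVEST:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathLinks {

    // Extract every href attribute from a well-formed (X)HTML document.
    public static String[] hrefs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//a/@href", doc, XPathConstants.NODESET);
        String[] out = new String[nodes.getLength()];
        for (int i = 0; i < nodes.getLength(); i++) {
            out[i] = nodes.item(i).getNodeValue();
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"/one\">1</a><a href=\"/two\">2</a></body></html>";
        for (String href : hrefs(page)) {
            System.out.println(href); // prints: /one then /two
        }
    }
}
```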
回答by Louis-wht
My answer probably won't be useful to the writer of this question (I am 8 months late, so not the right timing I guess), but I think it will probably be useful for many other developers who might come across this answer.
我的回答可能对这个问题的作者没有用(我迟到了 8 个月,所以我猜不是正确的时机),但我认为它可能对许多其他可能遇到这个答案的开发人员有用。
Today, I just released (in the name of my company) an HTML to POJO complete framework that you can use to map HTML to any POJO class with simply some annotations. The library itself is quite handy and features many other things all the while being very pluggable. You can have a look to it right here : https://github.com/whimtrip/jwht-htmltopojo
今天,我刚刚(以我公司的名义)发布了一个 HTML 到 POJO 的完整框架,您只需一些注解就能用它把 HTML 映射到任何 POJO 类。该库本身非常方便,并且具有许多其他功能,同时可插拔性很强。你可以在这里看看:https://github.com/whimtrip/jwht-htmltopojo
How to use : Basics
使用方法:基础
Imagine we need to parse the following html page :
想象一下,我们需要解析以下 html 页面:
<html>
    <head>
        <title>A Simple HTML Document</title>
    </head>
    <body>
        <div class="restaurant">
            <h1>A la bonne Franquette</h1>
            <p>French cuisine restaurant for gourmet of fellow french people</p>
            <div class="location">
                <p>in <span>London</span></p>
            </div>
            <p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>
            <div class="meals">
                <div class="meal">
                    <p>Veal Cutlet</p>
                    <p rating-color="green">4.5/5 stars</p>
                    <p>Chef Mr. Frenchie</p>
                </div>
                <div class="meal">
                    <p>Ratatouille</p>
                    <p rating-color="orange">3.6/5 stars</p>
                    <p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
                </div>
            </div>
        </div>
    </body>
</html>
Let's create the POJOs we want to map it to :
让我们创建我们想要将其映射到的 POJO:
public class Restaurant {

    @Selector( value = "div.restaurant > h1")
    private String name;

    @Selector( value = "div.restaurant > p:nth-child(2)")
    private String description;

    @Selector( value = "div.restaurant > div:nth-child(3) > p > span")
    private String location;

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        indexForRegexPattern = 1,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Long id;

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        // This time, we want the second regex group and not the first one anymore
        indexForRegexPattern = 2,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number as they are shown in this format : 18,190
    @ReplaceWith(value = ",", with = "")
    private Integer rank;

    @Selector(value = ".meal")
    private List<Meal> meals;

    // getters and setters

}
And now the Meal class as well:
现在还有 Meal 类:
public class Meal {

    @Selector(value = "p:nth-child(1)")
    private String name;

    @Selector(
        value = "p:nth-child(2)",
        format = "^([0-9.]+)/5 stars$",
        indexForRegexPattern = 1
    )
    private Float stars;

    @Selector(
        value = "p:nth-child(2)",
        // rating-color custom attribute can be used as well
        attr = "rating-color"
    )
    private String ratingColor;

    @Selector(
        value = "p:nth-child(3)"
    )
    private String chefs;

    // getters and setters.

}
We provided some more explanations on the above code on our github page.
我们在我们的 github 页面上对上述代码提供了更多解释。
For the moment, let's see how to scrape this.
现在,让我们看看如何抓取它。
private static final String MY_HTML_FILE = "my-html-file.html";

public static void main(String[] args) throws IOException {

    HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();

    HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);

    // If they were several restaurants in the same page,
    // you would need to create a parent POJO containing
    // a list of Restaurants as shown with the meals here
    Restaurant restaurant = adapter.fromHtml(getHtmlBody());

    // That's it, do some magic now!

}

private static String getHtmlBody() throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
    return new String(encoded, Charset.forName("UTF-8"));
}
Another short example can be found here
另一个简短的例子可以在这里找到
Hope this will help someone out there!
希望这会帮助那里的人!