Java: Extract HTML from URL
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/5213558/
Extract HTML from URL
Asked by Wassim AZIRAR
I'm using Boilerpipe to extract text from a URL, using this code:
URL url = new URL("http://www.example.com/some-location/index.html");
String text = ArticleExtractor.INSTANCE.getText(url);
The String text contains just the text of the HTML page, but I need to extract the whole HTML code from it.
Is there anyone who used this library and knows how to extract the HTML code?
You can check the demo page for more info on the library.
Answered by Goran Jovic
For something as simple as this you don't really need an external library:
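import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

URL url = new URL("http://www.google.com");
StringBuilder sb = new StringBuilder();
// UTF-8 is an assumption here; the page may declare a different charset.
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = br.readLine()) != null) {
        // readLine() strips the line terminator, so add one back.
        sb.append(line).append('\n');
    }
}
String htmlContent = sb.toString();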
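As an alternative to reading line by line, the stream can be read in one call on Java 9 or later. The following is a minimal sketch, not part of the original answer: the class name HtmlBytes and the method readFully are hypothetical, and the caller is assumed to know the page's charset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

final class HtmlBytes {
    private HtmlBytes() {}

    /** Reads the stream to its end and decodes the bytes with the given charset. */
    static String readFully(InputStream in, Charset charset) throws IOException {
        // InputStream.readAllBytes() is available since Java 9.
        // Every byte is kept, so line terminators survive unchanged.
        return new String(in.readAllBytes(), charset);
    }
}
```

With a URL, this could be called as HtmlBytes.readFully(url.openStream(), StandardCharsets.UTF_8) inside a try-with-resources block so the stream is closed afterwards.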
Answered by Konrad Rudolph
Just use the KeepEverythingExtractor instead of the ArticleExtractor.
But this is using the wrong tool for the wrong job. What you want is just to download the HTML content of a URL (right?), not extract content. So why use a content extractor?
Answered by Paul Vargas
With Java 7 and a trick of Scanner, you can do the following:
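import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Objects;
import java.util.Scanner;

public static String toHtmlString(URL url) throws IOException {
    Objects.requireNonNull(url, "The url cannot be null.");
    try (InputStream is = url.openStream(); Scanner sc = new Scanner(is)) {
        // "\\A" matches only the beginning of the input, so the whole stream is one token.
        sc.useDelimiter("\\A");
        if (sc.hasNext()) {
            return sc.next();
        } else {
            return null; // or empty
        }
    }
}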
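The Scanner trick above can be tried without network access by feeding it an in-memory stream. This is a small sketch under stated assumptions: the class ScannerSlurp and the method readAll are hypothetical names, the charset is passed explicitly rather than relying on the platform default, and an empty stream yields an empty string instead of null.

```java
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Scanner;

final class ScannerSlurp {
    private ScannerSlurp() {}

    /** Reads the whole stream as a single Scanner token. */
    static String readAll(InputStream in, Charset charset) {
        // Closing the Scanner also closes the underlying stream.
        try (Scanner sc = new Scanner(in, charset.name())) {
            // "\\A" is the regex anchor for the beginning of the input.
            // It never matches again, so next() returns everything.
            sc.useDelimiter("\\A");
            return sc.hasNext() ? sc.next() : "";
        }
    }
}
```

Passing url.openStream() as the first argument gives the same behavior as the method in the answer, apart from the explicit charset and the empty-string result.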