Java: extract HTML from a URL

Question

Asked by Wassim AZIRAR

I'm using Boilerpipe to extract text from a URL, using this code:

URL url = new URL("http://www.example.com/some-location/index.html");
String text = ArticleExtractor.INSTANCE.getText(url);

The String text contains just the text of the HTML page, but I need to extract the whole HTML code from it.

Is there anyone who used this library and knows how to extract the HTML code?

You can check the demo page for more info on the library.

Answer 1

Answered by Goran Jovic

For something as simple as this you don't really need an external library:

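 // Needs java.io.BufferedReader, java.io.InputStreamReader, java.net.URL,
 // and java.nio.charset.StandardCharsets.
 URL url = new URL("http://www.google.com");
 StringBuilder sb = new StringBuilder();
 // try-with-resources closes the stream; UTF-8 is an assumption about the page's encoding.
 try (BufferedReader br = new BufferedReader(
         new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
   String line;
   while ((line = br.readLine()) != null) {
     // readLine() strips the line terminator, so put one back.
     sb.append(line).append('\n');
   }
 }
 String htmlContent = sb.toString();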

Answer 2

Answered by Konrad Rudolph

Just use the KeepEverythingExtractor instead of the ArticleExtractor.

But this is using the wrong tool for the wrong job. What you want is just to download the HTML content of a URL (right?), not extract content. So why use a content extractor?
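
A minimal sketch of plain downloading with the standard library only, assuming Java 9 or later (for InputStream.readAllBytes) and assuming the page is UTF-8 encoded. The names HtmlDownload, readAll and download are made up for illustration and are not part of Boilerpipe:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HtmlDownload {

    // Reads the whole stream and decodes it with the given charset.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        return new String(in.readAllBytes(), charset);
    }

    // Downloads the raw HTML behind a URL.
    // Assumption: the content is UTF-8; a more careful client would
    // read the charset from the Content-Type response header.
    public static String download(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return readAll(in, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HtmlDownload <url>");
            return;
        }
        System.out.println(download(new URL(args[0])));
    }
}
```

Unlike a content extractor, this returns the markup untouched, tags included.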

Answer 3

Answered by Paul Vargas

With Java 7 and a trick of Scanner, you can do the following:

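import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Objects;
import java.util.Scanner;

public static String toHtmlString(URL url) throws IOException {
    Objects.requireNonNull(url, "The url cannot be null.");
    try (InputStream is = url.openStream(); Scanner sc = new Scanner(is)) {
        // "\\A" is the regex anchor for the beginning of input,
        // so the first token is the entire stream.
        sc.useDelimiter("\\A");
        if (sc.hasNext()) {
            return sc.next();
        } else {
            return null; // or empty
        }
    }
}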
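
The trick relies on \A matching only at the beginning of the input, so the Scanner never finds a second delimiter and returns everything as one token. A small sketch on an in-memory stream shows this without needing a network connection. The class name ScannerTrick and the method slurp are hypothetical, and the explicit UTF-8 charset is an assumption (the version above uses the platform default):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ScannerTrick {

    // Returns the whole stream as one string, or "" if the stream is empty.
    public static String slurp(InputStream is) {
        try (Scanner sc = new Scanner(is, StandardCharsets.UTF_8.name())) {
            sc.useDelimiter("\\A"); // beginning-of-input anchor
            return sc.hasNext() ? sc.next() : "";
        }
    }

    public static void main(String[] args) {
        byte[] page = "<html>\n<body>hi</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        // Line breaks inside the input are preserved in the result.
        System.out.println(slurp(new ByteArrayInputStream(page)));
    }
}
```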
