Java: How to extract data from HTML websites?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15450161/

Date: 2020-10-31 19:38:07  Source: igfitidea

Tags: java, html, automation, extraction

Asked by ankit rawat

I need to extract some text from an HTML-based website. I have about 3000 URLs and need to extract a single line of text from the HTML of each one. The data I need looks like this:

 <html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>Pink Floyd Live Audio Feeds</title> <!-- the line I need -->
...

How can I automate this process? I am good at Java so a methodology using that language is preferred. Thanks!

Answered by Sarath Kumar Sivan

You can use jsoup, which is a good Java library for working with real-world HTML.

Answered by Pshemo

You can read the HTML line by line and stop reading the rest of the page as soon as you find </title>. Here is how this can be done (I assume that <title> and </title> are on the same line of HTML, as you pointed out in a comment):

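import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Returns the text between <title> and </title>, or "" if no line contains both tags.
public static String getTitle(String address) throws IOException {
    URL url = new URL(address);
    // try-with-resources closes the reader even if an exception is thrown
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String line;
        while ((line = reader.readLine()) != null) {
            int start = line.indexOf("<title>");
            int end = line.indexOf("</title>");

            // Require both tags on this line; otherwise substring() would throw
            if (start != -1 && end > start) {
                return line.substring(start + "<title>".length(), end);
            }
        }

        return "";
    }
}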

Answered by ryan

Iterate through your list of URLs and use HttpURLConnection to download each page. Once you have all of the pages, process the data to extract the information you need. See the HttpURLConnection Javadoc page.
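
A rough sketch of the approach described above, using only the JDK. The class and method names (TitleFetcher, extractTitle, download, titlesFor) are made up for illustration and are not part of any library. It also assumes the pages are UTF-8 encoded and that a simple regex is good enough to pull out the <title> text; for messy real-world HTML a proper parser is likely to be more robust.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class TitleFetcher {

    // Matches <title>...</title>, case-insensitively, even if the title spans several lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Step 2: process downloaded HTML. Returns the trimmed title text, or "" if there is none.
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Step 1: download one page with HttpURLConnection (assumes UTF-8 content).
    public static String download(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(10_000); // 10 s timeouts are an arbitrary choice
        conn.setReadTimeout(10_000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder page = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
            return page.toString();
        } finally {
            conn.disconnect();
        }
    }

    // Iterates over the URL list; a page that fails to download maps to "".
    public static Map<String, String> titlesFor(List<String> urls) {
        Map<String, String> titles = new LinkedHashMap<>();
        for (String url : urls) {
            try {
                titles.put(url, extractTitle(download(url)));
            } catch (IOException | IllegalArgumentException e) {
                titles.put(url, "");
            }
        }
        return titles;
    }
}
```

With the 3000 URLs loaded into a List<String>, calling TitleFetcher.titlesFor(urls) would give a map from each URL to its title. This version downloads the pages one at a time, which keeps it simple but may be slow for that many URLs.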