Java: How do I extract data from an HTML website?

Question

Asked by ankit rawat

I need to extract some text from an HTML-based website. I have about 3000 URLs and need to extract a single line of text from the HTML of each one. The data I need looks like this:

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>Pink Floyd Live Audio Feeds</title> // the line I need
...

How can I automate this process? I am comfortable with Java, so an approach using that language is preferred. Thanks!

Answer 1

Answered by Sarath Kumar Sivan

You can use jsoup, which is a good Java library for working with real-world HTML.

Answer 2

Answered by Pshemo

You can read the HTML line by line and stop reading the rest of the page as soon as you find </title>. Here is how this can be done (I assume that <title> and </title> are on the same line of HTML, as you pointed out in a comment):

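import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Returns the page title, or "" if no line contains both <title> and </title>.
public static String getTitle(String address) throws IOException {
    URL url = new URL(address);
    // try-with-resources closes the reader even if an exception is thrown
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String line;
        while ((line = reader.readLine()) != null) {
            int start = line.indexOf("<title>");
            int end = line.indexOf("</title>");

            // Require both tags on this line; with end == -1, substring would throw
            if (start != -1 && end > start) {
                return line.substring(start + "<title>".length(), end);
            }
        }

        return "";
    }
}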
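
The method above relies on <title> and </title> sharing one line. If that assumption might not hold for some of the 3000 pages, one possible variation is to match the title against the whole page text with a regular expression. The following is only a sketch: the class name TitleRegex and the method extractTitle are illustrative and not part of the original answer, and a regular expression is not a full HTML parser, so unusual markup can still defeat it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleRegex {
    // Case-insensitive, and DOTALL so that '.' also matches line breaks
    // inside the title. "\\b[^>]*" tolerates attributes on the opening tag.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of the first <title> element, or "" if none is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return "";
        }
        // Collapse line breaks and runs of spaces into single spaces
        return m.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html>\n<head>\n<TITLE>\n  Pink Floyd Live Audio Feeds\n</TITLE>\n</head>";
        System.out.println(extractTitle(html)); // prints: Pink Floyd Live Audio Feeds
    }
}
```

This version needs the page text as a single String, so it trades the early exit of the line-by-line reader for tolerance of titles that span several lines.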

Answer 3

Answered by ryan

Iterate through your list of URLs and use HttpURLConnection to download each page. Once you have all of the pages, process the data to extract the information you need. See the HttpURLConnection Javadoc page for details.
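
The download loop described here could look roughly like the sketch below. The names PageDownloader, readBody, download and downloadAll are hypothetical, the 10-second timeouts are arbitrary, the URL in main is a placeholder, and the code assumes pages are encoded as UTF-8, which real sites may not be.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PageDownloader {
    // Reads the whole stream into a String, assuming UTF-8 content.
    static String readBody(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Downloads a single page over HTTP(S) with HttpURLConnection.
    static String download(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // milliseconds; arbitrary choice
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            return readBody(in);
        } finally {
            conn.disconnect();
        }
    }

    // Downloads every address in order; failed pages are logged and skipped
    // so that one bad URL does not abort the whole batch.
    static Map<String, String> downloadAll(List<String> addresses) {
        Map<String, String> pages = new LinkedHashMap<>();
        for (String address : addresses) {
            try {
                pages.put(address, download(address));
            } catch (IOException e) {
                System.err.println("Skipping " + address + ": " + e);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Placeholder list; in practice this would hold the 3000 URLs.
        List<String> addresses = Arrays.asList("http://example.com/");
        for (Map.Entry<String, String> page : downloadAll(addresses).entrySet()) {
            System.out.println(page.getKey() + ": " + page.getValue().length() + " characters");
        }
    }
}
```

Keeping every page in memory is fine for a sketch. For 3000 URLs it may be preferable to extract the needed line from each page right after downloading it and keep only that result.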
