Disclaimer: This page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2971155/

Time: 2020-08-20 08:06:48  Source: igfitidea

What is the fastest way to scrape HTML webpage in Android?

android html web-scraping

Asked by unj2

I need to extract information from an unstructured web page in Android. The information I want is embedded in a table that doesn't have an id.

<table> 
<tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr> 
</table>

Should I use

  • Pattern matching?
  • A BufferedReader to extract the information?

Or is there a faster way to get that information?

Answered by Josef Pfleger

I think in this case it makes no sense to look for a fast way to extract the information, as there is virtually no performance difference between the methods already suggested in the answers when you compare it to the time it will take to download the HTML.

So, assuming that by fastest you mean the most convenient, readable and maintainable code, I suggest you use a DocumentBuilder to parse the relevant HTML and extract the data using XPathExpressions:

Document doc = DocumentBuilderFactory.newInstance()
  .newDocumentBuilder().parse(new InputSource(new StringReader(html)));

XPathExpression xpath = XPathFactory.newInstance()
  .newXPath().compile("//td[text()=\"Description\"]/following-sibling::td[2]");

String result = (String) xpath.evaluate(doc, XPathConstants.STRING);
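
For anyone who wants to try the snippet above end to end, here is a minimal self-contained sketch that runs it against the table from the question. The class name XPathTableExample and the helper cellAfterLabel are made up for illustration, and the sketch assumes the HTML is well-formed enough for an XML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathTableExample {

    // Hypothetical helper: returns the text of the second <td> that follows
    // the cell whose text is exactly `label`. Assumes `label` contains no
    // double quote, because it is pasted into the XPath expression as-is.
    static String cellAfterLabel(String html, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(html)));
            XPathExpression xpath = XPathFactory.newInstance().newXPath()
                    .compile("//td[text()=\"" + label + "\"]/following-sibling::td[2]");
            return (String) xpath.evaluate(doc, XPathConstants.STRING);
        } catch (Exception e) {
            // Parsing fails on HTML that is not well-formed XML
            throw new IllegalStateException("Could not parse or query the HTML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><td>Description</td><td></td>"
                + "<td>I want this field next to the description cell</td></tr>"
                + "</table>";
        System.out.println(cellAfterLabel(html, "Description"));
    }
}
```

Note that evaluating with XPathConstants.STRING yields an empty string rather than null when nothing matches, so an empty result can mean either "cell not found" or "cell is empty".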

If you happen to retrieve invalid HTML, I recommend isolating the relevant portion (e.g. using substring(indexOf("<table")..)) and, if necessary, correcting the remaining HTML errors with String operations before parsing. If this gets too complex, however (i.e. very bad HTML), just go with the hacky pattern matching approach as suggested in other answers.

Remarks

  • XPath is available since API Level 8 (Android 2.2). If you develop for lower API levels, you can use DOM methods and conditionals to navigate to the node you want to extract.
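
The DOM-only fallback mentioned in the remark could look roughly like the sketch below. It is an illustration, not part of the original answer: the class DomTableExample and the helper cellAfterLabel are hypothetical names, and the code assumes each cell holds plain text as its first child.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomTableExample {

    // Text of a cell, assuming the text is the cell's first child node.
    // Returns "" for an empty cell.
    static String text(Node cell) {
        Node child = cell.getFirstChild();
        return child == null ? "" : child.getNodeValue();
    }

    // Hypothetical helper: finds the <td> whose text equals `label` and
    // returns the text of the `offset`-th <td> sibling after it, or null
    // if there is no such cell.
    static String cellAfterLabel(String html, String label, int offset) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(html)));
            NodeList cells = doc.getElementsByTagName("td");
            for (int i = 0; i < cells.getLength(); i++) {
                Node sibling = cells.item(i);
                if (!label.equals(text(sibling))) {
                    continue;
                }
                // Walk right, counting only <td> elements (skips whitespace text nodes)
                int remaining = offset;
                while (remaining > 0 && (sibling = sibling.getNextSibling()) != null) {
                    if (sibling.getNodeType() == Node.ELEMENT_NODE
                            && "td".equals(sibling.getNodeName())) {
                        remaining--;
                    }
                }
                if (sibling != null) {
                    return text(sibling);
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the HTML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Description</td><td></td>"
                + "<td>I want this field next to the description cell</td></tr></table>";
        System.out.println(cellAfterLabel(html, "Description", 2));
    }
}
```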

Answered by BalusC

The fastest way will be parsing the specific information yourself. You seem to know the HTML structure precisely beforehand. The BufferedReader, String and StringBuilder methods should suffice. Here's a kickoff example which displays the first paragraph of your own question:

public static void main(String... args) throws Exception {
    URL url = new URL("http://stackoverflow.com/questions/2971155");
    BufferedReader reader = null;
    StringBuilder builder = new StringBuilder();
    try {
        reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        for (String line; (line = reader.readLine()) != null;) {
            builder.append(line.trim());
        }
    } finally {
        if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
    }

    String start = "<div class=\"post-text\"><p>";
    String end = "</p>";
    String part = builder.substring(builder.indexOf(start) + start.length());
    String question = part.substring(0, part.indexOf(end));
    System.out.println(question);
}

In practically all cases, parsing is faster than pattern matching. Pattern matching is easier to write, but there is a certain risk that it yields unexpected results, particularly when using complex regex patterns.
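
To make the pattern matching option concrete, here is a rough sketch applied to the table row from the question. The class RegexTableExample and the pattern itself are illustrative assumptions, not something taken from the thread; the pattern only works while every cell is written as a bare <td>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTableExample {

    // Matches the "Description" cell, skips one cell, and captures the next one.
    // DOTALL lets a cell span several lines.
    private static final Pattern ROW = Pattern.compile(
            "<td>\\s*Description\\s*</td>\\s*<td>.*?</td>\\s*<td>(.*?)</td>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the captured cell content, or null when the pattern does not match.
    static String extract(String html) {
        Matcher matcher = ROW.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<table> "
                + "<tr><td>Description</td><td></td>"
                + "<td>I want this field next to the description cell</td></tr> "
                + "</table>";
        System.out.println(extract(html));
    }
}
```

The risk mentioned above shows up quickly: as soon as the page writes the cell as <td class="x">Description</td>, this pattern stops matching and extract returns null.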

You can also consider using a more flexible 3rd party HTML parser instead of writing one yourself. It will not be as fast as parsing it yourself with information known beforehand. It will, however, be more concise and flexible. With decent HTML parsers the difference in speed is pretty negligible. I strongly recommend Jsoup for this. It supports jQuery-like CSS selectors. Extracting the first paragraph of your question would then be as easy as:

public static void main(String... args) throws Exception {
    Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155").get();
    String question = document.select("#question .post-text p").first().text();
    System.out.println(question);
}

It's unclear what web page you're talking about, so I can't give a more detailed example of how you could select the specific information from the specific page using Jsoup. If you still can't figure it out on your own using Jsoup and CSS selectors, then feel free to post the URL in a comment and I'll suggest how to do it.

Answered by Praveen

When you scrape an HTML web page, there are two things you can do. The first is using regex. The other is using an HTML parser.

Using regex is not preferred by everyone, because it can cause logic errors at runtime.

Using an HTML parser is more complicated, and you cannot be sure the proper output will come out. In my experience it has also thrown some runtime exceptions.

So it is better to turn the response of the URL into an XML file. XML parsing is then very easy and effective.

Answered by Fedor

Why don't you just write

int start=data.indexOf("Description");

After that take the required substring.
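
Spelled out for the table in the question, that idea might look like the sketch below. The class IndexOfTableExample and the helper cellAfter are hypothetical names, and the code assumes the cells are written as bare <td> tags without attributes.

```java
public class IndexOfTableExample {

    // Hypothetical helper: returns the content of the n-th <td> cell that
    // starts after the first occurrence of `label`, or null if not found.
    static String cellAfter(String data, String label, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        int start = data.indexOf(label);
        if (start < 0) {
            return null;
        }
        // Jump from one "<td>" to the next, n times
        int open = start;
        for (int i = 0; i < n; i++) {
            open = data.indexOf("<td>", open + 1);
            if (open < 0) {
                return null;
            }
        }
        int close = data.indexOf("</td>", open);
        if (close < 0) {
            return null;
        }
        return data.substring(open + "<td>".length(), close);
    }

    public static void main(String[] args) {
        String data = "<table> "
                + "<tr><td>Description</td><td></td>"
                + "<td>I want this field next to the description cell</td></tr> "
                + "</table>";
        System.out.println(cellAfter(data, "Description", 2));
    }
}
```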

Answered by Oren Hizkiya

Why don't you create a script that does the scraping with cURL and Simple HTML DOM Parser and just grab the value you need from that page? These tools work with PHP, but other tools exist for any language you need.

Answered by mtmurdock

One way of doing this is to put the HTML into a String and then manually search and parse through the String. If you know that the tags will come in a specific order, then you should be able to crawl through it and find the data. This, however, is kinda sloppy, so it's a question of whether you want it to work now or work well.

int position = html.indexOf("<table>");  //html being the String holding the html code
// The + 1 makes the nested search skip the first <td> and land on the second cell
int open = html.indexOf("<td>", html.indexOf("<td>", position) + 1);
String field = html.substring(open + 4, html.indexOf("</td>", open));

Like I said... really sloppy. But if you're only doing this once and you need it to work, this just might do the trick.