Quick way to find a value in HTML (Java)

Question

Asked by pek

Using regular expressions, what is the simplest way to fetch a website's HTML and find the value inside this tag (or any attribute's value, for that matter):

<html>
  <head>
  [snip]
  <meta name="generator" value="thevalue i'm looking for" />
  [snip]

Answer 1

Answered by Mike Haboustak

It depends on how sophisticated an HTTP request you need to build (authentication, etc.). Here's one simple way I've seen used in the past.

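import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindGenerator {
    public static void main(String[] args) throws IOException {
        // Read the whole page into memory, one line at a time.
        StringBuilder html = new StringBuilder();
        URL url = new URL("http://www.google.com/");
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader input = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String htmlLine;
            while ((htmlLine = input.readLine()) != null) {
                html.append(htmlLine).append('\n');
            }
        }

        // Capture the value attribute of the generator meta tag.
        Pattern exp = Pattern.compile(
            "<meta name=\"generator\" value=\"([^\"]*)\" />");
        Matcher matcher = exp.matcher(html.toString());
        if (matcher.find()) {
            System.out.println("Generator: " + matcher.group(1));
        }
    }
}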

I haven't compiled this, so there may be typos left to find. (Hope this wasn't homework.)

Answer 2

Answered by Mads Burgandy

It's amazing how no one, when addressing the problem of using regex with HTML, confronts the fact that HTML is often NOT well-formed, which renders a lot of HTML parsers completely useless.

If you are developing tools to analyze web pages, and it's a fact that these pages are not well-formed HTML, then the statement "Regex should never be used to parse HTML" or "use an HTML parser" is just completely bogus. The fact is that in the real world people write HTML however they feel like, and the result is not necessarily suited to parsers.

Regex is a completely valid way to find elements in text, and thus in HTML. If there is any other reasonable way to solve the original poster's problem, then post it instead of falling back on a "use a parser" or "RTFM" statement.

Answer 3

Answered by vrdhn

You should be using an XPath query.

It's as simple as getting the value of /html/head/meta[@name='generator']/@value. (The attribute value must be quoted; an unquoted generator would be read as a child element name.)

A good tutorial: Parsing an XML Document with XPath
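A minimal sketch of that query using the JDK's built-in javax.xml.parsers and javax.xml.xpath APIs. It assumes the page is well-formed XML (XHTML) with no default namespace; typical real-world HTML will make the XML parser throw instead. The class name XPathMetaDemo and the sample page are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathMetaDemo {
    /**
     * Returns the generator value, or an empty string if the node is absent.
     * Assumption: the input is well-formed XML without a default namespace.
     */
    static String findGenerator(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the string value of the first match.
            return xpath.evaluate(
                    "/html/head/meta[@name='generator']/@value", doc);
        } catch (Exception e) {
            // Malformed markup ends up here; a strict XML parser does not recover.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample page, kept well-formed on purpose.
        String page = "<html><head><title>t</title>"
                + "<meta name=\"generator\" value=\"thevalue i'm looking for\" />"
                + "</head><body></body></html>";
        System.out.println("Generator: " + findGenerator(page));
    }
}
```

If the pages you fetch are not guaranteed to be well-formed, this approach needs a clean-up step in front of it, which is the concern the other answers raise.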

Answer 4

Answered by Stephen C

It depends.

If you are extracting information from a site or sites that are guaranteed to serve well-formed HTML, and you know that the <meta> tag won't be obfuscated in some way, then reading the <head> section line by line and applying a regex is a good approach.

On the other hand, if the HTML may be mangled or "tricky", then you need to use a proper HTML parser, possibly a permissive one like HTMLTidy. Beware of using a strict HTML or XML parser on content trawled from random websites. Lots of the so-called HTML you find out there is actually malformed.
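For the first case (predictable, well-formed pages), the line-by-line scan of the <head> section could look roughly like this. It is a sketch only: it reuses the fixed-layout pattern from Answer 1, assumes the whole <meta> tag sits on one line, and stops at the first line containing a lowercase </head>. The class name HeadScanner and the sample page are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadScanner {
    // Same fixed-layout pattern as in Answer 1; only suitable for predictable markup.
    private static final Pattern GENERATOR = Pattern.compile(
            "<meta name=\"generator\" value=\"([^\"]*)\" />");

    /** Returns the generator value, or null if </head> or end of input comes first. */
    static String findGenerator(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = GENERATOR.matcher(line);
                if (m.find()) {
                    return m.group(1);
                }
                if (line.contains("</head>")) {
                    break; // nothing of interest past the head section
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical page; in practice the Reader would wrap a network stream.
        String page = String.join("\n",
                "<html>",
                "  <head>",
                "  <meta name=\"generator\" value=\"thevalue i'm looking for\" />",
                "  </head>",
                "  <body>lots of content that is never read</body>",
                "</html>");
        System.out.println("Generator: " + findGenerator(new StringReader(page)));
    }
}
```

Stopping at </head> means the body is never scanned, which keeps the work small on large pages.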

Answer 5

Answered by Justin Bennett

You may want to check the documentation for Apache's org.apache.commons.httpclient package and the related packages. Sending an HTTP request from a Java application is pretty easy to do. Poking through the documentation should get you started in the right direction.

Answer 6

Answered by Paul Tomblin

I haven't tried this, but wouldn't the basic framework be:

Open a java.net.HttpURLConnection
Get an input stream using getInputStream
Use the regular expression in Mike's answer to parse out the bit you want
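The three steps above can be sketched as follows. This is an illustration under assumptions, not a tested client: the class and method names are invented, the page is assumed to be UTF-8, the five-second timeouts are arbitrary, and the address in main is a placeholder.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class HttpFetchSketch {
    // The regular expression from Mike's answer.
    private static final Pattern GENERATOR = Pattern.compile(
            "<meta name=\"generator\" value=\"([^\"]*)\" />");

    /** Step 3: read the stream fully and pull out the value, or null if absent. */
    static String extractGenerator(InputStream in) {
        // Assumption: the page is encoded as UTF-8.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String html = reader.lines().collect(Collectors.joining("\n"));
            Matcher m = GENERATOR.matcher(html);
            return m.find() ? m.group(1) : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Steps 1 and 2: open the connection and obtain its input stream. */
    static String fetchGenerator(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(5000); // arbitrary timeouts for the sketch
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            return extractGenerator(in);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; pass a real one as the first argument.
        String address = args.length > 0 ? args[0] : "http://www.example.com/";
        System.out.println("Generator: " + fetchGenerator(address));
    }
}
```

Keeping the extraction separate from the network call makes it possible to exercise the regex against an in-memory stream without touching the network.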

Answer 7

Answered by Eek

Strictly speaking, you can't really be sure you got the right value, since the meta tag may be commented out, or written in uppercase, etc. It depends on how certain you are that the HTML can be considered "nice".
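A regex can be made somewhat more tolerant of the cases mentioned here. The sketch below strips HTML comments first, matches case-insensitively, and accepts single or double quotes and flexible whitespace. It still assumes that name comes directly before value, so it is a partial mitigation rather than a guarantee. The class name TolerantMetaRegex is made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TolerantMetaRegex {
    // Removes <!-- ... --> blocks, including ones that span several lines.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Groups 1 and 2 capture the opening quote so the same quote must close
    // the attribute; group 3 is the value itself.
    private static final Pattern GENERATOR = Pattern.compile(
            "<meta\\s+name\\s*=\\s*([\"'])generator\\1"
                    + "\\s+value\\s*=\\s*([\"'])(.*?)\\2",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the first generator value outside comments, or null if none. */
    static String findGenerator(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        Matcher m = GENERATOR.matcher(withoutComments);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: one commented-out tag, one live tag in uppercase.
        String page = "<!-- <meta name=\"generator\" value=\"old\" /> -->\n"
                + "<META NAME='generator' VALUE=\"new one\">";
        System.out.println("Generator: " + findGenerator(page));
    }
}
```

Using a backreference for the closing quote lets a double-quoted value contain an apostrophe, as the value in the question does.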
