Java: how to split a string according to tag name in an HTML page using String.split()
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/5070265/
Asked by mnish
I want to split the following string according to the td tags:
<html>
<body>
<table>
<tr><td>data1</td></tr>
<tr><td>data2</td></tr>
<tr><td>data3</td></tr>
<tr><td>data4</td></tr>
</table>
</body>
I've tried split("h2"); and split("[h2]"); but this way the split method splits the HTML code wherever it finds "h" or "2" and, if I am not mistaken, also "h2".
My ultimate goal is to retrieve everything between <td> and </td>.
Can anyone please tell me how to do this using only split()?
Thanks a lot
Answered by Matt Ball
No.
That would mean — in essence — parsing HTML with regex. We don't do that 'round these parts.
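For context on why split() amounts to regex matching: String.split() treats its argument as a regular expression, so "[h2]" is a character class that matches a single 'h' or a single '2', while "h2" matches only the two-character sequence. The sketch below illustrates this on a made-up input string; the class name SplitDemo and its helper methods are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

public class SplitDemo {
    // Splits on the literal two-character sequence "h2".
    static String[] splitLiteral(String s) {
        return s.split("h2");
    }

    // "[h2]" is a regex character class: splits on every single 'h' or '2'.
    static String[] splitCharClass(String s) {
        return s.split("[h2]");
    }

    public static void main(String[] args) {
        String s = "<h2>data2</h2>"; // made-up sample input
        // prints [<, >data2</, >]
        System.out.println(Arrays.toString(splitLiteral(s)));
        // prints [<, , >data, </, , >]
        System.out.println(Arrays.toString(splitCharClass(s)));
    }
}
```

Note that the character class also cuts the payload "data2" apart, which is one reason split() is a poor fit for extracting tag contents.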
Answered by AlexR
Here is how to reach your ultimate goal:
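import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = ""; // your html
// Capture the text between <td> and </td>; [^<]* stops at the next tag.
Pattern p = Pattern.compile("<td>([^<]*)</td>", Pattern.MULTILINE | Pattern.DOTALL);
for (Matcher m = p.matcher(html); m.find(); ) {
    String tag = m.group(1); // contents of one <td> cell
    System.out.println(tag);
}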
Please note that this code was written here without a compiler, but it gives the idea.
BUT why do you want to parse HTML using regex? I agree with the others: use an HTML parser, or an XML parser if your HTML is well-formed.
Answered by Kurt Kaylor
You cannot successfully parse HTML (or in your case, get the data between TD tags) with regular expressions. You should take a look at a simple HTML parser:
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public static List<String> extractTDs(String html) throws IOException {
    final List<String> tdList = new ArrayList<String>();
    ParserDelegator parserDelegator = new ParserDelegator();
    ParserCallback parserCallback = new ParserCallback() {
        // Collects the text seen since the last end tag.
        StringBuffer buffer = new StringBuffer();

        @Override
        public void handleText(final char[] data, final int pos) {
            buffer.append(data);
        }

        @Override
        public void handleEndTag(Tag t, final int pos) {
            // Keep the collected text only when a </td> closes.
            if (Tag.TD.equals(t)) {
                tdList.add(buffer.toString());
            }
            buffer = new StringBuffer();
        }
    };
    parserDelegator.parse(new StringReader(html), parserCallback, true);
    return tdList;
}
Answered by Johan Sjöberg
You should really use an HTML parser, such as neko html or HtmlParser.
Iff you have a very small set of controlled HTML, you could (although I generally recommend against it) use a regex such as
(?<=\<td\>)\w+(?=\</td\>)
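As a rough illustration of how that lookaround pattern could be applied in Java, here is a minimal sketch. The class name TdLookaround and the method extract are hypothetical, and the redundant backslashes before < and > are dropped since those characters need no escaping in Java regex. It assumes cell contents consist only of word characters, because \w+ matches nothing else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TdLookaround {
    // Lookbehind requires a preceding <td>, lookahead requires a following </td>;
    // only the \w+ run in between is part of the match.
    private static final Pattern TD_TEXT = Pattern.compile("(?<=<td>)\\w+(?=</td>)");

    static List<String> extract(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = TD_TEXT.matcher(html);
        while (m.find()) {
            cells.add(m.group());
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<tr><td>data1</td></tr><tr><td>data 2</td></tr>";
        // prints [data1]; "data 2" is skipped because \w+ cannot match the space
        System.out.println(extract(html));
    }
}
```

The skipped second cell shows how narrow the "controlled HTML" assumption is: any whitespace, punctuation, or nested markup inside a cell makes the pattern miss it.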
Answered by Sanjit Saluja
String.split or regexes should not be used to parse markup languages, as they have no notion of depth (HTML is a recursive grammar and needs a recursive parser). Consider what would happen if your <td> looked like:
<td>
<table><tr><td> td inside a td? </td></tr></table>
</td>
A regex would greedily match everything between the outer <td>...</td>, giving you unwanted results.
You should use an HTML parser, as Johan mentioned.
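To make the depth problem concrete, here is a small sketch that runs a greedy and a reluctant pattern over a nested cell followed by a sibling cell. The class name NestedTdDemo, the two patterns, and the sample input are assumptions chosen for illustration; they are not taken from the answers above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTdDemo {
    // Greedy: runs from the first <td> to the LAST </td> in the input.
    static final Pattern GREEDY = Pattern.compile("<td>(.*)</td>", Pattern.DOTALL);
    // Reluctant: stops at the FIRST </td>, even if it closes an inner cell.
    static final Pattern RELUCTANT = Pattern.compile("<td>(.*?)</td>", Pattern.DOTALL);

    // Returns the captured group of the first match, or null if none.
    static String firstGroup(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<td><table><tr><td>inner</td></tr></table></td><td>next</td>";
        // Swallows the outer close tag and the start of the sibling cell.
        System.out.println(firstGroup(GREEDY, html));
        // Cuts the outer cell off at the inner </td>.
        System.out.println(firstGroup(RELUCTANT, html));
    }
}
```

Neither variant returns the contents of the outer cell, because neither can count how many <td> tags are currently open.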