Java: how to use String.split() to split a string by tag name in an HTML page

Question

Asked by mnish

I want to split the following string according to the td tags:

<html>

<body>
  <table>
    <tr><td>data1</td></tr>
    <tr><td>data2</td></tr>
    <tr><td>data3</td></tr>
    <tr><td>data4</td></tr>
  </table>
</body>

I've tried split("h2"); and split("[h2]"); but this way the split method splits the HTML code wherever it finds "h" or "2" and, if I am not mistaken, also "h2".
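For context, String.split() treats its argument as a regular expression, and square brackets form a character class, which is why split("[h2]") cuts at every single "h" or "2". The sketch below illustrates the difference. The class name SplitDemo, the helper cellsBySplit, and the sample strings are made up for illustration, and the </?td> pattern is only an assumption that holds for flat, attribute-free markup like the table above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitDemo {

    // Hypothetical helper: splits on a literal <td> or </td> and keeps the
    // pieces that sat between an opening and a closing tag (odd indexes).
    // Assumes no attributes, no nesting, and no stray tags.
    public static List<String> cellsBySplit(String html) {
        String[] parts = html.split("</?td>");
        List<String> cells = new ArrayList<>();
        for (int i = 1; i < parts.length; i += 2) {
            cells.add(parts[i]);
        }
        return cells;
    }

    public static void main(String[] args) {
        // A character class matches ONE character: either 'h' or '2'.
        System.out.println(Arrays.toString("ah2b2c".split("[h2]")));
        // Without brackets, the two characters are matched as a sequence.
        System.out.println(Arrays.toString("ah2b2c".split("h2")));

        String html = "<tr><td>data1</td></tr><tr><td>data2</td></tr>";
        System.out.println(cellsBySplit(html));
    }
}
```

This only works because every odd-indexed piece is known to be cell content; any nested or malformed tag breaks that assumption.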

My ultimate goal is to retrieve everything between <td> and </td>.

Can anyone please tell me how to do this using only split()?

Thanks a lot.

Answer 1

Answered by Matt Ball

No.

That would mean — in essence — parsing HTML with regex. We don't do that 'round these parts.

Answer 2

Answered by AlexR

Here is how to reach your ultimate goal:

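import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = ""; // your html
// Capture the text between <td> and </td>; [^<]* stops at the next tag.
Pattern p = Pattern.compile("<td>([^<]*)</td>", Pattern.MULTILINE | Pattern.DOTALL);

for (Matcher m = p.matcher(html); m.find(); ) {
    String tag = m.group(1);
    System.out.println(tag);
}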

Please note that this code was written here without a compiler, but it gives the idea.
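As a usage sketch, the same pattern can be run against a table like the one in the question. The class name TdRegexDemo and the method extractCells are made up for illustration. The MULTILINE and DOTALL flags are left out here because the pattern contains neither "." nor "^"/"$" anchors, so they would have no effect on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TdRegexDemo {

    // Matches a <td> without attributes whose content contains no '<'.
    private static final Pattern TD = Pattern.compile("<td>([^<]*)</td>");

    // Hypothetical helper: collects the captured cell text in document order.
    public static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = TD.matcher(html);
        while (m.find()) {
            cells.add(m.group(1));
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<table>\n"
                + "  <tr><td>data1</td></tr>\n"
                + "  <tr><td>data2</td></tr>\n"
                + "</table>";
        System.out.println(extractCells(html));
    }
}
```

A cell written as <td class="x"> or one that contains another tag is silently skipped by this pattern, which is one reason the answers here lean toward a parser.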

BUT why do you want to parse HTML using regex? I agree with the others: use an HTML or XML parser (the latter only if your HTML is well-formed).

Answer 3

Answered by Kurt Kaylor

You cannot successfully parse HTML (or, in your case, get the data between TD tags) with regular expressions. You should take a look at a simple HTML parser:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public static List<String> extractTDs(String html) throws IOException {
    final List<String> tdList = new ArrayList<String>();

    ParserDelegator parserDelegator = new ParserDelegator();
    ParserCallback parserCallback = new ParserCallback() {
        StringBuffer buffer = new StringBuffer();
        public void handleText(final char[] data, final int pos) {
            buffer.append(data);
        }
        public void handleEndTag(Tag t, final int pos) {  
            if(Tag.TD.equals(t)) {
                tdList.add(buffer.toString());
            }
            buffer = new StringBuffer();
        }
    };

    parserDelegator.parse(new StringReader(html), parserCallback, true);

    return tdList;
}

Answer 4

Answered by Johan Sjöberg

You should really use an HTML parser, such as NekoHTML or HtmlParser.

If and only if you have a very small set of controlled HTML, you could (although I generally recommend against it) use a regex such as

(?<=\<td\>)\w+(?=\</td\>)
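In Java, that expression could be applied roughly as follows. This is a minimal sketch: the class name LookaroundDemo, the method extractCells, and the sample input are made up for illustration. Note that \w+ only matches letters, digits, and underscores, so a cell such as "data 2" is not returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {

    // The regex above as a Java string literal: the backslash in \w is doubled,
    // and the escapes before < and > are dropped because those characters
    // have no special meaning on their own.
    private static final Pattern CELL = Pattern.compile("(?<=<td>)\\w+(?=</td>)");

    // Hypothetical helper: returns every run of word characters that is
    // immediately enclosed by <td> and </td>.
    public static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            cells.add(m.group()); // lookarounds keep the tags out of the match
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<tr><td>data1</td></tr><tr><td>data 2</td></tr>";
        System.out.println(extractCells(html));
    }
}
```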

Answer 5

Answered by Sanjit Saluja

String.split() or regexes should not be used to parse markup languages, as they have no notion of depth (HTML is a recursive grammar and needs a recursive parser). Consider what would happen if your <td> looked like:

<td>
  <table><tr><td> td inside a td? </td></tr></table>
</td>

A greedy regex would match everything between the outer <td>...</td>, giving you unwanted results.
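The nesting problem can be reproduced with a small sketch. The class name NestedTdDemo, the helper firstMatch, and the one-line sample input are made up for illustration; the two patterns are common greedy and lazy variants, not something taken from the answers here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTdDemo {

    // Hypothetical helper: returns group 1 of the first match, or null.
    // DOTALL lets '.' cross line breaks, as real HTML usually spans lines.
    public static String firstMatch(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<td><table><tr><td>inner</td></tr></table></td>";

        // Greedy: runs to the LAST </td>, so the whole nested table comes back.
        System.out.println(firstMatch("<td>(.*)</td>", html));

        // Lazy: stops at the FIRST </td>, pairing the outer <td> with the
        // inner closing tag and returning a broken fragment.
        System.out.println(firstMatch("<td>(.*?)</td>", html));
    }
}
```

Neither variant tracks which closing tag belongs to which opening tag, which is the depth information a parser keeps and a regex does not.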

You should use an HTML parser like Johan mentioned.
