Java: how to use String.split() to split a string by tag name in an HTML page

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/5070265/

Date: 2020-10-30 09:16:30  Source: igfitidea

how to split string according to tag name in an HTML page using String.split()

java html string split

Asked by mnish

I want to split the following string according to the td tags:

<html>

<body>
  <table>
    <tr><td>data1</td></tr>
    <tr><td>data2</td></tr>
    <tr><td>data3</td></tr>
    <tr><td>data4</td></tr>
  </table>
</body>

I've tried split("h2"); and split("[h2]"); but this way the split method splits the HTML code wherever it finds "h" or "2" and, if I am not mistaken, also "h2".
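For reference, String.split() treats its argument as a regular expression, so "h2" matches only the literal two-character sequence, while "[h2]" is a character class that matches a single "h" or a single "2". A minimal sketch of the difference (the class name, helper, and sample input are made up for illustration):

```java
import java.util.Arrays;

class SplitBehaviourDemo {
    // Thin wrapper so each call shows exactly which regex is handed to String.split().
    static String[] splitBy(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String html = "<h2>x</h2>";

        // "h2" is a plain two-character pattern: the split happens only where "h2" occurs.
        System.out.println(Arrays.toString(splitBy(html, "h2")));

        // "[h2]" is a character class: the split happens at every single 'h' and every single '2',
        // which also leaves empty strings between adjacent delimiters.
        System.out.println(Arrays.toString(splitBy(html, "[h2]")));
    }
}
```

Either way, the pieces still contain tag fragments such as "<" and ">x</", which is why splitting on a tag name alone does not isolate the cell contents.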

My ultimate goal is to retrieve everything between <td> and </td>.

Can anyone please tell me how to do this using only split()?

Thanks a lot

Answered by Matt Ball

No.

That would mean — in essence — parsing HTML with regex. We don't do that 'round these parts.

Answered by AlexR

Here is how to reach your ultimate goal:

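import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = ""; // your html
// Capture the text between <td> and </td>; [^<]* stops at the next tag
Pattern p = Pattern.compile("<td>([^<]*)</td>", Pattern.MULTILINE | Pattern.DOTALL);

for (Matcher m = p.matcher(html); m.find(); ) {
    String tag = m.group(1);
    System.out.println(tag);
}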

Please note that this code was written here without a compiler, but it gives the idea.

BUT why do you want to parse HTML using regex? I agree with the others: use an HTML parser, or an XML parser if your HTML is well-formed.

Answered by Kurt Kaylor

You cannot successfully parse HTML (or in your case, get the data between TD tags) with regular expressions. You should take a look at a simple HTML parser:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public static List<String> extractTDs(String html) throws IOException {
    final List<String> tdList = new ArrayList<String>();

    ParserDelegator parserDelegator = new ParserDelegator();
    ParserCallback parserCallback = new ParserCallback() {
        StringBuffer buffer = new StringBuffer();
        public void handleText(final char[] data, final int pos) {
            buffer.append(data);
        }
        public void handleEndTag(Tag t, final int pos) {  
            if(Tag.TD.equals(t)) {
                tdList.add(buffer.toString());
            }
            buffer = new StringBuffer();
        }
    };

    parserDelegator.parse(new StringReader(html), parserCallback, true);

    return tdList;
}

Answered by Johan Sjöberg

You should really use an HTML parser, such as NekoHTML or HtmlParser.

If and only if you have a very small set of controlled HTML, you could (although I generally recommend against it) use a regex such as

(?<=\<td\>)\w+(?=\</td\>)
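As a rough sketch of how that lookaround pattern could be applied in Java, assuming the cells contain only word characters as in the question's sample: the pattern below is written without the backslashes before < and >, which java.util.regex does not require, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundTdDemo {
    // Lookbehind requires "<td>" just before the match, lookahead requires "</td>" just after;
    // only the \w+ run in between is part of the match itself.
    private static final Pattern TD_TEXT = Pattern.compile("(?<=<td>)\\w+(?=</td>)");

    // Collects every run of word characters that fills an entire <td>...</td> cell.
    static List<String> cellWords(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = TD_TEXT.matcher(html);
        while (m.find()) {
            cells.add(m.group());
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<table>\n<tr><td>data1</td></tr>\n<tr><td>data2</td></tr>\n</table>";
        System.out.println(cellWords(html)); // prints [data1, data2]
    }
}
```

Because \w+ does not match spaces, punctuation, or nested tags, a cell such as <td>two words</td> is silently skipped, which is one reason this only suits tightly controlled input.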

Answered by Sanjit Saluja

String.split() or regexes should not be used to parse markup languages, as they have no notion of depth (HTML has a recursive grammar and needs a recursive parser). Consider what would happen if your <td> looked like:

<td>
  <table><tr><td> td inside a td? </td></tr></table>
</td>

A regex would greedily match everything between the outer <td>...</td>, giving you unwanted results.
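A small sketch of that failure mode on a nested cell, comparing three common patterns; the helper name and the compacted sample string are illustrative assumptions, not part of the original answer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedTdDemo {
    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstMatch(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<td><table><tr><td>inner</td></tr></table></td>";

        // Greedy: runs from the first <td> to the LAST </td>, swallowing the nested markup.
        System.out.println(firstMatch("<td>(.*)</td>", html));

        // Reluctant: stops at the FIRST </td>, pairing the outer <td> with the inner close tag.
        System.out.println(firstMatch("<td>(.*?)</td>", html));

        // Excluding '<': skips the outer cell entirely and only sees the innermost text.
        System.out.println(firstMatch("<td>([^<]*)</td>", html));
    }
}
```

None of the three variants tracks which </td> belongs to which <td>; that pairing is what a real parser provides.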

You should use an HTML parser like Johan mentioned.
