Java regular expression to extract text between tags

Question

Asked by b10hazard

I have a file with some custom tags and I'd like to write a regular expression to extract the string between the tags. For example if my tag is:

[customtag]String I want to extract[/customtag]

How would I write a regular expression to extract only the string between the tags? This code seems like a step in the right direction:

Pattern p = Pattern.compile("[customtag](.+?)[/customtag]");
Matcher m = p.matcher("[customtag]String I want to extract[/customtag]");

Not sure what to do next. Any ideas? Thanks.

Answer 1

Accepted answer by hoipolloi

You're on the right track. Now you just need to extract the desired group, as follows:

final Pattern pattern = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher("<tag>String I want to extract</tag>");
matcher.find();
System.out.println(matcher.group(1)); // Prints String I want to extract

If you want to extract multiple hits, try this:

public static void main(String[] args) {
    final String str = "<tag>apple</tag><b>hello</b><tag>orange</tag><tag>pear</tag>";
    System.out.println(Arrays.toString(getTagValues(str).toArray())); // Prints [apple, orange, pear]
}

private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);

private static List<String> getTagValues(final String str) {
    final List<String> tagValues = new ArrayList<String>();
    final Matcher matcher = TAG_REGEX.matcher(str);
    while (matcher.find()) {
        tagValues.add(matcher.group(1));
    }
    return tagValues;
}

However, I agree that regular expressions are not the best answer here. I'd use XPath to find the elements I'm interested in. See The Java XPath API for more info.
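As a rough illustration of the XPath route, here is a minimal sketch using the standard `javax.xml.xpath` API. It assumes the input is well-formed XML, so the sample string is wrapped in a hypothetical `<root>` element that is not part of the original example. The class and method names are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTagValues {
    // Returns the text content of every <tag> element in a well-formed XML string.
    static List<String> getTagValues(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//tag", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a single root element (<root>) wraps the tags, as XML requires.
        String xml = "<root><tag>apple</tag><b>hello</b><tag>orange</tag><tag>pear</tag></root>";
        System.out.println(getTagValues(xml)); // [apple, orange, pear]
    }
}
```

This only applies when the file really is XML. For custom bracket-style tags like the ones in the question, a parser of this kind will not help.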

Answer 2

Answered by Shirik

I prefix this reply with "you shouldn't use a regular expression to parse XML -- it's only going to result in edge cases that don't work right, and a forever-increasing-in-complexity regex while you try to fix it."

That being said, you need to proceed by matching the string and grabbing the group you want:

if (m.matches())
{
   String result = m.group(1);
   // do something with result
}
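One detail worth keeping in mind: `Matcher.matches()` succeeds only when the entire input fits the pattern, whereas `Matcher.find()` looks for the pattern anywhere in the input. The sketch below contrasts the two; the class name, helper names, and sample strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesVsFind {
    static final Pattern TAG = Pattern.compile("<tag>(.+?)</tag>");

    // matches() succeeds only if the whole input fits the pattern.
    static String viaMatches(String input) {
        Matcher m = TAG.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // find() succeeds if any part of the input fits the pattern.
    static String viaFind(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(viaMatches("<tag>apple</tag>"));          // apple
        System.out.println(viaMatches("see <tag>apple</tag> here")); // null
        System.out.println(viaFind("see <tag>apple</tag> here"));    // apple
    }
}
```

If the tagged text is surrounded by other content, or there may be several tags in one string, `find()` in a loop is probably the safer choice.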

Answer 3

Answered by jdc0589

To be quite honest, regular expressions are not the best idea for this type of parsing. The regular expression you posted will probably work great for simple cases, but if things get more complex you are going to have huge problems (the same reason why you can't reliably parse HTML with regular expressions). I know you probably don't want to hear this, and I know I didn't when I asked the same type of questions, but string parsing became WAY more reliable for me after I stopped trying to use regular expressions for everything.

jTopas is an AWESOME tokenizer that makes it quite easy to write parsers by hand (I STRONGLY suggest jTopas over the standard Java scanner/etc. libraries). If you want to see jTopas in action, here are some parsers I wrote using jTopas to parse this type of file.

If you are parsing XML files, you should be using an XML parser library. Don't do it yourself unless you are just doing it for fun; there are plenty of proven options out there.

Answer 4

Answered by Bibhuti Agarwal

    // Backslashes must be doubled inside a Java string literal
    final Pattern pattern = Pattern.compile("\\[tag\\](.+?)\\[/tag\\]");
    final Matcher matcher = pattern.matcher("[tag]String I want to extract[/tag]");
    matcher.find();
    System.out.println(matcher.group(1));
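Square brackets are regex metacharacters (they open a character class), so a pattern such as `[customtag](.+?)[/customtag]` does not match the literal tags. One way to avoid hand-escaping is `Pattern.quote`, which makes a string match literally. The following is a minimal sketch; the `forTag` and `extract` helpers are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketTagExtractor {
    // Builds a pattern for [name]...[/name], quoting the delimiters so they match literally.
    static Pattern forTag(String name) {
        String open = Pattern.quote("[" + name + "]");
        String close = Pattern.quote("[/" + name + "]");
        return Pattern.compile(open + "(.+?)" + close, Pattern.DOTALL);
    }

    // Collects the text between every opening and closing tag with the given name.
    static List<String> extract(String name, String input) {
        List<String> values = new ArrayList<>();
        Matcher m = forTag(name).matcher(input);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String input = "[customtag]String I want to extract[/customtag]";
        System.out.println(extract("customtag", input)); // [String I want to extract]
    }
}
```

Because the tag name is quoted rather than escaped by hand, the same helper should work for names containing other special characters as well.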

Answer 5

Answered by Gorky

A generic, simpler, and somewhat primitive approach to finding the tag, attributes, and value:

    // \w and the \1 backreference need doubled backslashes in a Java string literal
    Pattern pattern = Pattern.compile("<(\\w+)( +.+)*>((.*))</\\1>");
    System.out.println(pattern.matcher("<asd> TEST</asd>").find());
    System.out.println(pattern.matcher("<asd TEST</asd>").find());
    System.out.println(pattern.matcher("<asd attr='3'> TEST</asd>").find());
    System.out.println(pattern.matcher("<asd> <x>TEST<x>asd>").find());
    System.out.println("-------");
    Matcher matcher = pattern.matcher("<as x> TEST</as>");
    if (matcher.find()) {
        for (int i = 0; i <= matcher.groupCount(); i++) {
            System.out.println(i + ":" + matcher.group(i));
        }
    }
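A variation on the same idea, sketched with named groups so the pieces can be read by name rather than by index. The pattern below is an assumption of mine rather than the answer's original regex: it uses `[^>]*` for the attribute text and a lazy `.*?` for the value, and it will not handle nested elements with the same name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagParts {
    // Named groups: tag name, raw attribute text (possibly empty), and inner value.
    static final Pattern ELEMENT = Pattern.compile(
            "<(?<tag>\\w+)(?<attrs>(?:\\s+[^>]*)?)>(?<value>.*?)</\\k<tag>>",
            Pattern.DOTALL);

    // Returns {tag, attrs, value} for the first element found, or null if none matches.
    static String[] firstElement(String input) {
        Matcher m = ELEMENT.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("tag"), m.group("attrs").trim(), m.group("value") };
    }

    public static void main(String[] args) {
        String[] parts = firstElement("<asd attr='3'> TEST</asd>");
        System.out.println(String.join("|", parts)); // asd|attr='3'| TEST
    }
}
```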

Answer 6

Answered by Heriberto Rivera

Try this:

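// any_tag is a placeholder for the tag name; .*? is lazy so each element is matched separately
Pattern p = Pattern.compile("(?<=\\<(any_tag)\\>)(\\s*.*?\\s*)(?=\\<\\/(any_tag)\\>)");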
Matcher m = p.matcher(anyString);

For example:

String str = "<TR> <TD>1Q Ene</TD> <TD>3.08%</TD> </TR>";
// .*? is lazy; a greedy .* would run from the first <TD> to the last </TD>
Pattern p = Pattern.compile("(?<=\\<TD\\>)(\\s*.*?\\s*)(?=\\<\\/TD\\>)");
Matcher m = p.matcher(str);
while (m.find()) {
   Log.e("Regex", " Regex result: " + m.group());
}

Output:

1Q Ene

3.08%

Answer 7

Answered by Shubham Khurana

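    String s = "<B><G>Test</G></B><C>Test1</C>";

    // Backslashes are doubled for the Java string literal; \1 refers back to the tag name
    String pattern = "\\<(.+)\\>([^\\<\\>]+)\\<\\/\\1\\>";

    int count = 0;

    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(s);
    while (m.find())
    {
        System.out.println(m.group(2)); // text of each innermost element
        count++;
    }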
