Java regex to strip out XML tags, but not tag contents

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15769028/

Time: 2020-10-31 20:48:26  Source: igfitidea

Tags: java, xml, regex, string

Asked by IAmYourFaja

I have the following Java code:

str = str.replaceAll("<.*?>.*?</.*?>|<.*?/>", "");

This turns a String like so:

How now <fizz>brown</fizz> cow.

Into:

How now  cow.

However, I want it to just strip the <fizz> and </fizz> tags, or just standalone </fizz> tags, and leave the element's content alone. So, a regex that would turn the above into:

How now brown cow.

Or, using a more complex String, something that turns:

How <buzz>now <fizz>brown</fizz><yoda/></buzz> cow.

Into:

How now brown cow.

I tried this:

str = str.replaceAll("<.*?></.*?>|<.*?/>", "");

And that doesn't work at all. Any ideas? Thanks in advance!

Answered by Sam Barnum

"How now <fizz>brown</fizz> cow.".replaceAll("<[^>]+>", "")

Answered by TheEwook

You were almost there ;)

Try this:

str = str.replaceAll("<.*?>", "");

Answered by Sergiu Toarca

While there are other correct answers, none give any explanation.

The reason your regex <.*?>.*?</.*?>|<.*?/> doesn't work is that it selects every tag together with everything inside it. You can see that in action on debuggex.

The reason your second attempt <.*?></.*?>|<.*?/> doesn't work is that it selects from the beginning of a tag up to the first closing tag that directly follows another tag. That is kind of a mouthful, but you can understand better what's going on in this example.
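
To make both failures concrete, here is a small sketch that prints what each pattern actually matches. The class and method names (FailedAttemptsDemo, firstMatch) are made up for illustration; only the two patterns and the sample strings come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the question.
class FailedAttemptsDemo {

    // Returns the first substring of input matched by regex, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String simple = "How now <fizz>brown</fizz> cow.";
        String nested = "How <buzz>now <fizz>brown</fizz><yoda/></buzz> cow.";
        String firstAttempt = "<.*?>.*?</.*?>|<.*?/>";
        String secondAttempt = "<.*?></.*?>|<.*?/>";

        // The first attempt swallows the element content along with the tags.
        System.out.println(firstMatch(firstAttempt, simple));  // <fizz>brown</fizz>

        // The second attempt needs a ">" immediately followed by "</",
        // which never occurs in the simple string, so nothing is matched.
        System.out.println(firstMatch(secondAttempt, simple)); // null

        // In the nested string the first "></" is in "<yoda/></buzz>",
        // so the match runs from "<buzz>" all the way to "</buzz>".
        System.out.println(firstMatch(secondAttempt, nested)); // <buzz>now <fizz>brown</fizz><yoda/></buzz>
    }
}
```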

The regex you need is much simpler: <.*?>. It simply selects every tag, regardless of whether it is an opening or a closing tag.
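
As a rough sketch of how this could be packaged, the helper below applies the lazy pattern from this answer next to the negated-class pattern <[^>]+> from Sam Barnum's answer. The class and method names (TagStripper, stripLazy, stripNegated) and the two-line test string are assumptions added for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the answers.
class TagStripper {

    // Lazy form: '<', as few characters as possible, then '>'.
    private static final Pattern LAZY = Pattern.compile("<.*?>");

    // Negated-class form: '<', one or more characters that are not '>', then '>'.
    private static final Pattern NEGATED = Pattern.compile("<[^>]+>");

    static String stripLazy(String input) {
        return LAZY.matcher(input).replaceAll("");
    }

    static String stripNegated(String input) {
        return NEGATED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String nested = "How <buzz>now <fizz>brown</fizz><yoda/></buzz> cow.";
        System.out.println(stripLazy(nested));    // How now brown cow.
        System.out.println(stripNegated(nested)); // How now brown cow.

        // '.' does not match line terminators by default, so only the
        // negated-class form removes a tag that spans two lines.
        String multiLine = "a <b\nc> d";
        System.out.println(stripLazy(multiLine).equals(multiLine)); // true
        System.out.println(stripNegated(multiLine));                // a  d
    }
}
```

Both patterns treat the input as plain text rather than XML, so comments, CDATA sections, or a > inside an attribute value would not be handled correctly.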

Answered by Sarath Kumar Sivan

You can try this too:

str = str.replaceAll("<.*?>", "");

See the example below for a better understanding:

public class StringUtils {

    public static void main(String[] args) {
        System.out.println(StringUtils.replaceAll("How now <fizz>brown</fizz> cow."));
        System.out.println(StringUtils.replaceAll("How <buzz>now <fizz>brown</fizz><yoda/></buzz> cow."));
    }

    public static String replaceAll(String strInput) {
        return strInput.replaceAll("<.*?>", "");
    }
}

Output:

How now brown cow.
How now brown cow.

Answered by Gayathry

This isn't elegant, but it is easy to follow. The code below removes the start and end XML tags when they appear together on one line:

<url>"www.xml.com"</url> , <body>"This is xml"</body>

Regex:

to_replace='<\w*>|<\/\w*>',value="" 
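
The pattern above is given as keyword-style arguments rather than as Java. A rough Java equivalent might look like the sketch below, assuming the closing tags are written with a forward slash (</url>); the class and method names (SimpleTagRemover, removeSimpleTags) are hypothetical.

```java
// Hypothetical demo class; only the pattern comes from the answer,
// rewritten here as a Java string literal.
class SimpleTagRemover {

    // "<\\w*>" matches an opening tag made only of word characters;
    // "</\\w*>" matches the corresponding closing tag.
    static String removeSimpleTags(String line) {
        return line.replaceAll("<\\w*>|</\\w*>", "");
    }

    public static void main(String[] args) {
        String line = "<url>\"www.xml.com\"</url> , <body>\"This is xml\"</body>";
        System.out.println(removeSimpleTags(line)); // "www.xml.com" , "This is xml"

        // Self-closing tags and tags with attributes are not matched by this pattern.
        System.out.println(removeSimpleTags("brown</fizz><yoda/>")); // brown<yoda/>
    }
}
```

Because \w only covers letters, digits, and underscores, this pattern is narrower than <.*?> and leaves tags such as <yoda/> in place.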

Answered by Devarsh Modi

If you want to parse an XML log file, you can do it with a regex in Java such as <[^<]+<. Given <name>DEV</name>, you get output like name>DEV. You just have to play with the regex.
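
One way to read this suggestion is sketched below. The capturing group is an assumption added here so that the text between the two < characters can be pulled out; the class and method names (LogFieldExtractor, extract) are likewise hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the parentheses in the pattern are an added assumption.
class LogFieldExtractor {

    // Captures everything between a '<' and the next '<'.
    private static final Pattern FIELD = Pattern.compile("<([^<]+)<");

    // Returns the captured text of the first match, or null when nothing matches.
    static String extract(String line) {
        Matcher m = FIELD.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String raw = extract("<name>DEV</name>");
        System.out.println(raw); // name>DEV

        // Splitting on the first '>' separates the tag name from its value.
        String[] parts = raw.split(">", 2);
        System.out.println(parts[0] + " = " + parts[1]); // name = DEV
    }
}
```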
