Parsing XML with REGEX in Java

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original question on Stack Overflow: http://stackoverflow.com/questions/335250/

Time: 2020-08-11 13:34:35  Source: igfitidea

Tags: java, xml, regex

Asked by Mocky

Given the below XML snippet I need to get a list of name/value pairs for each child under DataElements. XPath or an XML parser cannot be used for reasons beyond my control so I am using regex.

<?xml version="1.0"?>
<StandardDataObject xmlns="myns">
  <DataElements>
    <EmpStatus>2.0</EmpStatus>
    <Expenditure>95465.00</Expenditure>
    <StaffType>11.A</StaffType>
    <Industry>13</Industry>
  </DataElements>
  <InteractionElements>
    <TargetCenter>92f4-MPA</TargetCenter>
    <Trace>7.19879</Trace>
  </InteractionElements>
</StandardDataObject>

The output I need is: [{EmpStatus:2.0}, {Expenditure:95465.00}, {StaffType:11.A}, {Industry:13}]

The tag names under DataElements are dynamic and so cannot be expressed literally in the regex. The tag names TargetCenter and Trace are static and could be in the regex but if there is a way to avoid hardcoding that would be preferable.

"<([A-Za-z0-9]+?)>([A-Za-z0-9.]*?)</"

This is the regex I have constructed, and its problem is that it erroneously includes {Trace:7.19879} in the results. Relying on newlines within the XML or any other apparent formatting is not an option.

Below is an approximation of the Java code I am using:

private static final Pattern PATTERN_1 = Pattern.compile(..REGEX..);
private List<DataElement> listDataElements(CharSequence cs) {
    List<DataElement> list = new ArrayList<DataElement>();
    Matcher matcher = PATTERN_1.matcher(cs);
    while (matcher.find()) {
        list.add(new DataElement(matcher.group(1), matcher.group(2)));
    }
    return list;
}

How can I change my regex to only include data elements and ignore the rest?

Accepted answer by Jan Goyvaerts

This should work in Java, if you can assume that between the DataElements tags, everything has the form <tag>value</tag>, i.e. no attributes and no nested elements.

// Stage 1: isolate the contents of the DataElements element.
Pattern regex = Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
Matcher matcher = regex.matcher(subjectString);
// Stage 2: match <name>value</name>; \\1 is a backreference to the opening tag name.
Pattern regex2 = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");
if (matcher.find()) {
    String dataElements = matcher.group(1);
    Matcher matcher2 = regex2.matcher(dataElements);
    while (matcher2.find()) {
        list.add(new DataElement(matcher2.group(1), matcher2.group(2)));
    }
}
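
The two-stage approach above can be sketched as a self-contained class. The class name DataElementsRegex, the extract helper, and the "name:value" string output are assumptions that stand in for the question's DataElement type, which is not shown. Note that in a Java string literal the backreference has to be written as \\1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DataElementsRegex {
    // Stage 1: capture everything between the DataElements tags (DOTALL lets '.' span newlines).
    private static final Pattern BLOCK =
            Pattern.compile("<DataElements>(.*?)</DataElements>", Pattern.DOTALL);
    // Stage 2: <name>value</name>; \\1 requires the closing tag to repeat the opening name.
    private static final Pattern PAIR = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");

    /** Returns "name:value" strings for each simple child of the first DataElements block. */
    static List<String> extract(String xml) {
        List<String> pairs = new ArrayList<>();
        Matcher block = BLOCK.matcher(xml);
        if (block.find()) {
            Matcher pair = PAIR.matcher(block.group(1));
            while (pair.find()) {
                pairs.add(pair.group(1) + ":" + pair.group(2));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><StandardDataObject xmlns=\"myns\">"
                + "<DataElements><EmpStatus>2.0</EmpStatus><Expenditure>95465.00</Expenditure>"
                + "<StaffType>11.A</StaffType><Industry>13</Industry></DataElements>"
                + "<InteractionElements><TargetCenter>92f4-MPA</TargetCenter>"
                + "<Trace>7.19879</Trace></InteractionElements></StandardDataObject>";
        // Prints [EmpStatus:2.0, Expenditure:95465.00, StaffType:11.A, Industry:13]
        System.out.println(extract(xml));
    }
}
```

This still relies on the stated assumptions: no attributes, no nested elements, and no empty values inside DataElements.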

Answer by Alnitak

Is there any reason you're not using a proper XML parser instead of regexes? This would be trivial with the right library.

Answer by activout.se

Use XPath instead!
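
For readers who are able to use XPath, a minimal sketch with the JDK's built-in javax.xml.xpath API might look like the following. The class name DataElementsXPath and the "name:value" output are illustrative assumptions. Because the sample document declares a default namespace (xmlns="myns"), the expression matches on local-name() rather than registering a namespace context.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DataElementsXPath {
    /** Returns "name:value" strings for each child element of DataElements. */
    static List<String> extract(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so local-name() and getLocalName() are reliable
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select every element child of any element whose local name is DataElements.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//*[local-name()='DataElements']/*", doc, XPathConstants.NODESET);

            List<String> pairs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                pairs.add(node.getLocalName() + ":" + node.getTextContent());
            }
            return pairs;
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the sketch short.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }
}
```

With the question's sample XML this should yield the four DataElements children and nothing from InteractionElements.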

Answer by Greg

You really should be using an XML library for this.

If you have to use RE, why not do it in two stages? First <DataElements>.*?</DataElements>, then what you have now.

Answer by Guemundur Bjarni

Sorry to give you yet another "Don't use regex" answer, but seriously: please use Commons-Digester, JAXP (bundled with Java 5+), or JAXB (bundled with Java 6+), as it will save you from a boatload of hurt.
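
As an illustration of the JAXP option, here is a minimal DOM-based sketch using only classes bundled with the JDK. The class name DataElementsDom and the "name:value" output format are assumptions made for the example, not part of the original answer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DataElementsDom {
    /** Returns "name:value" strings for each child element of the first DataElements element. */
    static List<String> extract(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<String> pairs = new ArrayList<>();
            NodeList blocks = doc.getElementsByTagName("DataElements");
            if (blocks.getLength() == 0) {
                return pairs; // no DataElements element: nothing to report
            }
            NodeList children = blocks.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace text nodes, comments, and other non-element nodes.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    pairs.add(child.getNodeName() + ":" + child.getTextContent());
                }
            }
            return pairs;
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Unlike the regex, the parser deals with comments, CDATA sections, and attributes itself, so the tag names never need to be hardcoded.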

Answer by Dour High Arch

XML is not a regular language. You cannot parse it using a regular expression. An expression you think will work will break when you get nested tags; then, when you fix that, it will break on XML comments, then CDATA sections, then processor directives, then namespaces, ... It cannot work; use an XML parser.
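
Two of these failure modes can be demonstrated with the element pattern from the accepted answer. This is only an illustrative sketch; the class name RegexPitfalls and the pairs helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPitfalls {
    // Same <name>value</name> pattern as in the accepted answer.
    private static final Pattern PAIR = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");

    /** Returns "name:value" for every match of the simple element pattern. */
    static List<String> pairs(String fragment) {
        List<String> found = new ArrayList<>();
        Matcher m = PAIR.matcher(fragment);
        while (m.find()) {
            found.add(m.group(1) + ":" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // A commented-out element is still matched, although a parser would ignore it.
        System.out.println(pairs("<!-- <Old>1</Old> --><EmpStatus>2.0</EmpStatus>"));
        // prints [Old:1, EmpStatus:2.0]

        // A CDATA value containing '<' is missed entirely.
        System.out.println(pairs("<Note><![CDATA[a < b]]></Note>"));
        // prints []
    }
}
```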

Answer by James Van Huis

You should listen to everyone. A lightweight parser is a bad idea.

However, if you are really that hardheaded about it, you should be able to tweak your code to exclude the tags outside of the DataElements tag.

private static final Pattern PATTERN_1 = Pattern.compile(..REGEX..);
private static final String START_TAG = "<DataElements>";
private static final String END_TAG = "</DataElements>";
private List<DataElement> listDataElements(String input) {
    String cs = input.substring(input.indexOf(START_TAG) + START_TAG.length(), input.indexOf(END_TAG));
    List<DataElement> list = new ArrayList<DataElement>();
    Matcher matcher = PATTERN_1.matcher(cs);
    while (matcher.find()) {
        list.add(new DataElement(matcher.group(1), matcher.group(2)));
    }
    return list;
}

This will fail horribly if the DataElements tag does not exist.
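
The failure comes from indexOf returning -1 when a tag is missing, which makes substring throw a StringIndexOutOfBoundsException. A guarded variant could look like the sketch below. The class name DataElementsSlice, the slice helper, and the choice to return an empty string are assumptions; throwing a descriptive exception would be equally reasonable.

```java
public class DataElementsSlice {
    private static final String START_TAG = "<DataElements>";
    private static final String END_TAG = "</DataElements>";

    /** Returns the text between the DataElements tags, or "" if either tag is missing. */
    static String slice(String input) {
        int start = input.indexOf(START_TAG);
        if (start < 0) {
            return ""; // opening tag not found
        }
        int from = start + START_TAG.length();
        // Search for the closing tag only after the opening tag.
        int end = input.indexOf(END_TAG, from);
        if (end < 0) {
            return ""; // closing tag not found
        }
        return input.substring(from, end);
    }
}
```

The returned slice can then be passed to PATTERN_1.matcher(...) exactly as in the code above.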

Once again, this is a bad idea, and you will likely be revisiting this piece of code some time in the future in the form of a bug report.

Answer by Amith Perera

Try loading the regex from a property file and then creating the Pattern object from it. That resolved the same issue for me when I was injecting a regex via XML beans.

Example: I needed to use the regex '(.)(D[0-9]{7}.D[0-9]{9}.D[A-Z]{3}[0-9]{4})(.)' by injecting it in Spring, but it didn't work. When I used the same regex hard-coded in a Java class, it worked.

Pattern pattern = Pattern.compile("(.)(D[0-9]{7}.D[0-9]{9}.D[A-Z]{2}[0-9]{4})(.)");
Matcher matcher = pattern.matcher(file.getName().trim());

Next I tried to load that regex via a property file while injecting it. It worked fine.

  p:remoteDirectory="${rawDailyReport.remote.download.dir}"
  p:localDirectory="${rawDailyReport.local.valid.dir}"
  p:redEx="${rawDailyReport.download.regex}"

And in the property file the property is defined as follows.

rawDailyReport.download.regex=(.)(D[0-9]{7}\.D[0-9]{9}\.D[A-Z]{2}[0-9]{4})(.)

This is because the values with place holders are loaded through org.springframework.beans.factory.config.PropertyPlaceholderConfigurer and it handles these xml sensitive characters internally.
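
Outside Spring, the same idea of keeping the regex in a property file can be sketched with plain java.util.Properties. The class name RegexFromProperties, the loadPattern helper, and the report.regex key are hypothetical. One detail to watch: Properties.load treats the backslash as an escape character and silently drops it before an ordinary character, so a literal dot has to be written as \\. in the file to reach the Pattern as \.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Pattern;

public class RegexFromProperties {
    /** Parses properties-format text and compiles the regex stored under the given key. */
    static Pattern loadPattern(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Assumes the key is present; a missing key would cause a NullPointerException here.
        return Pattern.compile(props.getProperty(key));
    }

    public static void main(String[] args) {
        // The file content represented by this literal is: report.regex=D[0-9]{7}\\.D[0-9]{9}
        String fileText = "report.regex=D[0-9]{7}\\\\.D[0-9]{9}\n";
        Pattern p = loadPattern(fileText, "report.regex");
        System.out.println(p.pattern());                                 // D[0-9]{7}\.D[0-9]{9}
        System.out.println(p.matcher("D1234567.D123456789").matches()); // true
        System.out.println(p.matcher("D1234567XD123456789").matches()); // false
    }
}
```

In a real application the text would come from a file or classpath resource instead of an inline string.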

Thanks, Amith
