Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/4237625/

Date: 2020-08-14 14:53:15  Source: igfitidea

Removing invalid XML characters from a string in Java

Tags: java, xml, regex, invalid-characters

Asked by yossi

Hi, I would like to remove all invalid XML characters from a string. I would like to use a regular expression with the string.replace method, like this:

line.replace(regExp,"");

What is the right regExp to use?

An invalid XML character is anything that is not in one of these ranges:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Thanks.

Accepted answer by McDowell

Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars.

Here is the pattern for removing characters that are illegal in XML 1.0:

// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
                    + "\u0009\r\n"
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]";

Most people will want the XML 1.0 version.
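
As a rough usage sketch (not part of the original answer), the XML 1.0 pattern can be compiled once and reused, which also shows that the surrogate-pair range is treated as whole code points. The class name Xml10Cleaner and the method name clean are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class Xml10Cleaner {
    // Negated character class: matches anything NOT allowed by the XML 1.0 Char production.
    // The last range is written as two surrogate pairs and covers U+10000 to U+10FFFF.
    private static final Pattern INVALID_XML10 = Pattern.compile("[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]");

    // Removes every character that the pattern above flags as invalid.
    static String clean(String in) {
        return INVALID_XML10.matcher(in).replaceAll("");
    }

    public static void main(String[] args) {
        // A supplementary character (outside the BMP), stored as a surrogate pair.
        String smiley = new String(Character.toChars(0x1F600));
        String dirty = "Hello\u0000, " + smiley + "\u000BWorld!";
        // The NUL and the vertical tab are dropped; the supplementary character is kept.
        System.out.println(clean(dirty));
    }
}
```

Precompiling avoids recompiling the regex on every call, which String.replaceAll would otherwise do.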

Here is the pattern for removing characters that are illegal in XML 1.1:

// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
                    + "\u0001-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]+";
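
To make the difference between the two patterns concrete, here is a small sketch (an illustration, not part of the original answer) that runs the same input through both. The class and method names are assumptions; the pattern strings are the ones given above, written on one line.

```java
// Hypothetical comparison harness; names are illustrative only.
class XmlPatternComparison {
    // Same character classes as the xml10pattern and xml11pattern strings above.
    static final String XML10 = "[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\ud800\udc00-\udbff\udfff]";
    static final String XML11 = "[^\u0001-\uD7FF\uE000-\uFFFD\ud800\udc00-\udbff\udfff]+";

    static String stripXml10(String in) {
        return in.replaceAll(XML10, "");
    }

    static String stripXml11(String in) {
        return in.replaceAll(XML11, "");
    }

    public static void main(String[] args) {
        // Contains NUL, U+0001 and U+001F between the letters.
        String input = "a\u0000b\u0001c\u001Fd";
        // XML 1.0 drops all three control characters, leaving 4 chars.
        System.out.println(stripXml10(input).length());
        // XML 1.1 drops only NUL, leaving 6 chars.
        System.out.println(stripXml11(input).length());
    }
}
```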

You will need to use String.replaceAll(...) and not String.replace(...).

String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, ""); // or xml11pattern

Answer by AlexR

I believe that the following articles may help you.

http://commons.apache.org/lang/api-2.1/org/apache/commons/lang/StringEscapeUtils.html
http://www.javapractices.com/topic/TopicAction.do?Id=96

In short, try using StringEscapeUtils from the Jakarta project.

Answer by Renaud

From Mark McLaren's Weblog

  /**
   * This method ensures that the output String has only
   * valid XML unicode characters as specified by the
   * XML 1.0 standard. For reference, please see
   * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
   * standard</a>. This method will return an empty
   * String if the input is null or empty.
   *
   * @param in The String whose non-valid characters we want to remove.
   * @return The in String, stripped of non-valid characters.
   */
  public static String stripNonValidXMLCharacters(String in) {
      StringBuffer out = new StringBuffer(); // Used to hold the output.
      char current; // Used to reference the current character.

      if (in == null || ("".equals(in))) return ""; // vacancy test.
      for (int i = 0; i < in.length(); i++) {
          current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
          if ((current == 0x9) ||
              (current == 0xA) ||
              (current == 0xD) ||
              ((current >= 0x20) && (current <= 0xD7FF)) ||
              ((current >= 0xE000) && (current <= 0xFFFD)) ||
              ((current >= 0x10000) && (current <= 0x10FFFF))) // never true for a 16-bit char
              out.append(current);
      }
      return out.toString();
  }

Answer by Jun

Should we consider surrogate characters? Otherwise '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true.

I also tested this, and the regex approach seems slower than the following loop.

if (null == text || text.isEmpty()) {
    return text;
}
final int len = text.length();
char current = 0;
int codePoint = 0;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < len; i++) {
    current = text.charAt(i);
    boolean surrogate = false;
    if (Character.isHighSurrogate(current)
            && i + 1 < len && Character.isLowSurrogate(text.charAt(i + 1))) {
        // Valid surrogate pair: read the full code point and skip to the low surrogate.
        surrogate = true;
        codePoint = text.codePointAt(i++);
    } else {
        codePoint = current;
    }
    if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
            || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
            || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
            || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
        sb.append(current);
        if (surrogate) {
            sb.append(text.charAt(i));
        }
    }
}

Answer by Vlasec

Jun's solution, simplified. Using StringBuffer#appendCodePoint(int), I need no char current or String#charAt(int). I can tell a surrogate pair by checking if codePoint is greater than 0xFFFF.

(It is not necessary to do the i++, since a low surrogate wouldn't pass the filter. But then one would re-use the code for different code points and it would fail. I prefer programming to hacking.)

StringBuilder sb = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
    int codePoint = text.codePointAt(i);
    if (codePoint > 0xFFFF) {
        i++; // skip the low surrogate of the pair
    }
    if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
            || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
            || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
            || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
        sb.appendCodePoint(codePoint);
    }
}

Answer by Roger F. Gay

From Best way to encode text data for XML in Java?

String xmlEscapeText(String t) {
   StringBuilder sb = new StringBuilder();
   for(int i = 0; i < t.length(); i++){
      char c = t.charAt(i);
      switch(c){
      case '<': sb.append("&lt;"); break;
      case '>': sb.append("&gt;"); break;
      case '\"': sb.append("&quot;"); break;
      case '&': sb.append("&amp;"); break;
      case '\'': sb.append("&apos;"); break;
      default:
         if(c>0x7e) {
            sb.append("&#"+((int)c)+";");
         }else
            sb.append(c);
      }
   }
   return sb.toString();
}

Answer by Roger F. Gay

If you want to store text elements with the forbidden characters in XML-like form, you can use XPL instead. The dev-kit provides concurrent XPL to XML and XML processing - which means no time cost to the translation from XPL to XML. Or, if you don't need the full power of XML (namespaces), you can just use XPL.

Web Page: HLL XPL

Answer by Nicholas DiPiazza

So far, all these answers only replace the characters themselves. But sometimes an XML document will contain invalid XML entity sequences that result in errors. For example, if you have &#2; in your XML, a Java XML parser will throw Illegal character entity: expansion character (code 0x2 at ....

Here is a simple Java program that can replace those invalid entity sequences.

  // Requires: java.util.HashSet, java.util.Set, java.util.regex.Matcher, java.util.regex.Pattern
  public final Pattern XML_ENTITY_PATTERN = Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

  /**
   * Remove problematic xml entities from the xml string so that you can parse it with java DOM / SAX libraries.
   */
  String getCleanedXml(String xmlString) {
    Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
    Set<String> replaceSet = new HashSet<>();
    while (m.find()) {
      String group = m.group(1); // hexadecimal reference, e.g. &#x2;
      int val;
      if (group != null) {
        val = Integer.parseInt(group, 16);
        if (isInvalidXmlChar(val)) {
          replaceSet.add("&#x" + group + ";");
        }
      } else if ((group = m.group(2)) != null) { // decimal reference, e.g. &#2;
        val = Integer.parseInt(group);
        if (isInvalidXmlChar(val)) {
          replaceSet.add("&#" + group + ";");
        }
      }
    }
    String cleanedXmlString = xmlString;
    for (String replacer : replaceSet) {
      cleanedXmlString = cleanedXmlString.replaceAll(replacer, "");
    }
    return cleanedXmlString;
  }

  private boolean isInvalidXmlChar(int val) {
    if (val == 0x9 || val == 0xA || val == 0xD ||
            val >= 0x20 && val <= 0xD7FF ||
            val >= 0xE000 && val <= 0xFFFD ||
            val >= 0x10000 && val <= 0x10FFFF) {
      return false;
    }
    return true;
  }

Answer by Hans Schreuder

// xmlData is an existing String; it is reassigned to the filtered result.
xmlData = xmlData.codePoints().filter(c -> isValidXMLChar(c)).collect(StringBuilder::new,
                StringBuilder::appendCodePoint, StringBuilder::append).toString();

private boolean isValidXMLChar(int c) {
    if((c == 0x9) ||
       (c == 0xA) ||
       (c == 0xD) ||
       ((c >= 0x20) && (c <= 0xD7FF)) ||
       ((c >= 0xE000) && (c <= 0xFFFD)) ||
       ((c >= 0x10000) && (c <= 0x10FFFF)))
    {
        return true;
    }
    return false;
}
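
For anyone who wants to try the code-point filtering idea from the answers above in isolation, here is a minimal self-contained sketch, assuming Java 8 or later. The class name XmlCodePointFilter and its method names are hypothetical; the range check is the XML 1.0 Char production quoted in the accepted answer.

```java
// Hypothetical, self-contained harness; names are illustrative only.
class XmlCodePointFilter {
    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isValidXml10(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }

    // Keeps only valid code points. Unpaired surrogates are reported by
    // codePoints() as values in 0xD800..0xDFFF, so the range check drops them.
    static String clean(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(XmlCodePointFilter::isValidXml10)
         .forEachOrdered(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String smiley = new String(Character.toChars(0x1F600)); // supplementary char
        // NUL and the lone high surrogate are removed; the smiley survives.
        System.out.println(clean("ok\u0000 " + smiley + " \uD800end"));
    }
}
```

Working on code points rather than chars means the 0x10000 to 0x10FFFF branch is actually reachable, which is the problem Jun pointed out in the char-based loop.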