Removing invalid XML characters from a string in Java
Original question: http://stackoverflow.com/questions/4237625/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by yossi
Hi, I would like to remove all invalid XML characters from a string, using a regular expression with the string.replace method, like this:
line.replace(regExp,"");
What is the right regExp to use?
An invalid XML character is anything that is not in one of these ranges:
[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Thanks.
Accepted answer by McDowell
Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars.
Here is the pattern for removing characters that are illegal in XML 1.0:
// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
        + "\u0009\r\n"
        + "\u0020-\uD7FF"
        + "\uE000-\uFFFD"
        + "\ud800\udc00-\udbff\udfff"
        + "]";
Most people will want the XML 1.0 version.
Here is the pattern for removing characters that are illegal in XML 1.1:
// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
        + "\u0001-\uD7FF"
        + "\uE000-\uFFFD"
        + "\ud800\udc00-\udbff\udfff"
        + "]+";
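To see how the two patterns differ in practice, here is a small sketch. The class name `XmlPatternDemo`, the constant names, and the sample string are illustrative and not part of the answer. It spells the same character classes with regex-level escapes such as `\\x{10000}` instead of surrogate pairs, which assumes Java 7 or later.

```java
import java.util.regex.Pattern;

// Illustrative sketch: class, constant, and method names are assumptions.
final class XmlPatternDemo {
    // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static final Pattern XML10_ILLEGAL = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // XML 1.1: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static final Pattern XML11_ILLEGAL = Pattern.compile(
            "[^\\x01-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Removes every character that the XML 1.0 ranges do not allow.
    static String stripXml10(String s) {
        return XML10_ILLEGAL.matcher(s).replaceAll("");
    }

    // Removes every character that the XML 1.1 ranges do not allow.
    static String stripXml11(String s) {
        return XML11_ILLEGAL.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "a\u0001b\u0000c"; // contains U+0001 and U+0000
        System.out.println(stripXml10(sample).length()); // 3: both control characters removed
        System.out.println(stripXml11(sample).length()); // 4: U+0001 kept, U+0000 removed
    }
}
```

Compiling each pattern once and reusing it avoids recompiling the regex on every call, which is what `String.replaceAll` does internally.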
You will need to use String.replaceAll(...) and not String.replace(...), because replace treats its first argument as literal text rather than as a regular expression.
// NUL is used here as an example of an illegal character.
String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, ""); // or xml11pattern

/**
 * This method ensures that the output String has only
 * valid XML unicode characters as specified by the
 * XML 1.0 standard. For reference, please see
 * <a href="http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char">the
 * standard</a>. This method will return an empty
 * String if the input is null or empty.
 *
 * @param in The String whose non-valid characters we want to remove.
 * @return The in String, stripped of non-valid characters.
 */
public static String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.
    if (in == null || ("".equals(in))) return ""; // vacancy test.
    for (int i = 0; i < in.length(); i++) {
        current = in.charAt(i); // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        if ((current == 0x9) ||
            (current == 0xA) ||
            (current == 0xD) ||
            ((current >= 0x20) && (current <= 0xD7FF)) ||
            ((current >= 0xE000) && (current <= 0xFFFD)) ||
            ((current >= 0x10000) && (current <= 0x10FFFF))) // never true for a 16-bit char
            out.append(current);
    }
    return out.toString();
}
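The difference between the two methods can be sketched as follows. `String.replace` looks for its first argument as literal text, so a character-class pattern never matches ordinary input, whereas `String.replaceAll` compiles it as a regular expression. The class name `ReplaceVsReplaceAll`, the method names, and the sample input are illustrative, and the XML 1.0 character class is written with regex-level escapes (assuming Java 7 or later) rather than copied from the answer.

```java
// Illustrative sketch: names and sample input are assumptions.
final class ReplaceVsReplaceAll {
    // Same ranges as the XML 1.0 pattern, written with regex-level escapes.
    static final String XML10_PATTERN =
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]";

    // Literal replacement: searches for the pattern text itself, so nothing is removed here.
    static String withReplace(String s) {
        return s.replace(XML10_PATTERN, "");
    }

    // Regex replacement: removes every character outside the allowed ranges.
    static String withReplaceAll(String s) {
        return s.replaceAll(XML10_PATTERN, "");
    }

    public static void main(String[] args) {
        String illegal = "Hello, World!\u0000";
        System.out.println(withReplace(illegal).length());    // 14: input unchanged
        System.out.println(withReplaceAll(illegal).length()); // 13: NUL removed
    }
}
```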
Answer by AlexR
I believe that the following articles may help you.
http://commons.apache.org/lang/api-2.1/org/apache/commons/lang/StringEscapeUtils.html
http://www.javapractices.com/topic/TopicAction.do?Id=96
In short, try using StringEscapeUtils from the Jakarta project.
Answer by Renaud
// Method name is illustrative; the original snippet shows only the body.
static String stripInvalidXmlChars(String text) {
    if (null == text || text.isEmpty()) {
        return text;
    }
    final int len = text.length();
    char current = 0;
    int codePoint = 0;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < len; i++) {
        current = text.charAt(i);
        boolean surrogate = false;
        if (Character.isHighSurrogate(current)
                && i + 1 < len && Character.isLowSurrogate(text.charAt(i + 1))) {
            surrogate = true;
            codePoint = text.codePointAt(i++); // read the full pair, then advance to the low surrogate
        } else {
            codePoint = current;
        }
        if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
                || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
                || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
                || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
            sb.append(current);
            if (surrogate) {
                sb.append(text.charAt(i)); // append the low surrogate as well
            }
        }
    }
    return sb.toString();
}
Answer by Jun
Should we consider surrogate characters? Otherwise '(current >= 0x10000) && (current <= 0x10FFFF)' will never be true.
I also tested this, and the regex approach seems slower than the following loop.
StringBuilder sb = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
    int codePoint = text.codePointAt(i);
    if (codePoint > 0xFFFF) {
        i++;
    }
    if ((codePoint == 0x9) || (codePoint == 0xA) || (codePoint == 0xD)
            || ((codePoint >= 0x20) && (codePoint <= 0xD7FF))
            || ((codePoint >= 0xE000) && (codePoint <= 0xFFFD))
            || ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
        sb.appendCodePoint(codePoint);
    }
}
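The surrogate issue raised in this answer can be checked directly. A Java `char` is 16 bits wide, so it can never hold a value of 0x10000 or more; a supplementary character is stored as two surrogate `char`s, each of which falls in the excluded 0xD800-0xDFFF gap. The following sketch contrasts a `char`-based filter with a code-point-based one; the class and method names are illustrative.

```java
// Illustrative sketch: class and method names are assumptions.
final class SurrogateDemo {
    // XML 1.0 Char ranges, applied to a full code point.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Tests each 16-bit char on its own, so both halves of a surrogate pair are dropped.
    static String filterByChar(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isXml10Char(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Tests whole code points, so valid supplementary characters survive.
    static String filterByCodePoint(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // 1 for BMP characters, 2 for surrogate pairs
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(filterByChar(s).length());      // 2: the pair was removed
        System.out.println(filterByCodePoint(s).length()); // 4: the pair was kept
    }
}
```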
Answer by Vlasec
Jun's solution, simplified. Using StringBuffer#appendCodePoint(int), I need no char current or String#charAt(int). I can tell a surrogate pair by checking if codePoint is greater than 0xFFFF.
(It is not necessary to do the i++, since a low surrogate wouldn't pass the filter. But then one would re-use the code for different code points and it would fail. I prefer programming to hacking.)
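The remark about `i++` can be verified with a small experiment. In the sketch below (class and method names are illustrative), the same loop runs with and without the explicit skip. For this particular filter both produce the same output, because `codePointAt` on a lone low surrogate returns a value in 0xDC00-0xDFFF, which the filter rejects.

```java
// Illustrative sketch: class and method names are assumptions.
final class SkipDemo {
    // XML 1.0 Char ranges, applied to a full code point.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String filter(String text, boolean skipLowSurrogate) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            int codePoint = text.codePointAt(i);
            if (skipLowSurrogate && codePoint > 0xFFFF) {
                i++; // step over the second half of the pair
            }
            if (isXml10Char(codePoint)) {
                sb.appendCodePoint(codePoint);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b\u0000c"; // 'a', U+1F600, 'b', NUL, 'c'
        System.out.println(filter(s, true).equals(filter(s, false))); // true
    }
}
```

A filter that accepted values in the surrogate range would no longer have this property, which is the reuse risk the answer points out.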
String xmlEscapeText(String t) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < t.length(); i++) {
        char c = t.charAt(i);
        switch (c) {
            case '<': sb.append("&lt;"); break;
            case '>': sb.append("&gt;"); break;
            case '\"': sb.append("&quot;"); break;
            case '&': sb.append("&amp;"); break;
            case '\'': sb.append("&apos;"); break;
            default:
                if (c > 0x7e) {
                    // Write non-ASCII characters as numeric character references.
                    sb.append("&#" + ((int) c) + ";");
                } else {
                    sb.append(c);
                }
        }
    }
    return sb.toString();
}
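One caveat with the loop above: it walks the string `char` by `char`, so a supplementary character would be written as two numeric references to surrogate values, which are not legal XML characters. A possible variant, sketched here under the assumption that one numeric reference per code point is wanted, iterates by code point instead. The names `XmlEscapeDemo` and `xmlEscapeCodePoints` are hypothetical.

```java
// Illustrative sketch: class and method names are assumptions.
final class XmlEscapeDemo {
    static String xmlEscapeCodePoints(String t) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.length(); ) {
            int cp = t.codePointAt(i);
            switch (cp) {
                case '<':  sb.append("&lt;"); break;
                case '>':  sb.append("&gt;"); break;
                case '"':  sb.append("&quot;"); break;
                case '&':  sb.append("&amp;"); break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (cp > 0x7e) {
                        // One decimal reference per code point, including supplementary ones.
                        sb.append("&#").append(cp).append(';');
                    } else {
                        sb.appendCodePoint(cp);
                    }
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(xmlEscapeCodePoints("<a & \"b\">"));
        System.out.println(xmlEscapeCodePoints("\uD83D\uDE00")); // U+1F600 as a single reference
    }
}
```

Like the original, this only escapes; it does not remove characters that XML forbids.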
Answer by Roger F. Gay
From Best way to encode text data for XML in Java?
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches numeric character references: group 1 holds hex digits, group 2 holds decimal digits.
public final Pattern XML_ENTITY_PATTERN = Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

/**
 * Remove problematic xml entities from the xml string so that you can parse it with java DOM / SAX libraries.
 */
String getCleanedXml(String xmlString) {
    Matcher m = XML_ENTITY_PATTERN.matcher(xmlString);
    Set<String> replaceSet = new HashSet<>();
    while (m.find()) {
        String group = m.group(1);
        int val;
        if (group != null) {
            val = Integer.parseInt(group, 16);
            if (isInvalidXmlChar(val)) {
                replaceSet.add("&#x" + group + ";");
            }
        } else if ((group = m.group(2)) != null) {
            val = Integer.parseInt(group);
            if (isInvalidXmlChar(val)) {
                replaceSet.add("&#" + group + ";");
            }
        }
    }
    String cleanedXmlString = xmlString;
    for (String replacer : replaceSet) {
        // The entity text is literal, so a plain replace is enough.
        cleanedXmlString = cleanedXmlString.replace(replacer, "");
    }
    return cleanedXmlString;
}

private boolean isInvalidXmlChar(int val) {
    if (val == 0x9 || val == 0xA || val == 0xD ||
            val >= 0x20 && val <= 0xD7FF ||
            val >= 0xE000 && val <= 0xFFFD ||
            val >= 0x10000 && val <= 0x10FFFF) {
        return false;
    }
    return true;
}
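The same idea can also be expressed as a single pass over the input using `Matcher.appendReplacement` and `Matcher.appendTail`, which avoids building a set and rescanning the string for each entry. This is only a sketch under the assumption that the XML 1.0 character ranges apply. The class and method names are illustrative, and treating a number too large to parse as invalid is a choice made here, not something the answer specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: class and method names are assumptions.
final class EntityCleaner {
    // Group 1 holds hex digits, group 2 holds decimal digits.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String removeInvalidEntities(String xml) {
        Matcher m = NUMERIC_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean valid;
            try {
                int cp = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                valid = isValidXml10Char(cp);
            } catch (NumberFormatException e) {
                valid = false; // too many digits to fit in an int, so not a code point
            }
            // Keep valid references unchanged; replace invalid ones with nothing.
            m.appendReplacement(out, valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeInvalidEntities("<a>&#2;ok&#x41;&#xB;</a>"));
    }
}
```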
Answer by Roger F. Gay
If you want to store text elements with the forbidden characters in XML-like form, you can use XPL instead. The dev-kit provides concurrent XPL to XML and XML processing, which means no time cost to the translation from XPL to XML. Or, if you don't need the full power of XML (namespaces), you can just use XPL.
Answer by Nicholas DiPiazza
All the answers so far only replace the characters themselves. But sometimes an XML document will contain invalid XML entity sequences, which result in errors. For example, if you have &#2; in your XML, a Java XML parser will throw Illegal character entity: expansion character (code 0x2 at ...
Here is a simple Java program that can replace those invalid entity sequences.
// xmlData is an existing String variable holding the input.
xmlData = xmlData.codePoints()
        .filter(c -> isValidXMLChar(c))
        .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
        .toString();

private boolean isValidXMLChar(int c) {
    return (c == 0x9) ||
            (c == 0xA) ||
            (c == 0xD) ||
            ((c >= 0x20) && (c <= 0xD7FF)) ||
            ((c >= 0xE000) && (c <= 0xFFFD)) ||
            ((c >= 0x10000) && (c <= 0x10FFFF));
}