Filtering illegal XML characters in Java
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Original: http://stackoverflow.com/questions/2897085/
Asked by Grzegorz Oledzki
The XML spec defines a subset of Unicode characters that are allowed in XML documents: http://www.w3.org/TR/REC-xml/#charsets.
How do I filter out these characters from a String in Java?
A simple test case:
Assert.assertEquals("", filterIllegalXML("" + Character.valueOf((char) 2)));
Accepted answer by ZZ Coder
It's not trivial to find all the invalid chars for XML. You need to call or reimplement XMLChar.isInvalid() from Xerces:
http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm
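For readers without Xerces on the classpath, here is a minimal sketch of what "reimplementing" the check could look like. It assumes the XML 1.0 Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]); the class name XmlCharFilter is made up for illustration, and filterIllegalXML is simply the name used in the question's test case. It is not the Xerces implementation.

```java
final class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper named after the question's test case:
    // copies only the valid code points into the result.
    static String filterIllegalXML(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String filtered = filterIllegalXML("a" + (char) 2 + "b");
        System.out.println(filtered.length());
    }
}
```

Iterating by code point rather than by char keeps valid surrogate pairs intact, while a lone surrogate falls outside every range and is dropped.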
Answer by Bozho
Using StringEscapeUtils.escapeXml(xml) from commons-lang will escape, not filter, the characters.
Answer by Tom Brito
You can use a regex (regular expression) to do the work; see an example in the comments here
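As a rough illustration of the regex approach (not the example from the linked comments, which is not reproduced on this page), the sketch below removes everything outside the XML 1.0 Char production with a negated character class. The class and method names are assumptions made for this example.

```java
import java.util.regex.Pattern;

final class XmlRegexFilter {

    // Matches any code point that is NOT allowed by the XML 1.0 Char production.
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Hypothetical helper: strips every illegal character from the input.
    static String stripIllegal(String in) {
        return ILLEGAL_XML10.matcher(in).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripIllegal("a" + (char) 2 + "b"));
    }
}
```

Precompiling the Pattern avoids recompiling the expression on every call, which String.replaceAll would otherwise do.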
Answer by Stephen C
This page includes a Java method for stripping out invalid XML characters by testing whether each character is within spec, though it doesn't check for highly discouraged characters.
Incidentally, escaping the characters is not a solution, since the XML 1.0 and 1.1 specs do not allow the invalid characters in escaped form either.
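The point about escaped forms can be checked against the JDK's built-in parser. The sketch below is only a demonstration under that assumption (the default JAXP DocumentBuilder, parsing an XML 1.0 document); the class and method names are invented for the example. A character reference to char 2 should be rejected as not well-formed, while a reference to a tab is accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class EscapedInvalidCharDemo {

    // Returns true if the document parses, false if the parser rejects it.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Not well-formed, e.g. an illegal character reference.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<a>&#9;</a>"));
        System.out.println(parses("<a>&#2;</a>"));
    }
}
```

Note that the default error handler may also print a "[Fatal Error]" line to stderr for the rejected document.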
Answer by gomesla
Here's a solution that takes care of the raw char as well as the escaped char in the stream, and works with StAX or SAX. It needs extending for the other invalid chars, but you get the idea.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.io.Writer;
import org.apache.commons.io.IOUtils;
import org.apache.xerces.util.XMLChar;
public class IgnoreIllegalCharactersXmlReader extends Reader {

    private final BufferedReader underlyingReader;
    private StringBuilder buffer = new StringBuilder(4096);
    private boolean eos = false;

    public IgnoreIllegalCharactersXmlReader(final InputStream is) throws UnsupportedEncodingException {
        underlyingReader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    }

    private void fillBuffer() throws IOException {
        final String line = underlyingReader.readLine();
        if (line == null) {
            eos = true;
            return;
        }
        buffer.append(line);
        buffer.append('\n');
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (buffer.length() == 0 && eos) {
            return -1;
        }
        int satisfied = 0;
        int currentOffset = off;
        while (!eos && buffer.length() < len) {
            fillBuffer();
        }
        while (satisfied < len && buffer.length() > 0) {
            char ch = buffer.charAt(0);
            final char nextCh = buffer.length() > 1 ? buffer.charAt(1) : '\0';
            if (ch == '&' && nextCh == '#') {
                final StringBuilder entity = new StringBuilder();
                // Since we're reading lines it's safe to assume entity is all
                // on one line so next char will/could be the hex char
                int index = 0;
                char entityCh = '\0';
                // Read whole entity (stop at the end of the buffer if no ';' is found)
                while (entityCh != ';' && index < buffer.length()) {
                    entityCh = buffer.charAt(index++);
                    entity.append(entityCh);
                }
                // if it's bad get rid of it and clean it from the buffer and point to next valid char
                // (only the escaped form of char 2 is handled; extend for other invalid chars)
                if (entity.toString().equals("&#2;")) {
                    buffer.delete(0, entity.length());
                    continue;
                }
            }
            if (XMLChar.isValid(ch)) {
                satisfied++;
                cbuf[currentOffset++] = ch;
            }
            buffer.deleteCharAt(0);
        }
        return satisfied;
    }

    @Override
    public void close() throws IOException {
        underlyingReader.close();
    }

    public static void main(final String[] args) {
        final File file = new File(
                <XML>); // placeholder for the input XML file path
        final File outFile = new File(file.getParentFile(), file.getName()
                .replace(".xml", ".cleaned.xml"));
        Reader r = null;
        Writer w = null;
        try {
            r = new IgnoreIllegalCharactersXmlReader(new FileInputStream(file));
            w = new OutputStreamWriter(new FileOutputStream(outFile), "UTF-8");
            IOUtils.copyLarge(r, w);
            w.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(r);
            IOUtils.closeQuietly(w);
        }
    }
}
Answer by rogerdpack
boolean isAllValidXmlChars(String s) {
    // xml 1.1 spec http://en.wikipedia.org/wiki/Valid_characters_in_XML
    // every character must fall in one of the valid ranges
    if (!s.matches("[\\u0001-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]*")) {
        // not in valid ranges
        return false;
    }
    if (s.matches("(?s).*[\\u0001-\\u0008\\u000b-\\u000c\\u000E-\\u001F\\u007F-\\u0084\\u0086-\\u009F].*")) {
        // a control character
        return false;
    }
    // "Characters allowed but discouraged"
    if (s.matches(
            "(?s).*[\\uFDD0-\\uFDEF\\x{1FFFE}-\\x{1FFFF}\\x{2FFFE}-\\x{2FFFF}\\x{3FFFE}-\\x{3FFFF}"
            + "\\x{4FFFE}-\\x{4FFFF}\\x{5FFFE}-\\x{5FFFF}\\x{6FFFE}-\\x{6FFFF}\\x{7FFFE}-\\x{7FFFF}"
            + "\\x{8FFFE}-\\x{8FFFF}\\x{9FFFE}-\\x{9FFFF}\\x{AFFFE}-\\x{AFFFF}\\x{BFFFE}-\\x{BFFFF}"
            + "\\x{CFFFE}-\\x{CFFFF}\\x{DFFFE}-\\x{DFFFF}\\x{EFFFE}-\\x{EFFFF}\\x{FFFFE}-\\x{FFFFF}"
            + "\\x{10FFFE}-\\x{10FFFF}].*"
    )) {
        return false;
    }
    return true;
}
Answer by stonar96
Use either escapeXml10 or escapeXml11. These functions escape characters like ", &, ', <, > and a few more, but also filter invalid characters.
For those who don't want to filter invalid characters but escape them with a different escaping system, look at my answer here https://stackoverflow.com/a/59475093/3882565.