Notice: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/93655/

Time: 2020-08-11 08:13:08  Source: igfitidea

Stripping Invalid XML characters in Java

Tags: java, xml

Asked by Mason

I have an XML file that's the output from a database. I'm using the Java SAX parser to parse the XML and output it in a different format. The XML contains some invalid characters and the parser is throwing errors like 'Invalid Unicode character (0x5)'

Is there a good way to strip all these characters out besides pre-processing the file line by line and replacing them? So far I've run into three different invalid characters (0x5, 0x6 and 0x7). It's a ~4 GB database dump that we're going to process many times, so waiting an extra 30 minutes to run a pre-processor every time we get a new dump is going to be a pain, and this isn't the first time I've run into this issue.

Accepted answer by 18Rabbit

I haven't used this personally but Atlassian made a command line XML cleaner that may suit your needs (it was made mainly for JIRA but XML is XML):

1. Download atlassian-xml-cleaner-0.1.jar
2. Open a DOS console or shell, and locate the XML or ZIP backup file on your computer, here assumed to be called data.xml
3. Run: java -jar atlassian-xml-cleaner-0.1.jar data.xml > data-clean.xml

This will write a copy of data.xml to data-clean.xml, with invalid characters removed.

Answer by scotty

Is it possible that your invalid characters are present only within the values and not in the tags themselves, i.e. the XML notionally meets the schema but the values have not been properly sanitized? If so, what about subclassing InputStream to create a CleansingInputStream that replaces your invalid characters with their XML equivalents?
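
A minimal sketch of this idea, under a few assumptions: it works on a Reader rather than an InputStream (so the filter sees decoded characters instead of raw bytes), the class name CleansingReader is made up for illustration, and it drops the offending characters instead of replacing them, because control characters such as 0x5 have no legal representation in XML 1.0, not even as character references. Surrogates are passed through untouched so that valid pairs stay intact; lone surrogates are therefore not caught here.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper: a Reader that silently drops characters not allowed by XML 1.0. */
final class CleansingReader extends FilterReader {

    CleansingReader(Reader in) {
        super(in);
    }

    /** True for chars allowed by XML 1.0; surrogates pass so that pairs stay intact. */
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip invalid characters until a valid one or end of stream is reached.
        while ((c = super.read()) != -1 && !isAllowed((char) c)) {
            // dropped
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n;
        // Loop so that a chunk consisting only of invalid characters
        // is not reported to the caller as "0 chars read".
        while ((n = super.read(buf, off, len)) != -1) {
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (isAllowed(buf[i])) {
                    buf[kept++] = buf[i]; // compact valid chars in place
                }
            }
            if (kept > off) {
                return kept - off;
            }
        }
        return -1;
    }
}
```

Such a reader could be wrapped in an org.xml.sax.InputSource and handed to the SAX parser, which would clean the dump while it is being parsed and avoid a separate pre-processing pass over the file.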

Answer by Confusion

Your problem does not concern XML: it concerns character encodings. What it comes down to is that every string, be it XML or otherwise, consists of bytes and you cannot know what characters these bytes represent, unless you are told what character encoding the string has. If, for instance, the supplier tells you it's UTF-8 and it's actually something else, you are bound to run into problems. In the best case, everything works, but some bytes are translated into 'wrong' characters. In the worst case you get errors like the one you encountered.

Actually, your problem is even worse: your string contains byte sequences that do not represent characters in any character encoding. There is no text-handling tool, let alone an XML parser, that can help you here. This needs byte-level cleaning up.
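
One way to test this diagnosis is to decode the bytes strictly and see whether the decoder complains. The sketch below assumes the dump is supposed to be UTF-8; the class and method names are made up for illustration. Note that a single byte 0x05 is well-formed UTF-8 (it decodes to U+0005), so data can pass this check and still be rejected by an XML parser, in which case the problem lies with the characters rather than the encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for checking whether a byte array is well-formed UTF-8. */
final class Utf8Check {

    static boolean isWellFormedUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For a multi-gigabyte dump the same decoder settings would have to be applied in a streaming fashion rather than to one in-memory array.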

Answer by ogrisel

I use the following regexp, which seems to work as expected on JDK 6:

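import java.util.regex.Pattern;

// The BMP ranges use doubled backslashes so that the regex engine, not the compiler, parses them.
// The last range is written as literal surrogate pairs and covers U+10000 to U+10FFFF.
Pattern INVALID_XML_CHARS = Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]");
...
INVALID_XML_CHARS.matcher(stringToCleanup).replaceAll("");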

In JDK7 it might be possible to use the notation \x{10000}-\x{10FFFF} for the last range that lies outside of the BMP instead of the \uD800\uDC00-\uDBFF\uDFFF notation that is not as simple to understand.
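
As a sketch of what that could look like on Java 7 or later, here is the same character class written entirely with the \x{...} code point notation. The class and method names are made up for illustration, and the ranges are the ones from the regexp above.

```java
import java.util.regex.Pattern;

/** Hypothetical helper using the code point notation available since Java 7. */
final class XmlCharsJdk7 {

    // Everything that is NOT tab, LF, CR, or one of the XML 1.0 character ranges.
    static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    static String strip(String s) {
        return INVALID_XML_CHARS.matcher(s).replaceAll("");
    }
}
```

Because the pattern is compiled once and kept in a constant, it can be reused for every string that needs cleaning.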

Answer by jankar

I have a similar problem when parsing the content of Australian export tariffs into an XML document. I cannot use some of the solutions suggested here, such as:
- Using an external tool (a jar) invoked from the command line.
- Asking Australian Customs to clean up the source file.

The only method to solve this problem at the moment is to iterate through the entire content of the source file, character by character, and test whether each character falls outside the ASCII range 0x00 to 0x1F inclusive. It can be done, but I was wondering if there is a better way using the Java methods of type String.

EDIT: I found a solution that may be useful to others: use the Java method String#replaceAll to replace or remove any undesirable characters in the XML document.

Example code (I removed some necessary statements to avoid clutter):

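BufferedReader reader = null;
...
// Backslashes are doubled so the regex engine receives \x00-\x1F; a single backslash does not compile.
// Note that this range also removes tab (0x09), which is a valid XML character.
String line = reader.readLine().replaceAll("[\\x00-\\x1F]", "");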

In this example I remove (i.e. replace with an empty string) non-printable characters within the range 0x00 to 0x1F inclusive. You can change the second argument of #replaceAll() to replace those characters with whatever string your application requires.
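
A possible variation, sketched under two assumptions: that tab, line feed and carriage return should survive (XML allows them), and that the pattern should be compiled only once, since String#replaceAll recompiles its regex on every call, which adds up over a large file. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: strips control characters but keeps tab, LF and CR. */
final class ControlCharFilter {

    // 0x00-0x1F minus tab (0x09), line feed (0x0A) and carriage return (0x0D).
    static final Pattern CONTROL_CHARS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    static String clean(String line, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return CONTROL_CHARS.matcher(line)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }
}
```

Calling clean(line, "") removes the characters, while any other second argument substitutes that string for each of them.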

Answer by Bozho

I used Xalan's org.apache.xml.utils.XMLChar class:

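import org.apache.xml.utils.XMLChar;

public static String stripInvalidXmlCharacters(String input) {
    StringBuilder sb = new StringBuilder(input.length());
    // Walk by code point so that characters above U+FFFF (surrogate pairs) are tested as a whole.
    for (int i = 0; i < input.length(); ) {
        int cp = input.codePointAt(i);
        if (XMLChar.isValid(cp)) {
            sb.appendCodePoint(cp);
        }
        i += Character.charCount(cp);
    }

    return sb.toString();
}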