Filtering illegal XML characters in Java

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/2897085/

Date: 2020-08-13 14:14:08  Source: igfitidea

Tags: java, xml, unicode

Asked by Grzegorz Oledzki

The XML spec defines a subset of Unicode characters which are allowed in XML documents: http://www.w3.org/TR/REC-xml/#charsets.

How do I filter out the characters outside that subset from a String in Java?

A simple test case:

  Assert.assertEquals("", filterIllegalXML("" + Character.valueOf((char) 2)));

Accepted answer by ZZ Coder

It's not trivial to find out all the invalid chars for XML. You need to call or reimplement XMLChar.isInvalid() from Xerces:

http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm
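For XML 1.0, reimplementing the check is manageable, because the Char production in the spec linked in the question lists only a few ranges. The sketch below is not the Xerces code: the class name `XmlCharFilter` and the helper `isValidXml10` are made up for illustration, and it deliberately ignores the XML 1.1 rules and the "discouraged" characters.

```java
final class XmlCharFilter {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input without the code points XML 1.0 forbids. */
    static String filterIllegalXML(String in) {
        if (in == null || in.isEmpty()) {
            return in;
        }
        StringBuilder out = new StringBuilder(in.length());
        // Walk by code point so a surrogate pair is tested as one character;
        // an unpaired surrogate is returned as-is by codePointAt and dropped.
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, the test case from the question (a string holding only the character with code 2) comes back as the empty string.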

Answered by Bozho

Using StringEscapeUtils.escapeXml(xml) from commons-lang will escape, not filter, the characters.

Answered by Tom Brito

You can use a regex (regular expression) to do the work; see an example in the comments here
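A regex along these lines might look as follows. This is a sketch rather than the example from the linked comments: the class name `XmlRegexFilter` is hypothetical, and the character class is simply the negation of the XML 1.0 Char production.

```java
import java.util.regex.Pattern;

final class XmlRegexFilter {

    // Matches anything that is NOT tab, LF, CR, or one of the ranges XML 1.0 allows.
    // The \x{...} form lets the class cover supplementary code points (Java 7+).
    private static final Pattern ILLEGAL_XML10 = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Removes every character the pattern above classifies as illegal. */
    static String filterIllegalXML(String in) {
        return ILLEGAL_XML10.matcher(in).replaceAll("");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the expression on every call, which String.replaceAll would do.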

Answered by Stephen C

This page includes a Java method for stripping out invalid XML characters by testing whether each character is within spec, though it doesn't check for highly discouraged characters.

Incidentally, escaping the characters is not a solution, since the XML 1.0 and 1.1 specs do not allow the invalid characters in escaped form either.
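The point about escaped forms can be checked with the parser that ships with the JDK. The following is a minimal sketch for an XML 1.0 document (no XML declaration, so 1.0 is assumed); the class name `EscapedCharCheck` and the method `parses` are hypothetical helpers, not part of any library.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class EscapedCharCheck {

    /** True if the JDK's default DOM parser accepts the text as well-formed XML. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors arrive as SAXParseException, a SAXException subclass.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string with the default configuration.
            throw new IllegalStateException(e);
        }
    }
}
```

Under these assumptions, a document such as `<a>&#9;</a>` should parse, while `<a>&#2;</a>` should be rejected even though the control character is written as a character reference. The default error handler may also print the fatal error to standard error.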

Answered by gomesla

Here's a solution that takes care of the raw char as well as the escaped char in the stream; it works with StAX or SAX. It needs extending for the other invalid chars, but you get the idea.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.io.Writer;

import org.apache.commons.io.IOUtils;
import org.apache.xerces.util.XMLChar;

public class IgnoreIllegalCharactersXmlReader extends Reader {

    private final BufferedReader underlyingReader;
    private StringBuilder buffer = new StringBuilder(4096);
    private boolean eos = false;

    public IgnoreIllegalCharactersXmlReader(final InputStream is) throws UnsupportedEncodingException {
        underlyingReader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
    }

    private void fillBuffer() throws IOException {
        final String line = underlyingReader.readLine();
        if (line == null) {
            eos = true;
            return;
        }
        buffer.append(line);
        buffer.append('\n');
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if(buffer.length() == 0 && eos) {
            return -1;
        }
        int satisfied = 0;
        int currentOffset = off;
        while (false == eos && buffer.length() < len) {
            fillBuffer();
        }
        while (satisfied < len && buffer.length() > 0) {
            char ch = buffer.charAt(0);
            final char nextCh = buffer.length() > 1 ? buffer.charAt(1) : '\0';
            if (ch == '&' && nextCh == '#') {
                final StringBuilder entity = new StringBuilder();
                // Since we're reading lines it's safe to assume entity is all
                // on one line so next char will/could be the hex char
                int index = 0;
                char entityCh = '\0';
                // Read whole entity (stop at the end of the buffer if no ';' follows)
                while (entityCh != ';' && index < buffer.length()) {
                    entityCh = buffer.charAt(index++);
                    entity.append(entityCh);
                }
                // if it's bad get rid of it and clean it from the buffer and point to next valid char
                if (entity.toString().equals("&#2;")) {
                    buffer.delete(0, entity.length());
                    continue;
                }
            }
            if (XMLChar.isValid(ch)) {
                satisfied++;
                cbuf[currentOffset++] = ch;
            }
            buffer.deleteCharAt(0);
        }
        return satisfied;
    }

    @Override
    public void close() throws IOException {
        underlyingReader.close();
    }

    public static void main(final String[] args) {
        // <XML> is a placeholder for the path of the input file
        final File file = new File( <XML>);
        final File outFile = new File(file.getParentFile(), file.getName()
                .replace(".xml", ".cleaned.xml"));
        Reader r = null;
        Writer w = null;
        try {
            r = new IgnoreIllegalCharactersXmlReader(new FileInputStream(file));
            w = new OutputStreamWriter(new FileOutputStream(outFile), "UTF-8");
            IOUtils.copyLarge(r, w);
            w.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeQuietly(r);
            IOUtils.closeQuietly(w);
        }
    }
}

Answered by rogerdpack

Loosely based on a comment in the link from Stephen C's answer, and on Wikipedia's summary of the XML 1.1 spec, here's a Java method that uses regular expressions to test whether a string contains illegal chars:

boolean isAllValidXmlChars(String s) {
  // xml 1.1 spec http://en.wikipedia.org/wiki/Valid_characters_in_XML
  if (!s.matches("[\\u0001-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]*")) {
    // at least one char is not in the valid ranges
    return false;
  }
  // (?s) lets '.' match line terminators, so the whole string is scanned
  if (s.matches("(?s).*[\\u0001-\\u0008\\u000b-\\u000c\\u000E-\\u001F\\u007F-\\u0084\\u0086-\\u009F].*")) {
    // contains a control character
    return false;
  }

  // "Characters allowed but discouraged"
  if (s.matches(
    "(?s).*[\\uFDD0-\\uFDEF\\x{1FFFE}-\\x{1FFFF}\\x{2FFFE}-\\x{2FFFF}\\x{3FFFE}-\\x{3FFFF}\\x{4FFFE}-\\x{4FFFF}\\x{5FFFE}-\\x{5FFFF}\\x{6FFFE}-\\x{6FFFF}\\x{7FFFE}-\\x{7FFFF}\\x{8FFFE}-\\x{8FFFF}\\x{9FFFE}-\\x{9FFFF}\\x{AFFFE}-\\x{AFFFF}\\x{BFFFE}-\\x{BFFFF}\\x{CFFFE}-\\x{CFFFF}\\x{DFFFE}-\\x{DFFFF}\\x{EFFFE}-\\x{EFFFF}\\x{FFFFE}-\\x{FFFFF}\\x{10FFFE}-\\x{10FFFF}].*"
  )) {
    return false;
  }

  return true;
}

Answered by stonar96

Use either escapeXml10 or escapeXml11. These functions escape characters like ", &, ', <, > and a few more, but also filter invalid characters.

For those who don't want to filter invalid characters but escape them with a different escaping system, look at my answer here: https://stackoverflow.com/a/59475093/3882565.