Remove non-UTF-8 characters from XML with declared encoding=utf-8 - Java

Question

Asked by St Nietzke

I have to handle this scenario in Java:

I'm getting a request in XML form from a client, with declared encoding=utf-8. Unfortunately, it may contain non-UTF-8 characters, and there is a requirement to remove these characters from the XML on my (legacy) side.

Let's consider an example where this invalid XML contains ￡ (pound).

1) I get xml as java String with ￡ in it (I don't have access to interface right now, but I probably get xml as a java String). Can I use replaceAll(￡, "") to get rid of this character? Any potential issues?

2) I get xml as an array of bytes - how to handle this operation safely in that case?

Answer 1

Answered by BalusC

1) I get xml as java String with ￡ in it (I don't have access to interface right now, but I probably get xml as a java String). Can I use replaceAll(￡, "") to get rid of this character?

I am assuming that you rather mean that you want to get rid of non-ASCII characters, because you're talking about a "legacy" side. You can get rid of anything outside the printable ASCII range using the following regex:

string = string.replaceAll("[^\\x20-\\x7e]", "");
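
As a minimal sketch of what that regex does in practice (the class and method names are made up for illustration): note that it also strips tabs and line breaks, because they lie below \x20.

```java
public class AsciiFilter {
    // Removes every char outside the printable ASCII range 0x20-0x7E.
    // Caution: this also removes tabs, carriage returns and line feeds.
    static String stripNonPrintableAscii(String input) {
        return input.replaceAll("[^\\x20-\\x7e]", "");
    }

    public static void main(String[] args) {
        String xml = "<price>\u00A3 10</price>";
        System.out.println(stripNonPrintableAscii(xml)); // prints <price> 10</price>
    }
}
```

If whitespace inside the XML matters, the character class would have to allow \x09, \x0A and \x0D as well.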

2) I get xml as an array of bytes - how to handle this operation safely in that case?

You need to wrap the byte[] in a ByteArrayInputStream so that you can read it as a UTF-8 encoded character stream using an InputStreamReader, in which you specify the encoding, and then use a BufferedReader to read it line by line.

E.g.

BufferedReader reader = null;
try {
    reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes), "UTF-8"));
    for (String line; (line = reader.readLine()) != null;) {
        // Strip everything outside the printable ASCII range.
        line = line.replaceAll("[^\\x20-\\x7e]", "");
        // ...
    }
    // ...
} finally {
    if (reader != null) reader.close();
}

Answer 2

Answered by Sean Owen

UTF-8 is an encoding; Unicode is a character set. But the GBP symbol is most definitely in the Unicode character set and therefore most certainly representable in UTF-8.
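
To illustrate the point, here is a small sketch (class and method names are hypothetical) that prints the UTF-8 bytes of the pound sign U+00A3 and of the fullwidth pound sign U+FFE1. Both encode without trouble.

```java
import java.nio.charset.StandardCharsets;

public class PoundBytes {
    // Returns the UTF-8 bytes of the input as upper-case hex, separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00A3")); // prints C2 A3
        System.out.println(utf8Hex("\uFFE1")); // prints EF BF A1
    }
}
```

A lone byte 0xA3, which is how ISO-8859-1 encodes the pound sign, is what would be invalid inside a UTF-8 document.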

If you do in fact mean UTF-8, and you are actually trying to remove byte sequences that are not the valid encoding of a character in UTF-8, then...

CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
ByteBuffer bytes = ...;
CharBuffer parsed = utf8Decoder.decode(bytes);
...
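
A self-contained version of the same idea might look like the sketch below. The class name, method name and sample bytes are assumptions made for illustration; the decoder calls are the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Cleaner {
    // Decodes the bytes as UTF-8 and silently drops malformed sequences.
    static String decodeDroppingInvalid(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            return decoder.decode(ByteBuffer.wrap(input)).toString();
        } catch (CharacterCodingException e) {
            // Not expected, because both error actions are IGNORE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xA3 on its own is the ISO-8859-1 pound sign, which is not valid UTF-8.
        byte[] input = {'<', 'a', '>', (byte) 0xA3, '1', '0', '<', '/', 'a', '>'};
        System.out.println(decodeDroppingInvalid(input)); // prints <a>10</a>
    }
}
```

Using CodingErrorAction.REPLACE instead of IGNORE would keep a U+FFFD marker where bytes were dropped, which can make later debugging easier.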

Answer 3

Answered by Kapil

I faced the same problem while reading files from a local directory and tried this:

BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "UTF-8"));
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document xmlDom = db.parse(new InputSource(in));

You might have to use your network input stream instead of FileInputStream.

-- Kapil

Answer 4

Answered by melih onem

"test text".replaceAll("[^\u0000-\uFFFF]", "");

This code removes all characters that take 4 bytes in UTF-8 (those above U+FFFF) from the string. This can be needed in some cases, for example when inserting into a MySQL InnoDB varchar column.
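
A runnable sketch of that one-liner, with a hypothetical class and method name, and an emoji (U+1F615, written as a surrogate pair) as sample input:

```java
public class BmpOnly {
    // Removes every code point above U+FFFF, i.e. those that need 4 bytes in UTF-8.
    static String stripSupplementary(String input) {
        return input.replaceAll("[^\u0000-\uFFFF]", "");
    }

    public static void main(String[] args) {
        String text = "ok \uD83D\uDE15 done";
        // The emoji is removed; the spaces on both sides of it remain.
        System.out.println(stripSupplementary(text));
    }
}
```

Characters inside the Basic Multilingual Plane, such as the pound sign, are left alone by this filter.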

Answer 5

Answered by Thorbjørn Ravn Andersen

Note that the first step should be that you ask the creator of the XML (which is most likely a home grown "just print data" XML generator) to ensure that their XML is correct before sending to you. The simplest possible test if they use Windows is to ask them to view it in Internet Explorer and see the parsing error at the first offending character.

While they fix that, you can simply write a small program that changes the header part to declare that the encoding is ISO-8859-1 instead:

<?xml version="1.0" encoding="iso-8859-1" ?>

and leave the rest untouched.
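
Such a small program could be sketched as follows. The class and method names are hypothetical, and the regex assumes the document starts with an ordinary XML declaration that already carries an encoding attribute.

```java
import java.nio.charset.StandardCharsets;

public class RedeclareEncoding {
    // Swaps the declared encoding in the XML declaration; the document body is left as-is.
    static byte[] declareLatin1(byte[] xml) {
        // ISO-8859-1 maps each byte to one char and back, so the round trip alters no byte.
        String text = new String(xml, StandardCharsets.ISO_8859_1);
        String patched = text.replaceFirst(
                "(<\\?xml[^>]*?encoding\\s*=\\s*[\"'])[^\"']*([\"'])",
                "$1iso-8859-1$2");
        return patched.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] in = "<?xml version=\"1.0\" encoding=\"utf-8\"?><a>x</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(declareLatin1(in), StandardCharsets.ISO_8859_1));
    }
}
```

This only helps if the offending bytes really are ISO-8859-1. Any genuine multi-byte UTF-8 characters in the same document would then be read as two or more Latin-1 characters.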

Answer 6

Answered by despot

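Once you convert the byte array to a String in Java, you get a UTF-16 encoded string, because that is how Java represents strings internally. The regex below describes the byte structure of UTF-8 sequences, so it only makes sense on text in which each char still stands for one raw byte. With that in place, non-UTF-8 sequences can be removed with the following code: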

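// Each char below stands for one raw byte (0x00-0xFF). Needs java.util.regex.Matcher and java.util.regex.Pattern.
String[] values = {"\u00F0\u009F\u0098\u0095", "\u00F0\u009F\u0091\u008C", "/*", "look into my eyes ?.?", "fkdjsf ksdjfslk", "\u00F0\u0080\u0080\u0080", "aa \u00F0\u009F\u0098\u0095 aa"};
Pattern validUtf8 = Pattern.compile(
                "[\\x00-\\x7F]|" + //single-byte sequences   0xxxxxxx
                "[\\xC0-\\xDF][\\x80-\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                "[\\xE0-\\xEF][\\x80-\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                "[\\xF0-\\xF7][\\x80-\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
        );
for (int i = 0; i < values.length; i++) {
    // Keep only the sequences that match the pattern; everything else is dropped.
    StringBuilder kept = new StringBuilder();
    for (Matcher m = validUtf8.matcher(values[i]); m.find();) {
        kept.append(m.group());
    }
    System.out.println(kept);
}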

or, if you want to check whether a string consists only of well-formed UTF-8 sequences, you would use Pattern.matches like:

// Each char below stands for one raw byte (0x00-0xFF).
String[] values = {"\u00F0\u009F\u0098\u0095", "\u00F0\u009F\u0091\u008C", "/*", "look into my eyes ?.?", "fkdjsf ksdjfslk", "\u00F0\u0080\u0080\u0080", "aa \u00F0\u009F\u0098\u0095 aa"};
for (int i = 0; i < values.length; i++) {
    // true when the whole value is made up of sequences matching the pattern
    System.out.println(Pattern.matches(
                    "(?:" +
                    "[\\x00-\\x7F]|" + //single-byte sequences   0xxxxxxx
                    "[\\xC0-\\xDF][\\x80-\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\xE0-\\xEF][\\x80-\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\xF0-\\xF7][\\x80-\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                    + ")*"
            , values[i]));
}
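
The regex above only checks the shape of the byte sequences, so it lets through overlong forms such as F0 80 80 80. As an alternative, and not what this answer originally proposed, the JDK's own decoder can do a strict check on the raw bytes. The class and method names below are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Validator {
    // Returns true only if the whole byte array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] emoji = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x95};
        byte[] overlong = {(byte) 0xF0, (byte) 0x80, (byte) 0x80, (byte) 0x80};
        System.out.println(isValidUtf8(emoji));    // prints true
        System.out.println(isValidUtf8(overlong)); // prints false
    }
}
```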

If you have the byte array available, then you could filter it even more properly with:

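BufferedReader bufferedReader = null;
try {
    // ISO-8859-1 maps every byte to exactly one char, so the byte patterns below apply directly.
    bufferedReader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes), "ISO-8859-1"));
    Pattern validUtf8 = Pattern.compile(
                    "[\\x00-\\x7F]|" + //single-byte sequences   0xxxxxxx
                    "[\\xC0-\\xDF][\\x80-\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\xE0-\\xEF][\\x80-\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\xF0-\\xF7][\\x80-\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
            );
    for (String currentLine; (currentLine = bufferedReader.readLine()) != null;) {
        StringBuilder kept = new StringBuilder();
        for (Matcher m = validUtf8.matcher(currentLine); m.find();) {
            kept.append(m.group());
        }
        // Turn the surviving bytes back into text by decoding them as UTF-8.
        currentLine = new String(kept.toString().getBytes("ISO-8859-1"), "UTF-8");
    }
} finally {
    if (bufferedReader != null) bufferedReader.close();
}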

For making a whole web app UTF-8 compatible, see "How to get UTF-8 working in Java webapps" and "More on Byte Encodings and Strings".
