如何在 Java 中检查字节数组是否包含 Unicode 字符串?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1193200/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How can I check whether a byte array contains a Unicode string in Java?
提问 by Iain
Given a byte array that is either a UTF-8 encoded string or arbitrary binary data, what approaches can be used in Java to determine which it is?
给定一个字节数组,它要么是 UTF-8 编码的字符串,要么是任意的二进制数据,在 Java 中可以使用哪些方法来确定它是哪一种?
The array may be generated by code similar to:
该数组可以由类似于以下的代码生成:
byte[] utf8 = "Hello World".getBytes("UTF-8");
Alternatively it may have been generated by code similar to:
或者,它可能是由类似于以下内容的代码生成的:
byte[] messageContent = new byte[256];
for (int i = 0; i < messageContent.length; i++) {
messageContent[i] = (byte) i;
}
The key point is that we don't know what the array contains but need to find out in order to fill in the following function:
关键是我们不知道数组包含什么,但需要找出来填充以下函数:
public final String getString(final byte[] dataToProcess) {
// Determine whether dataToProcess contains arbitrary data or a UTF-8 encoded string
// If dataToProcess contains arbitrary data then we will BASE64 encode it and return.
// If dataToProcess contains an encoded string then we will decode it and return.
}
How would this be extended to also cover UTF-16 or other encoding mechanisms?
这将如何扩展以涵盖 UTF-16 或其他编码机制?
回答 by Michael Borgwardt
It's not possible to make that decision with full accuracy in all cases, because a UTF-8 encoded string is one kind of arbitrary binary data, but you can look for byte sequences that are invalid in UTF-8. If you find any, you know that it's not UTF-8.
不可能在所有情况下都完全准确地做出该决定,因为 UTF-8 编码的字符串是一种任意二进制数据,但您可以查找在 UTF-8中无效的字节序列。如果你找到了,你就知道它不是 UTF-8。
If your array is large enough, this should work out well, since it is very likely for such sequences to appear in "random" binary data such as compressed data or image files.
如果您的数组足够大,这应该很有效,因为此类序列很可能出现在“随机”二进制数据中,例如压缩数据或图像文件。
However, it is possible to get valid UTF-8 data that decodes to a totally nonsensical string of characters (probably from all kinds of different scripts). This is more likely with short sequences. If you're worried about that, you might have to do a closer analysis to see whether the characters that are letters all belong to the same code chart. Then again, this may yield false negatives when you have valid text input that mixes scripts.
但是,有可能出现有效的 UTF-8 数据,它解码出来却是一串完全没有意义的字符(可能混杂了各种不同的文字)。短序列更容易出现这种情况。如果您担心这一点,可能需要做更细致的分析,检查其中属于字母的字符是否都来自同一个代码表。不过,当有效的文本输入本身就混合了多种文字时,这又可能产生漏报。
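A minimal sketch of that first check in Java: a strict CharsetDecoder will report any invalid UTF-8 byte sequence for you (the helper name isValidUtf8 below is only illustrative):
下面是该检查在 Java 中的一个最小示例:严格模式的 CharsetDecoder 会替你报告任何无效的 UTF-8 字节序列(示例中的方法名 isValidUtf8 仅作示意):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

static boolean isValidUtf8(byte[] data)
{
    try {
        StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
        return true;   // every byte sequence decoded cleanly
    } catch (CharacterCodingException e) {
        return false;  // an invalid UTF-8 sequence was found
    }
}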
回答 by Alan Moore
Here's a way to use the UTF-8 "binary" regex from the W3C site
这是一种使用 W3C 站点上的 UTF-8“二进制”正则表达式的方法
import java.io.UnsupportedEncodingException;
import java.util.regex.Pattern;

static boolean looksLikeUTF8(byte[] utf8) throws UnsupportedEncodingException
{
    Pattern p = Pattern.compile("\\A(\n" +
        "  [\\x09\\x0A\\x0D\\x20-\\x7E]            # ASCII\n" +
        "| [\\xC2-\\xDF][\\x80-\\xBF]              # non-overlong 2-byte\n" +
        "| \\xE0[\\xA0-\\xBF][\\x80-\\xBF]         # excluding overlongs\n" +
        "| [\\xE1-\\xEC\\xEE\\xEF][\\x80-\\xBF]{2} # straight 3-byte\n" +
        "| \\xED[\\x80-\\x9F][\\x80-\\xBF]         # excluding surrogates\n" +
        "| \\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}      # planes 1-3\n" +
        "| [\\xF1-\\xF3][\\x80-\\xBF]{3}           # planes 4-15\n" +
        "| \\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}      # plane 16\n" +
        ")*\\z", Pattern.COMMENTS);

    String phonyString = new String(utf8, "ISO-8859-1");
    return p.matcher(phonyString).matches();
}
As originally written, the regex is meant to be used on a byte array, but you can't do that with Java's regexes; the target has to be something that implements the CharSequence interface (so a char[] is out, too). By decoding the byte[] as ISO-8859-1, you create a String in which each char has the same unsigned numeric value as the corresponding byte in the original array.
正如最初编写的那样,这个正则表达式本来是要用在字节数组上的,但 Java 的正则表达式做不到这一点;匹配目标必须实现 CharSequence 接口(所以 char[] 也不行)。通过把 byte[] 按 ISO-8859-1 解码,你得到的字符串中每个 char 的无符号数值都与原始数组中对应的字节相同。
As others have pointed out, tests like this can only tell you the byte[] could contain UTF-8 text, not that it does. But the regex is so exhaustive, it seems extremely unlikely that raw binary data could slip past it. Even an array of all zeroes wouldn't match, since the regex never matches NUL. If the only possibilities are UTF-8 and binary, I'd be willing to trust this test.
正如其他人指出的那样,这样的测试只能告诉您 byte[] 可能包含 UTF-8 文本,而不能证明它确实包含。但这个正则表达式非常详尽,原始二进制数据似乎极不可能蒙混过关。即使是全零数组也不匹配,因为正则表达式从不匹配 NUL。如果唯一的可能性只有 UTF-8 和二进制,我愿意信任这个测试。
And while you're at it, you could strip the UTF-8 BOM if there is one; otherwise, the UTF-8 CharsetDecoder will pass it through as if it were text.
顺便一提,如果存在 UTF-8 BOM,你可以先把它去掉;否则,UTF-8 的 CharsetDecoder 会把它当作文本原样传递。
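A small illustrative helper for stripping that BOM, assuming the usual EF BB BF prefix (the helper is a sketch, not something from the original answer):
下面是去掉该 BOM 的一个小示例,假设前缀是常见的 EF BB BF(仅作示意,并非原回答内容):
import java.util.Arrays;

static byte[] stripUtf8Bom(byte[] data)
{
    // the UTF-8 BOM is the three bytes EF BB BF at the start of the data
    if (data.length >= 3
            && data[0] == (byte) 0xEF
            && data[1] == (byte) 0xBB
            && data[2] == (byte) 0xBF) {
        return Arrays.copyOfRange(data, 3, data.length);
    }
    return data;
}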
UTF-16 would be much more difficult, because there are very few byte sequences that are always invalid. The only ones I can think of offhand are high-surrogate characters that are missing their low-surrogate companions, or vice versa. Beyond that, you would need some context to decide whether a given sequence is valid. You might have a Cyrillic letter followed by a Chinese ideogram followed by a smiley-face dingbat, but it would be perfectly valid UTF-16.
UTF-16 会困难得多,因为始终无效的字节序列非常少。我能随口想到的只有缺少对应低代理项的高代理项字符(或反过来)。除此之外,你需要一些上下文才能判断给定序列是否有效。你可能会看到一个西里尔字母,后面跟着一个中文表意文字,再跟着一个笑脸符号,但这完全可以是有效的 UTF-16。
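For what it's worth, that surrogate-pairing check is easy to sketch once the bytes have been decoded as UTF-16 (the method name is illustrative):
顺带一提,一旦把字节按 UTF-16 解码,这个代理项配对检查很容易写出一个草图(方法名仅作示意):
static boolean hasWellFormedSurrogates(String decoded)
{
    for (int i = 0; i < decoded.length(); i++) {
        char c = decoded.charAt(i);
        if (Character.isHighSurrogate(c)) {
            // a high surrogate must be immediately followed by a low surrogate
            if (i + 1 >= decoded.length() || !Character.isLowSurrogate(decoded.charAt(i + 1))) {
                return false;
            }
            i++; // skip the low surrogate we just validated
        } else if (Character.isLowSurrogate(c)) {
            return false; // low surrogate with no preceding high surrogate
        }
    }
    return true;
}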
回答 by Stephen C
The question assumes that there is a fundamental difference between a string and binary data. While this is intuitively so, it is next to impossible to define precisely what that difference is.
该问题假设字符串和二进制数据之间存在根本区别。虽然这在直觉上是如此,但几乎不可能准确定义这种差异是什么。
A Java String is a sequence of 16-bit quantities that correspond to one of the (almost) 2**16 Unicode basic codepoints. But if you look at those 16-bit 'characters', each one could equally represent an integer, a pair of bytes, a pixel, and so on. The bit patterns don't have anything intrinsic about them that says what they represent.
Java 字符串是一个 16 位数量的序列,对应于(几乎)2**16 个 Unicode 基本代码点之一。但是,如果您查看那些 16 位“字符”,您会发现每个字符都可以等同地表示一个整数、一对字节、一个像素等。位模式没有任何内在的东西来说明它们代表什么。
Now suppose that you rephrased your question as asking for a way to distinguish UTF-8 encoded TEXT from arbitrary binary data. Does this help? In theory no, because the bit patterns that encode any written text can also be a sequence of numbers. (It is hard to say what "arbitrary" really means here. Can you tell me how to test if a number is "arbitrary"?)
现在假设您将问题重新表述为要求一种将 UTF-8 编码的文本与任意二进制数据区分开来的方法。这有帮助吗?理论上不会,因为编码任何书面文本的位模式也可以是数字序列。(这里很难说“任意”是什么意思。你能告诉我如何测试一个数字是否是“任意”的吗?)
The best we can do here is the following:
我们在这里最多只能做到以下几点:
- Test if the bytes are a valid UTF-8 encoding.
- Test if the decoded 16-bit quantities are all legal, "assigned" Unicode code points. (Some 16-bit quantities are illegal (e.g. 0xffff) and others are not currently assigned to correspond to any character.) But what if a text document really uses an unassigned codepoint?
- Test if the Unicode codepoints belong to the "planes" that you expect based on the assumed language of the document. But what if you don't know what language to expect, or if a document that uses multiple languages?
- Test if the sequences of codepoints look like words, sentences, or whatever. But what if we had some "binary data" that happened to include embedded text sequences?
- 测试字节是否是有效的 UTF-8 编码。
- 测试解码出的 16 位值是否都是合法的、“已分配”的 Unicode 代码点。(有些 16 位值是非法的(例如 0xffff),还有一些目前没有被分配给任何字符。)但如果文本文档确实使用了未分配的代码点呢?
- 根据文档的假定语言测试 Unicode 代码点是否属于您期望的“平面”。但是,如果您不知道期望使用哪种语言,或者文档使用多种语言怎么办?
- 测试代码点序列看起来是否像单词、句子之类的东西。但如果某些“二进制数据”恰好包含嵌入的文本序列呢?
In summary, you can tell that a byte sequence is definitely not UTF-8 if the decode fails. Beyond that, if you make assumptions about language, you can say that a byte sequence is probably or probably not a UTF-8 encoded text document.
总之,如果解码失败,您可以判断字节序列绝对不是 UTF-8。除此之外,如果您对语言做出假设,您可以说字节序列可能是、也可能不是 UTF-8 编码的文本文档。
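For the second check above, once the bytes have decoded cleanly, Character.isDefined can flag unassigned code points. A sketch, assuming Java 8+ for codePoints() (the method name is illustrative):
对于上面的第二项检查,在字节成功解码之后,可以用 Character.isDefined 标出未分配的代码点。一个示意(假设 Java 8+ 以便使用 codePoints();方法名仅作示意):
static boolean allCodepointsAssigned(String decoded)
{
    // true only if every code point is assigned in the Unicode version this JDK knows about
    return decoded.codePoints().allMatch(Character::isDefined);
}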
IMO, the best thing you can do is to avoid getting into a situation where your program needs to make this decision. And if you cannot avoid it, recognize that your program may get it wrong. With thought and hard work, you can make that unlikely, but the probability will never be zero.
IMO,你能做的最好的事情就是避免让你的程序陷入需要做这种判断的境地。如果无法避免,就要认识到你的程序可能会出错。通过深思熟虑和细致的工作,你可以让出错的概率变得很低,但它永远不会为零。
回答 by JamisonMan111
In the original question, "How can I check whether a byte array contains a Unicode string in Java?", I found that the term Java Unicode essentially refers to UTF-16 code units. I went through this problem myself and created some code that could help anyone with this type of question on their mind find some answers.
在原始问题“如何在 Java 中检查字节数组是否包含 Unicode 字符串?”中,我发现“Java Unicode”这个说法本质上指的是 UTF-16 代码单元。我自己也研究过这个问题,并写了一些代码,希望能帮助有同类疑问的人找到答案。
I have created two main methods: one will display the UTF-8 code units and the other the UTF-16 code units. UTF-16 code units are what you will encounter with Java and JavaScript, commonly seen in the form "\ud83d".
我创建了两个主要方法:一个用来显示 UTF-8 代码单元,另一个用来显示 UTF-16 代码单元。UTF-16 代码单元是您在 Java 和 JavaScript 中会遇到的形式,通常写作“\ud83d”。
For more help with code units and conversion, try this website:
如需有关代码单元和转换的更多帮助,可以访问这个网站:
https://r12a.github.io/apps/conversion/
Here is code...
这是代码...
import java.io.UnsupportedEncodingException;

public class CodeUnitsDemo
{
    // Wrapper class, main method and the sample value of `text` are added only so the
    // snippet compiles and runs as-is; substitute your own input string.
    public static void main(String[] args) throws UnsupportedEncodingException
    {
        String text = "Hello \uD83D\uDE00";

        byte[] array_bytes = text.getBytes("UTF-8"); // explicit charset, so the bytes really are UTF-8
        char[] array_chars = text.toCharArray();

        System.out.println();
        byteArrayToUtf8CodeUnits(array_bytes);
        System.out.println();
        charArrayToUtf16CodeUnits(array_chars);
    }

    public static void byteArrayToUtf8CodeUnits(byte[] byte_array)
    {
        System.out.println("array.length: = " + byte_array.length);
        for (int k = 0; k < byte_array.length; k++)
        {
            System.out.println("array byte: [" + k + "] converted to hex = " + byteToHex(byte_array[k]));
        }
    }

    public static void charArrayToUtf16CodeUnits(char[] char_array)
    {
        /* UTF-16 code units are what a Java char holds ("Java Unicode") */
        System.out.println("array.length: = " + char_array.length);
        for (int i = 0; i < char_array.length; i++)
        {
            System.out.println("array char: [" + i + "] converted to hex = " + charToHex(char_array[i]));
        }
    }

    // Returns the hex String representation of byte b
    public static String byteToHex(byte b)
    {
        char[] hexDigit =
        {
            '0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
        };
        char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
        return new String(array);
    }

    // Returns the hex String representation of char c
    public static String charToHex(char c)
    {
        byte hi = (byte) (c >>> 8);
        byte lo = (byte) (c & 0xff);
        return byteToHex(hi) + byteToHex(lo);
    }
}
回答 by Daniel Fortunov
If the byte array begins with a Byte Order Mark (BOM) then it will be easy to distinguish what encoding has been used. The standard Java classes for processing text streams will probably deal with this for you automatically.
如果字节数组以字节顺序标记(BOM)开头,则很容易区分使用的编码。用于处理文本流的标准 Java 类可能会自动为您处理这个问题。
If you do not have a BOM in your byte data this will be substantially more difficult — .NET classes can perform statistical analysis to try and work out the encoding, but I think this is on the assumption that you know that you are dealing with text data (just don't know which encoding was used).
如果你的字节数据中没有 BOM,事情会困难得多。.NET 的相关类可以通过统计分析来尝试推断编码,但我认为这是建立在你已经知道自己处理的是文本数据(只是不知道用的是哪种编码)这一假设之上的。
If you have any control over the format for your input data your best choice would be to ensure that it contains a Byte Order Mark.
如果您可以控制输入数据的格式,最好的选择是确保它包含字节顺序标记。
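For example, the most common BOMs can be sniffed by hand; a sketch covering only UTF-8 and UTF-16 (the helper name is illustrative):
例如,可以手工嗅探最常见的 BOM;下面的示意只覆盖 UTF-8 和 UTF-16(方法名仅作示意):
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

static Charset charsetFromBom(byte[] data)
{
    if (data.length >= 3
            && data[0] == (byte) 0xEF && data[1] == (byte) 0xBB && data[2] == (byte) 0xBF) {
        return StandardCharsets.UTF_8;
    }
    if (data.length >= 2 && data[0] == (byte) 0xFE && data[1] == (byte) 0xFF) {
        return StandardCharsets.UTF_16BE;
    }
    if (data.length >= 2 && data[0] == (byte) 0xFF && data[1] == (byte) 0xFE) {
        return StandardCharsets.UTF_16LE;
    }
    return null; // no recognizable BOM
}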
回答 by Thorbjørn Ravn Andersen
Try decoding it. If you do not get any errors, then it is a valid UTF-8 string.
尝试解码它。如果您没有收到任何错误,则它是一个有效的 UTF-8 字符串。
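Along those lines, a minimal sketch of the getString method from the question, assuming Java 8+ for java.util.Base64 (any decoding error means the data is treated as binary):
沿着这个思路,下面是问题中 getString 方法的一个最小示意,假设 Java 8+ 以便使用 java.util.Base64(任何解码错误都意味着把数据当作二进制处理):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final String getString(final byte[] dataToProcess)
{
    try {
        // strict decode: malformed or unmappable input throws instead of being silently replaced
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(dataToProcess))
                .toString();
    } catch (CharacterCodingException e) {
        // not valid UTF-8: fall back to Base64, as the question suggests for arbitrary data
        return Base64.getEncoder().encodeToString(dataToProcess);
    }
}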
回答 by Mubashar
I think Michael has explained it well in his answer; this may be the only way to find out whether a byte array contains all valid UTF-8 sequences. I am using the following code in PHP:
我认为 Michael 已经在他的回答中解释得很清楚了;这可能是判断字节数组是否全部由有效 UTF-8 序列组成的唯一方法。我在 PHP 中使用以下代码:
function is_utf8($string) {
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string);
}
Taken from W3.org
取自 W3.org

