Java: Implement a function to check if a string/byte array follows UTF-8 format
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/28890907/
Asked by DoraShine
I am trying to solve this interview question.
After being given a clear definition of the UTF-8 format (e.g. 1 byte: 0b0xxxxxxx, 2 bytes: ....), I was asked to write a function that validates whether the input is valid UTF-8. The input will be a string/byte array, and the output should be yes/no.
I have two possible approaches.
First, if the input is a string: since UTF-8 is at most 4 bytes, after removing the first two characters "0b" we can use Integer.parseInt(s) to check whether the rest of the string is in the range 0 to 10FFFF. It is better to first check that the length of the string is a multiple of 8 and that the string contains only 0s and 1s. So I will have to go through the string twice, and the complexity will be O(n).
Second, if the input is a byte array (this method also works if the input is a string), we check that each 1-byte element is in the correct range. If the input is a string, first check that its length is a multiple of 8, then check that each 8-character substring is in the range.
I know there are a couple of solutions for checking a string using Java libraries, but my question is how I should implement the function based on the question as asked.
Thanks a lot.
Accepted answer by DoraShine
Well, I am grateful for the comments and the answer. First of all, I have to agree that this is "another stupid interview question". It is true that in Java a String is already encoded, so it will always be compatible with UTF-8. Given a string, one way to check it is:
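import java.io.UnsupportedEncodingException;

public static boolean isUTF8(String s) {
    try {
        // Encode the String as UTF-8; the only checked failure is an unknown charset name
        byte[] bytes = s.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}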
However, since all printable strings are already in Unicode form, I never had a chance to get an error.
Second, if given a byte array, each element will always be in the range -2^7 (0b10000000) to 2^7 - 1 (0b01111111), so it will always be in a valid UTF-8 range.
My initial understanding of the question was that, given a string such as "0b11111111", I should check whether it is valid UTF-8. I guess I was wrong.
Moreover, Java does provide a constructor to convert a byte array to a string, and if you are interested in the decode method, check here.
One more thing: the above answer would be correct in another language. The only improvement could be:
In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.
So 4 bytes would be enough.
I am definitely new to this, so correct me if I am wrong. Thanks a lot.
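The RFC 3629 restriction quoted above can be turned into concrete checks. The following is only a sketch under my own assumptions, not code from the original answers: the class name Utf8Strict and the method isValid are hypothetical. It accepts at most 4-byte sequences and also rejects overlong forms, UTF-16 surrogates, and code points above U+10FFFF.

```java
final class Utf8Strict {
    // Hypothetical helper: true if the bytes form well-formed UTF-8 in the RFC 3629 sense.
    static boolean isValid(byte[] in) {
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;          // treat the byte as unsigned
            if (b0 < 0x80) {                // 0xxxxxxx: single-byte (ASCII)
                i++;
                continue;
            }
            int len;  // total number of bytes in this sequence
            int cp;   // code point accumulated so far
            int min;  // smallest code point that really needs `len` bytes
            if ((b0 & 0xE0) == 0xC0) {
                len = 2; cp = b0 & 0x1F; min = 0x80;
            } else if ((b0 & 0xF0) == 0xE0) {
                len = 3; cp = b0 & 0x0F; min = 0x800;
            } else if ((b0 & 0xF8) == 0xF0) {
                len = 4; cp = b0 & 0x07; min = 0x10000;
            } else {
                return false;               // stray continuation byte or 5-/6-byte lead
            }
            if (i + len > in.length) {
                return false;               // sequence is truncated
            }
            for (int k = 1; k < len; k++) {
                int b = in[i + k] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    return false;           // continuation bytes must look like 10xxxxxx
                }
                cp = (cp << 6) | (b & 0x3F);
            }
            if (cp < min) return false;                      // overlong encoding
            if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
            i += len;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC};       // U+20AC
        byte[] fiveByteLead = {(byte) 0xF8, (byte) 0x88, (byte) 0x80, (byte) 0x80, (byte) 0x80};
        System.out.println(isValid(euro));          // prints true
        System.out.println(isValid(fiveByteLead));  // prints false
    }
}
```

Decoding the code point, rather than only matching bit patterns, is what makes the overlong and range checks possible.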
Answer by Jean-François Savard
Let's first have a look at a visual representation of the UTF-8 design.
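The layout can also be inspected directly from Java. The sketch below is an illustration I am adding, not part of the original answer; the class Utf8Layout and the helper bits are hypothetical names. It encodes a few sample code points and prints each resulting byte in binary, so the leading-bit patterns (0xxxxxxx for a single byte, 110xxxxx / 1110xxxx / 11110xxx for lead bytes, 10xxxxxx for continuation bytes) become visible.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Layout {
    // Hypothetical helper: the UTF-8 bytes of one code point as space-separated 8-bit strings.
    static String bits(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String s = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros so every byte is shown with exactly 8 digits
            sb.append("00000000".substring(s.length())).append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600}; // 1-, 2-, 3- and 4-byte examples
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, bits(cp));
        }
    }
}
```

For U+0041 this prints 01000001, for U+00E9 it prints 11000011 10101001, and for U+20AC it prints 11100010 10000010 10101100.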
Now let's summarize what we have to do.
- Loop over all characters of the string (each character being a byte).
- We need to apply a mask to each byte depending on the code point, since the x characters represent the actual code point. We will use the binary AND operator (&), which copies a bit to the result if it exists in both operands.
- The goal of applying a mask is to remove the trailing bits, so that only the leading bits of the byte are compared. We will do the bitwise operation using 0b1xxxxxxx, where 1 appears "bytes in sequence" times and the other bits are 0.
- We can then compare the result with the first byte to verify that it is valid, and also determine how many bytes the sequence contains.
- If the byte matches none of the cases, it is invalid and we return "No".
- If we get out of the loop, every character is valid, hence the string is valid.
- Make sure the comparison that returned true corresponds to the expected length.
The method would look like this:
public static final boolean isUTF8(final byte[] pText) {
    int expectedLength = 0;
    for (int i = 0; i < pText.length; i++) {
        // Classify the leading byte by its high bits to get the sequence length
        if ((pText[i] & 0b10000000) == 0b00000000) {
            expectedLength = 1;
        } else if ((pText[i] & 0b11100000) == 0b11000000) {
            expectedLength = 2;
        } else if ((pText[i] & 0b11110000) == 0b11100000) {
            expectedLength = 3;
        } else if ((pText[i] & 0b11111000) == 0b11110000) {
            expectedLength = 4;
        } else if ((pText[i] & 0b11111100) == 0b11111000) {
            // 5- and 6-byte forms come from the pre-RFC 3629 definition of UTF-8
            expectedLength = 5;
        } else if ((pText[i] & 0b11111110) == 0b11111100) {
            expectedLength = 6;
        } else {
            return false;
        }
        // Every remaining byte of the sequence must be a continuation byte (10xxxxxx)
        while (--expectedLength > 0) {
            if (++i >= pText.length) {
                return false; // input ends in the middle of a sequence
            }
            if ((pText[i] & 0b11000000) != 0b10000000) {
                return false;
            }
        }
    }
    return true;
}
Edit: The method shown above is not the original one (almost, but not quite) and is taken from here. The original one did not work properly, as per @EJP's comment.
Answer by Thiago Mata
A small solution for real world UTF-8 compatibility checking:
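import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public static final boolean isUTF8(final byte[] inputBytes) {
    // Decoding replaces malformed input with U+FFFD, so invalid bytes do not survive a round trip
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}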
You can check the test results:
@Test
public void testEnconding() {
    byte[] invalidUTF8Bytes1 = new byte[]{(byte)0b10001111, (byte)0b10111111 };
    byte[] invalidUTF8Bytes2 = new byte[]{(byte)0b10101010, (byte)0b00111111 };
    byte[] validUTF8Bytes1 = new byte[]{(byte)0b11001111, (byte)0b10111111 };
    byte[] validUTF8Bytes2 = new byte[]{(byte)0b11101111, (byte)0b10101010, (byte)0b10111111 };
    assertThat(isUTF8(invalidUTF8Bytes1)).isFalse();
    assertThat(isUTF8(invalidUTF8Bytes2)).isFalse();
    assertThat(isUTF8(validUTF8Bytes1)).isTrue();
    assertThat(isUTF8(validUTF8Bytes2)).isTrue();
    assertThat(isUTF8("\u24b6".getBytes(StandardCharsets.UTF_8))).isTrue();
}
Test cases copied from https://codereview.stackexchange.com/questions/59428/validating-utf-8-byte-array
Answer by Koray Tugay
public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Skip a leading byte order mark (EF BB BF), if present
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }
    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }
        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Not a valid leading byte (stray continuation byte or a 5-/6-byte lead)
            return false;
        }
        if (end >= j) {
            // Truncated sequence: not enough bytes left for the continuation bytes
            return false;
        }
        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}
Answer by benez
The CharsetDecoder might be what you are looking for:
@Test
public void testUTF8() throws CharacterCodingException {
    // the desired charset
    final Charset UTF8 = Charset.forName("UTF-8");
    // prepare decoder so that it reports errors instead of replacing bad input
    final CharsetDecoder decoder = UTF8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    // random bytes are, with overwhelming probability, not valid UTF-8
    byte[] bytes = new byte[48];
    new Random().nextBytes(bytes);
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    try {
        decoder.decode(buffer);
        fail("Should not be UTF-8");
    } catch (final CharacterCodingException e) {
        // noop, decoding is expected to fail here
    }
    final String string = "hallo welt!";
    bytes = string.getBytes(UTF8);
    buffer = ByteBuffer.wrap(bytes);
    final String result = decoder.decode(buffer).toString();
    assertEquals(string, result);
}
So your function might look like this:
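import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public static boolean checkEncoding(final byte[] bytes, final String encoding) {
    final CharsetDecoder decoder = Charset.forName(encoding).newDecoder();
    // REPORT makes the decoder throw instead of silently substituting a replacement character
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    final ByteBuffer buffer = ByteBuffer.wrap(bytes);
    try {
        decoder.decode(buffer);
        return true;
    } catch (final CharacterCodingException e) {
        return false;
    }
}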
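To see how a decoder-based check behaves on a few edge cases, here is a small self-contained sketch. It is my own illustration rather than part of the answer: the class DecoderCheckDemo and the method isUtf8 are hypothetical names, and it fixes the charset to StandardCharsets.UTF_8 instead of taking an encoding name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class DecoderCheckDemo {
    // Hypothetical variant of the check above, fixed to UTF-8.
    static boolean isUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "hallo welt!".getBytes(StandardCharsets.UTF_8);
        byte[] loneContinuation = {(byte) 0x80};        // continuation byte without a lead byte
        byte[] truncated = {(byte) 0xC3};               // 2-byte lead with nothing after it
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};   // overlong encoding of U+0000
        System.out.println(isUtf8(valid));              // prints true
        System.out.println(isUtf8(loneContinuation));   // prints false
        System.out.println(isUtf8(truncated));          // prints false
        System.out.println(isUtf8(overlong));           // prints false
    }
}
```

A fresh decoder is created on each call because a CharsetDecoder is stateful and not safe to share between threads.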