Java: Encode byte[] as String

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/19894723/

Time: 2020-08-12 21:07:35  Source: igfitidea

Encode byte[] as String

Tags: java, encoding, utf-8, character-encoding, byte

Asked by maxammann

Heyho,

I want to convert byte data, which can be anything, to a String. My question is, whether it is "secure" to encode the byte data with UTF-8 for example:

String s1 = new String(data, "UTF-8");

or by using base64:

String s2 = Base64.encodeToString(data, false); //migbase64

I'm just afraid that using the first method has negative side effects. I mean both variants work perfectly, but s1 can contain any character of the UTF-8 charset, whereas s2 only uses "readable" characters. I'm just not sure whether it's really necessary to use base64. Basically I just need to create a String, send it over the network, and receive it again. (There is no other way in my situation :/)

The question is only about negative side effects, not if it's possible!

Accepted answer by Jon Skeet

You should absolutely use base64 or possibly hex. (Either will work; base64 is more compact but harder for humans to read.)

You claim "both variants work perfectly" but that's actually not true. If you use the first approach and data is not actually a valid UTF-8 sequence, you will lose data. You're not trying to convert UTF-8-encoded text into a String, so don't write code which tries to do so.
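
The data loss described above can be sketched as follows. This is a minimal illustration, not code from the answer: the class name Utf8RoundTrip, the helper roundTrip, and the sample bytes are all made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Decodes arbitrary bytes as UTF-8 and encodes the resulting String again.
    // (Hypothetical helper, introduced only for this illustration.)
    public static byte[] roundTrip(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF and 0xFE never occur in valid UTF-8; 0x80 is a stray continuation byte.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0x41, (byte) 0x80};
        byte[] back = roundTrip(data);

        // The decoder substitutes a replacement character for each malformed
        // sequence, so the original bytes cannot be recovered.
        System.out.println(Arrays.equals(data, back)); // false
    }
}
```

Bytes that happen to form valid UTF-8 text do survive the round trip, which is why the problem is easy to miss in casual testing.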

Using ISO-8859-1 as an encoding will preserve all the data - but in very many cases the string that is returned will not be easily transported across other protocols. It may very well contain unprintable control characters, for example.
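
Both halves of that statement can be checked with a small sketch. The class name Latin1RoundTrip and its helpers are hypothetical; Character.isISOControl is used here only as one reasonable definition of "control character".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Returns every possible byte value, 0x00 through 0xFF.
    public static byte[] allBytes() {
        byte[] bytes = new byte[256];
        for (int i = 0; i < 256; i++) {
            bytes[i] = (byte) i;
        }
        return bytes;
    }

    // Counts the characters that Character.isISOControl reports as control characters.
    public static int countControlChars(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isISOControl(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] data = allBytes();
        String s = new String(data, StandardCharsets.ISO_8859_1);

        // Every byte maps to exactly one char and back again.
        System.out.println(Arrays.equals(data, s.getBytes(StandardCharsets.ISO_8859_1))); // true

        // U+0000..U+001F and U+007F..U+009F are control characters: 32 + 33 = 65.
        System.out.println(countControlChars(s)); // 65
    }
}
```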

Only use the String(byte[], String) constructor when you've got inherently textual data, which you happen to have in an encoded form (where the encoding is specified as the second argument). For anything else - music, video, images, encrypted or compressed data, just for example - you should use an approach which treats the incoming data as "arbitrary binary data" and finds a textual encoding of it... which is precisely what base64 and hex do.
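
As a rough sketch of the two textual encodings recommended here, the following uses java.util.Base64 from the JDK (Java 8 and later) rather than the migbase64 library from the question, plus a hand-written hex codec. The class name BinaryToText and its method names are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Base64;

public class BinaryToText {
    public static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    // Two lowercase hex digits per byte, high nibble first.
    public static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Assumes an even-length string of hex digits; no validation beyond parseInt.
    public static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {0x00, (byte) 0xFF, 0x10, (byte) 0x80};
        System.out.println(toBase64(data)); // AP8QgA==
        System.out.println(toHex(data));    // 00ff1080
        System.out.println(Arrays.equals(data, fromBase64(toBase64(data)))); // true
        System.out.println(Arrays.equals(data, fromHex(toHex(data))));       // true
    }
}
```

Base64 needs roughly four characters per three bytes, while hex needs two characters per byte, which is the compactness trade-off mentioned at the top of the answer.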

Answer by Peter Lawrey

You can store bytes in a String, though it's not a good idea. You can't use UTF-8, as this will mangle the bytes, but a faster and more efficient way is to use ISO-8859-1 encoding, or plain 8-bit. The simplest way to do this is to use

String s1 = new String(data, 0); // String(byte[] ascii, int hibyte), a deprecated constructor

or

String s1 = new String(data, "ISO-8859-1");

From the Wikipedia article on UTF-8: as Jon Skeet notes, the following encodings are not valid under the standard. Their behaviour in Java varies. DataInputStream treats the first three versions as the same, and the next two throw an exception. The Charset decoder silently treats them as separate characters.

00000000 is \0
11000000 10000000 is \0
11100000 10000000 10000000 is \0
11110000 10000000 10000000 10000000 is \0
11111000 10000000 10000000 10000000 10000000 is \0
11111100 10000000 10000000 10000000 10000000 10000000 is \0

This means that if you see \0 in your String, you have no way of knowing for sure what the original byte[] values were. DataOutputStream uses the second option for compatibility with C, which sees \0 as a terminator.

BTW, DataOutputStream is not aware of code points, so it writes high code point characters in UTF-16 and then encodes that in UTF-8.

0xFE and 0xFF are not valid anywhere in a character. Byte values of binary 11000000 (0xC0) and above can only appear at the start of a character, not inside a multi-byte character.

Answer by neurite

Confirmed the accepted answer with Java. To repeat: UTF-8 and UTF-16 do not preserve all the byte values. ISO-8859-1 does preserve all the byte values. But if the encoded bytes are to be transported beyond the JVM, use Base64.

// Imports are assumed from the method names used below:
// Apache Commons Codec for Base64, JUnit 4 for the assertions.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.commons.codec.binary.Base64;
import org.junit.Test;

// The class name is hypothetical; the original answer shows only the methods.
public class EncodingRoundTripTest {

    @Test
    public void testBase64() {
        final byte[] original = enumerate();
        final String encoded = Base64.encodeBase64String( original );
        final byte[] decoded = Base64.decodeBase64( encoded );
        assertTrue( "Base64 preserves bytes", Arrays.equals( original, decoded ) );
    }

    @Test
    public void testIso8859() {
        final byte[] original = enumerate();
        String s = new String( original, StandardCharsets.ISO_8859_1 );
        final byte[] decoded = s.getBytes( StandardCharsets.ISO_8859_1 );
        assertTrue( "ISO-8859-1 preserves bytes", Arrays.equals( original, decoded ) );
    }

    @Test
    public void testUtf16() {
        final byte[] original = enumerate();
        String s = new String( original, StandardCharsets.UTF_16 );
        final byte[] decoded = s.getBytes( StandardCharsets.UTF_16 );
        assertFalse( "UTF-16 does not preserve bytes", Arrays.equals( original, decoded ) );
    }

    @Test
    public void testUtf8() {
        final byte[] original = enumerate();
        String s = new String( original, StandardCharsets.UTF_8 );
        final byte[] decoded = s.getBytes( StandardCharsets.UTF_8 );
        assertFalse( "UTF-8 does not preserve bytes", Arrays.equals( original, decoded ) );
    }

    @Test
    public void testEnumerate() {
        final Set<Byte> byteSet = new HashSet<>();
        final byte[] bytes = enumerate();
        for ( byte b : bytes ) {
            byteSet.add( b );
        }
        assertEquals( "Expecting 256 distinct values of byte.", 256, byteSet.size() );
    }

    /**
     * Enumerates all the byte values.
     */
    private byte[] enumerate() {
        final int length = Byte.MAX_VALUE - Byte.MIN_VALUE + 1;
        final byte[] bytes = new byte[length];
        for ( int i = 0; i < length; i++ ) {
            bytes[i] = (byte)(i + Byte.MIN_VALUE);
        }
        return bytes;
    }
}
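
Peter Lawrey's remarks about DataOutputStream, earlier on this page, can also be observed directly. The sketch below is only an illustration: the class name ModifiedUtf8Demo and its helpers are made up, and it relies on the documented behaviour of DataOutputStream.writeUTF, which writes a two-byte length followed by "modified UTF-8".

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    // Returns the bytes written by DataOutputStream.writeUTF,
    // including its two-byte length prefix.
    public static byte[] writeUtf(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Lowercase hex dump, two digits per byte.
    public static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Standard UTF-8 encodes NUL as the single byte 00.
        System.out.println(hex("\0".getBytes(StandardCharsets.UTF_8))); // 00
        // writeUTF uses the two-byte form C0 80 after the length prefix 00 02.
        System.out.println(hex(writeUtf("\0"))); // 0002c080

        // U+1F600 is one code point but two UTF-16 chars (a surrogate pair).
        String high = "\uD83D\uDE00";
        System.out.println(high.getBytes(StandardCharsets.UTF_8).length); // 4
        // writeUTF encodes each surrogate separately: 2 (length) + 3 + 3 bytes.
        System.out.println(writeUtf(high).length); // 8
    }
}
```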