Note: this page is adapted from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/472906/

Time: 2020-08-04 04:48:02  Source: igfitidea

How do I get a consistent byte representation of strings in C# without manually specifying an encoding?

Tags: c#, .net, string, character-encoding

Asked by Agnel Kurian

How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?

I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes into play here.

Also, why should encoding even be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?

Accepted answer by user541686

Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!

Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".
(And, of course, to be able to re-construct the string from the bytes.)

For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.

Just do this instead:

static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Do NOT use on arbitrary bytes; only use on GetBytes's output on the SAME system
static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

As long as your program (or other programs) doesn't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.

Additional benefit to this approach:

It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!

It will be encoded and decoded just the same, because you are just looking at the bytes.

If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.
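To make the point about invalid characters concrete, here is a minimal sketch. The class name RawStringBytes and the two RoundTrips helpers are made up for illustration; GetBytes and GetString simply repeat the methods above. It assumes the default behaviour of Encoding.UTF8, which replaces an unpaired surrogate with U+FFFD when encoding.

```csharp
using System;
using System.Text;

static class RawStringBytes
{
    // Copies the string's UTF-16 code units as-is, in the platform's byte order.
    public static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // Only meaningful for arrays produced by GetBytes on the same system.
    public static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }

    // True if the raw copy reproduces the string exactly.
    public static bool RawRoundTrips(string s) => GetString(GetBytes(s)) == s;

    // True if encoding and decoding through Encoding.UTF8 reproduces the string exactly.
    public static bool Utf8RoundTrips(string s) =>
        Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s)) == s;
}
```

For a string with an unpaired surrogate such as "a\uD800b", RawRoundTrips should return true while Utf8RoundTrips returns false. The raw bytes follow the platform's byte order, which is why GetString is restricted to output produced on the same system.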

Answer by gkrogers

byte[] strToByteArray(string str)
{
    System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
    return enc.GetBytes(str);
}

Answer by bmotmans

It depends on the encoding of your string (ASCII, UTF-8, ...).

For example:

byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);

A small sample of why encoding matters:

string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);

Console.WriteLine (ascii.Length); //Will print 1
Console.WriteLine (utf8.Length); //Will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); //Will print '?'

ASCII simply isn't equipped to deal with special characters.

Internally, the .NET framework uses UTF-16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes (...).
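As a small illustration of that last point, the sketch below wraps Encoding.Unicode, which is the little-endian UTF-16 encoding. The class name Utf16Demo and its helper names are hypothetical and exist only for this example.

```csharp
using System;
using System.Text;

static class Utf16Demo
{
    // Encoding.Unicode is UTF-16 little-endian: two bytes per char, low byte first.
    public static byte[] ToUtf16Le(string s) => Encoding.Unicode.GetBytes(s);

    // Decodes bytes that were produced with the same encoding.
    public static string FromUtf16Le(byte[] bytes) => Encoding.Unicode.GetString(bytes);

    // Formats bytes as hex pairs separated by dashes, for inspection.
    public static string Hex(byte[] bytes) => BitConverter.ToString(bytes);
}
```

For the "\u03a0" string from the sample above, Utf16Demo.Hex(Utf16Demo.ToUtf16Le("\u03a0")) gives "A0-03": two bytes, with the low byte first.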

See Character Encoding in the .NET Framework (MSDN) for more information.

Answer by cyberbobcat

// C# to convert a string to a byte array.
public static byte[] StrToByteArray(string str)
{
    System.Text.ASCIIEncoding  encoding=new System.Text.ASCIIEncoding();
    return encoding.GetBytes(str);
}


// C# to convert a byte array to a string.
byte [] dBytes = ...
string str;
System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
str = enc.GetString(dBytes);

Answer by Zhaph - Ben Duguid

You need to take the encoding into account, because 1 character could be represented by 1 or more bytes (up to about 6), and different encodings will treat these bytes differently.
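A quick way to see the variable width is to count bytes per character under two encodings. This is only a sketch: the class name ByteWidths and its methods are invented for the example. Note that the samples here top out at four bytes, which is the maximum in UTF-8 as currently standardized (RFC 3629).

```csharp
using System;
using System.Text;

static class ByteWidths
{
    // Number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8Count(string text) => Encoding.UTF8.GetByteCount(text);

    // Number of bytes the text occupies when encoded as UTF-16 (Encoding.Unicode).
    public static int Utf16Count(string text) => Encoding.Unicode.GetByteCount(text);
}
```

For example, Utf8Count returns 1 for "A", 2 for "\u00e9", 3 for "\u20ac" and 4 for "\U0001F600", while Utf16Count returns 2, 2, 2 and 4 for the same inputs.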

Joel has a posting on this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Answer by Hans Passant

The key issue is that a glyph in a string takes 32 bits (16 bits for a character code) but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that only contain ASCII characters. System.Text.Encoding has lots of ways to map a string to byte[]; you need to pick one that avoids loss of information and that is easy for your client to use when she needs to map the byte[] back to a string.

UTF-8 is a popular encoding; it is compact and not lossy.
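One way to check whether a candidate encoding loses information for a given string is to round-trip the string through it. The helper below is a hypothetical sketch (LossCheck.RoundTrips is not a framework API), and it relies on the default replacement behaviour of the built-in encodings, where ASCII turns unsupported characters into '?'.

```csharp
using System;
using System.Text;

static class LossCheck
{
    // True if encoding and then decoding with the same encoding returns the original string.
    public static bool RoundTrips(Encoding encoding, string s) =>
        encoding.GetString(encoding.GetBytes(s)) == s;
}
```

With this helper, LossCheck.RoundTrips(Encoding.UTF8, "\u03b9 Hello \u043b\u043b") is true, while the same call with Encoding.ASCII is false because the Greek and Cyrillic letters are replaced.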

Answer by Ed Marty

I'm not sure, but I think the string stores its info as an array of Chars, which is inefficient with bytes. Specifically, the definition of a Char is "Represents a Unicode character".

Take this sample:

String str = "asdf éß";
String str2 = "asdf gh";
EncodingInfo[] info = Encoding.GetEncodings();
foreach (EncodingInfo enc in info)
{
    // Print the byte count of each string, separated so the two numbers stay readable.
    System.Console.WriteLine(enc.Name + " - "
      + enc.GetEncoding().GetByteCount(str) + " / "
      + enc.GetEncoding().GetByteCount(str2));
}

Take note that the Unicode answer is 14 bytes in both instances, whereas the UTF-8 answer is only 9 bytes for the first, and only 7 for the second.

So if you just want the bytes used by the string, simply use Encoding.Unicode, but it will be inefficient with storage space.

Answer by Joel Coehoorn

The first part of your question (how to get the bytes) was already answered by others: look in the System.Text.Encoding namespace.

I will address your follow-up question: why do you need to pick an encoding? Why can't you get that from the string class itself?

The answer is in two parts.

First of all, the bytes used internally by the string class don't matter, and whenever you assume they do you're likely introducing a bug.

If your program is entirely within the .Net world then you don't need to worry about getting byte arrays for strings at all, even if you're sending data across a network. Instead, use .Net Serialization to worry about transmitting the data. You don't worry about the actual bytes any more: the Serialization formatter does it for you.

On the other hand, what if you are sending these bytes somewhere that you can't guarantee will pull in data from a .Net serialized stream? In this case you definitely do need to worry about encoding, because obviously this external system cares. So again, the internal bytes used by the string don't matter: you need to pick an encoding so you can be explicit about this encoding on the receiving end, even if it's the same encoding used internally by .Net.

I understand that in this case you might prefer to use the actual bytes stored by the string variable in memory where possible, with the idea that it might save some work creating your byte stream. However, I put it to you that this is just not important compared to making sure that your output is understood at the other end, and to guarantee that, you must be explicit with your encoding. Additionally, if you really want to match your internal bytes, you can already just choose the Unicode encoding and get that performance savings.

Which brings me to the second part... picking the Unicode encoding is telling .Net to use the underlying bytes. You do need to pick this encoding, because when some new-fangled Unicode-Plus comes out, the .Net runtime needs to be free to use this newer, better encoding model without breaking your program. But for the moment (and the foreseeable future), just choosing the Unicode encoding gives you what you want.

It's also important to understand that your string has to be re-written to the wire, and that involves at least some translation of the bit pattern even when you use a matching encoding. The computer needs to account for things like big vs. little endian, network byte order, packetization, session information, etc.
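The byte-order point can be seen directly with the two UTF-16 encodings that ship with .NET, Encoding.Unicode (little-endian) and Encoding.BigEndianUnicode. The wrapper class ByteOrderDemo below is hypothetical and only meant as a sketch.

```csharp
using System;
using System.Text;

static class ByteOrderDemo
{
    // UTF-16 with the low byte of each code unit first.
    public static string LittleEndianHex(string s) =>
        BitConverter.ToString(Encoding.Unicode.GetBytes(s));

    // UTF-16 with the high byte of each code unit first.
    public static string BigEndianHex(string s) =>
        BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s));
}
```

For the single character "A", LittleEndianHex gives "41-00" and BigEndianHex gives "00-41". The text is the same and both are UTF-16, yet the bytes differ, so the receiving end has to know which one was used.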

Answer by Michael Buen

BinaryFormatter bf = new BinaryFormatter();
byte[] bytes;
MemoryStream ms = new MemoryStream();

string orig = "ι Hello лл Thank You";
bf.Serialize(ms, orig);
ms.Seek(0, 0);
bytes = ms.ToArray();

MessageBox.Show("Original bytes Length: " + bytes.Length.ToString());

MessageBox.Show("Original string Length: " + orig.Length.ToString());

for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo encrypt
for (int i = 0; i < bytes.Length; ++i) bytes[i] ^= 168; // pseudo decrypt

BinaryFormatter bfx = new BinaryFormatter();
MemoryStream msx = new MemoryStream();            
msx.Write(bytes, 0, bytes.Length);
msx.Seek(0, 0);
string sx = (string)bfx.Deserialize(msx);

MessageBox.Show("Still intact :" + sx);

MessageBox.Show("Deserialize string Length(still intact): " 
    + sx.Length.ToString());

BinaryFormatter bfy = new BinaryFormatter();
MemoryStream msy = new MemoryStream();
bfy.Serialize(msy, sx);
msy.Seek(0, 0);
byte[] bytesy = msy.ToArray();

MessageBox.Show("Deserialize bytes Length(still intact): " 
   + bytesy.Length.ToString());

Answer by Michael Buen

Two ways:

public static byte[] StrToByteArray(this string s)
{
    List<byte> value = new List<byte>();
    foreach (char c in s)
        // char has no ToByte() method; Convert.ToByte throws OverflowException above U+00FF
        value.Add(Convert.ToByte(c));
    return value.ToArray();
}

And,

public static byte[] StrToByteArray(this string s)
{
    s = s.Replace(" ", string.Empty);
    byte[] buffer = new byte[s.Length / 2];
    for (int i = 0; i < s.Length; i += 2)
        buffer[i / 2] = (byte)Convert.ToByte(s.Substring(i, 2), 16);
    return buffer;
}

I tend to use the bottom one more often than the top; I haven't benchmarked them for speed.
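Note that the bottom method does not convert the characters themselves: it reads the string as hexadecimal text, two digits per byte. A standalone sketch of that behaviour follows; the class name HexText, the method name Parse and the length check are additions for illustration and are not part of the answer's code.

```csharp
using System;

static class HexText
{
    // Parses hex digit pairs such as "DE AD BE EF" into bytes; spaces are ignored.
    public static byte[] Parse(string hex)
    {
        hex = hex.Replace(" ", string.Empty);
        if (hex.Length % 2 != 0)
            throw new FormatException("Expected an even number of hex digits.");

        byte[] buffer = new byte[hex.Length / 2];
        for (int i = 0; i < hex.Length; i += 2)
            buffer[i / 2] = Convert.ToByte(hex.Substring(i, 2), 16);
        return buffer;
    }
}
```

For example, HexText.Parse("DE AD BE EF") returns four bytes, and BitConverter.ToString(HexText.Parse("48656C6C6F")) gives "48-65-6C-6C-6F".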

Answer by Konamiman

Also please explain why encoding should be taken into consideration. Can't I simply get what bytes the string has been stored in? Why this dependency on encoding?!!!

Because there is no such thing as "the bytes of the string".

A string (or more generically, a text) is composed of characters: letters, digits, and other symbols. That's all. Computers, however, do not know anything about characters; they can only handle bytes. Therefore, if you want to store or transmit text by using a computer, you need to transform the characters to bytes. How do you do that? Here's where encodings come onto the scene.

An encoding is nothing but a convention to translate logical characters to physical bytes. The simplest and best known encoding is ASCII, and it is all you need if you write in English. For other languages you will need more complete encodings, with any of the Unicode flavours being the safest choice nowadays.

So, in short, trying to "get the bytes of a string without using encodings" is as impossible as "writing a text without using any language".

By the way, I strongly recommend you (and anyone, for that matter) to read this small piece of wisdom: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
