Storing strings as UTF8 in C#

Question

Asked by PhantomDrummer

I'm doing a lot of string manipulation in C#, and really need the strings to be stored one byte per character. This is because I need gigabytes of text simultaneously in memory and it's causing low memory issues. I know for certain that this text will never contain non-ASCII characters, so for my purposes, the fact that System.String and System.Char store everything as two bytes per character is both unnecessary and a real problem.

I'm about to start coding my own CharAscii and StringAscii classes - the string one will basically hold its data as byte[], and expose string manipulation methods similar to the ones that System.String does. However this seems a lot of work to do something that seems like a very standard problem, so I'm really posting here to check that there isn't already an easier solution. Is there for example some way I can make System.String internally store data as UTF8 that I haven't noticed, or some other way round the problem?

Answer 1

Accepted answer by Chris

As you've found, the CLR uses UTF-16 for character encoding. Your best bet may be to use the Encoding classes & a BitConverter to handle the text. This question has some good examples for converting between the two encodings:

Convert String (UTF-16) to UTF-8 in C#

Answer 2

Answered by KeithS

Well, you could create a wrapper that retrieves the data as UTF-8 bytes and converts pieces as needed to System.String, then vice-versa to push the string back out to memory. The Encoding class will help you out here:

    // Requires: using System.Text;
    var utf8 = Encoding.UTF8;
    byte[] utfBytes = utf8.GetBytes(myString);        // UTF-16 string -> UTF-8 bytes

    var myReturnedString = utf8.GetString(utfBytes);  // UTF-8 bytes -> UTF-16 string
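
A minimal sketch of such a wrapper follows. The class name `AsciiText` and its members are hypothetical, chosen only for illustration, and the sketch assumes the input is pure ASCII as stated in the question. It keeps the text as one byte per character and decodes only the slice that is currently needed.

```csharp
using System;
using System.Text;

// Hypothetical wrapper (not part of .NET): stores text as one byte per
// character and creates a System.String only for the piece being used.
public sealed class AsciiText
{
    private readonly byte[] _bytes;

    public AsciiText(string s)
    {
        // Encoding.ASCII replaces non-ASCII characters with '?', so this
        // sketch rejects them up front rather than losing data silently.
        foreach (char c in s)
        {
            if (c > 127)
                throw new ArgumentException("Non-ASCII character in input.", nameof(s));
        }
        _bytes = Encoding.ASCII.GetBytes(s);
    }

    // Number of characters, which equals the number of bytes for ASCII.
    public int Length => _bytes.Length;

    // Decode only the requested slice into a temporary UTF-16 string.
    public string Slice(int start, int count) =>
        Encoding.ASCII.GetString(_bytes, start, count);

    public override string ToString() => Encoding.ASCII.GetString(_bytes);
}
```

With this in place, `new AsciiText("hello world").Slice(6, 5)` decodes just the five bytes of "world". The long-lived copy stays at one byte per character, and only short-lived slices pay the UTF-16 cost.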

Answer 3

Answered by Jon Hanna

Not really. System.String is designed for storing strings. Your requirement is for a very particular subset of strings with particular memory benefits.

Now, "very particular subset of strings with particular memory benefits" comes up a lot, but not always the same very particular subset. Code that is ASCII-only isn't for reading by human beings, so it tends to be either short codes, or something that can be handled in a stream-processing manner, or else chunks of text merged in with bytes doing other jobs (e.g. quite a few binary formats will have small bits that translate directly to ASCII).

As such, you've a pretty strange requirement.

All the more so when you come to the gigabytes part. If I'm dealing with gigs, I'm immediately thinking about how I can stop having to deal with gigs, and/or get much more serious savings than just 50%. I'd be thinking about mapping chunks I'm not currently interested in to a file, or about ropes, or about a bunch of other things. Of course, those are going to work for some cases and not for all, so yet again, we're not talking about something where .NET should stick in something as a one-size-fits-all, because one size will not fit all.
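
As one illustration of the "map chunks to a file" idea, here is a rough sketch using memory-mapped files from `System.IO.MemoryMappedFiles`. The class and method names (`MappedText`, `ReadChunk`) are made up for this example, and it assumes the text already sits in an ASCII-encoded file on disk and that the requested range lies within the file.

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

// Hypothetical helper: pulls one chunk of an ASCII file into memory
// on demand instead of holding the whole text in RAM.
public static class MappedText
{
    // Reads `count` bytes starting at byte `offset` and decodes them.
    // Assumes offset + count does not exceed the file length.
    public static string ReadChunk(string path, long offset, int count)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(offset, count))
        {
            var buffer = new byte[count];
            view.ReadArray(0, buffer, 0, count);
            return Encoding.ASCII.GetString(buffer);
        }
    }
}
```

Opening the mapping on every call keeps the sketch short. A real program would more likely keep the `MemoryMappedFile` open and create views as needed. Whether this helps at all depends on the access pattern, which is the point made above about one size not fitting all.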

Beyond that, just the UTF-8 bit isn't that hard. It's all the other methods that become work. Again, what you need there won't be the same as what someone else needs.
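
To give a feel for what "all the other methods" involves, here is a small sketch of two string-like operations written directly against `byte[]`. The names `AsciiOps`, `IndexOf` and `ToUpperInPlace` are hypothetical, and the code assumes ASCII-only data. Every System.String method you rely on would need a similar hand-written counterpart.

```csharp
using System;
using System.Text;

// Hypothetical helpers operating on ASCII bytes without decoding them.
public static class AsciiOps
{
    // Naive substring search over the raw bytes; returns the index of
    // the first match, or -1 if the pattern does not occur.
    public static int IndexOf(byte[] text, byte[] pattern)
    {
        for (int i = 0; i + pattern.Length <= text.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && text[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Upper-cases ASCII letters in place; all other bytes are left alone.
    public static void ToUpperInPlace(byte[] text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] >= (byte)'a' && text[i] <= (byte)'z')
                text[i] -= 32; // 'a' - 'A' == 32 in ASCII
        }
    }
}
```

Culture-aware comparison, trimming, splitting, formatting and so on would each need the same treatment, so it pays to implement only the operations your workload actually uses.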

Answer 4

Answered by Thanatos

As I see it, your problem is that a char in C# occupies 2 bytes instead of one.

One way to read a text file is to open it with:

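    byte[] buffer;
    using (System.IO.FileStream fs = new System.IO.FileStream(file, System.IO.FileMode.Open))
    using (System.IO.BinaryReader br = new System.IO.BinaryReader(fs))
    {
        // Size the read from the file length. A fixed 1024-byte buffer would
        // throw on larger files. The int cast limits this to files under 2 GB.
        buffer = br.ReadBytes((int)fs.Length);
    }
    // The using blocks close the reader and the stream, even on exceptions.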

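This way you are reading the raw bytes from the file. I tried it with *.txt files encoded in UTF-8 and in ANSI. Note that UTF-8 uses 1 byte for each ASCII character (and 2 to 4 bytes only for non-ASCII characters), and a single-byte ANSI code page uses 1 byte per char, so ASCII-only text takes 1 byte per character in either encoding.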
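
How many bytes a piece of text takes in each encoding can be checked directly with `Encoding.GetByteCount`. The helper class `ByteCounts` below is only an illustrative wrapper around that call, not an existing API.

```csharp
using System;
using System.Text;

// Hypothetical helper: reports the encoded size of a string.
public static class ByteCounts
{
    // Bytes needed to store the text as UTF-8.
    public static int Utf8(string s) => Encoding.UTF8.GetByteCount(s);

    // Bytes needed to store the text as UTF-16, the in-memory
    // representation used by System.String.
    public static int Utf16(string s) => Encoding.Unicode.GetByteCount(s);
}
```

For the ASCII-only text `"hello"`, `ByteCounts.Utf8` returns 5 and `ByteCounts.Utf16` returns 10. A non-ASCII character such as `"\u00e9"` (é) takes 2 bytes in both encodings.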
