C#: How to convert a possibly Windows 1252 'ANSI' encoded uploaded file to UTF8 in .NET?

Question

I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code-behind and output on the page later, using HtmlEncode.

But some of this output is becoming mangled; specifically, the symbol '£' is output as U+FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows 1252 ('ANSI') encoded.

The question is,

1. How do I determine whether the file is encoded as Windows 1252 or UTF8? It could be either.
2. How do I convert it to UTF8 if it is in Windows 1252, preserving the symbol £ etc.?

I've looked online but cannot find a satisfactory answer.

Answer 1

Answered by Jim Mischel

If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:

StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);

The "true" tells it to detect the encoding from a byte order mark (BOM) at the front of the file, if one is there. Otherwise it reads the file as Windows-1252.

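To make the effect of that flag concrete, here is a minimal sketch that reports which code page the reader ends up using. The class and method names are hypothetical. The `Encoding.RegisterProvider` call is an assumption for .NET Core / .NET 5+, where Windows-1252 is not available until the code pages provider is registered; on the classic .NET Framework it should not be needed.

```csharp
using System.IO;
using System.Text;

public static class BomDetection
{
    // Hypothetical helper: returns the code page the StreamReader settles on
    // after looking for a byte order mark at the start of the data.
    public static int DetectedCodePage(byte[] bytes)
    {
        // Assumption: needed on .NET Core / .NET 5+ so that code page 1252 can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.GetEncoding(1252), true))
        {
            // Detection only happens on the first read, so read one character first.
            reader.Read();
            return reader.CurrentEncoding.CodePage;
        }
    }
}
```

Data that starts with the UTF-8 BOM (EF BB BF) should report code page 65001, while data without a BOM stays on the fallback, 1252.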

You can then read the file and, if you want to convert to UTF-8, write to a file that you've opened with that encoding.

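The read-then-write step might look like the sketch below. It works on streams rather than file names, since the question deals with an uploaded stream. `EncodingConversion` and `ConvertToUtf8` are hypothetical names, and the `Encoding.RegisterProvider` call is again an assumption that only matters on .NET Core / .NET 5+.

```csharp
using System.IO;
using System.Text;

public static class EncodingConversion
{
    // Hypothetical helper: reads the input as Windows-1252 (unless a BOM says otherwise)
    // and writes the same text to the output as UTF-8 without a BOM.
    public static void ConvertToUtf8(Stream input, Stream output)
    {
        // Assumption: needed on .NET Core / .NET 5+ so that code page 1252 can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // leaveOpen: true keeps both streams usable by the caller afterwards.
        using (var reader = new StreamReader(input, windows1252, true, 4096, leaveOpen: true))
        using (var writer = new StreamWriter(output, new UTF8Encoding(false), 4096, leaveOpen: true))
        {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        }
    }
}
```

With this sketch, the Windows-1252 bytes A3 35 ("£5") come out as the UTF-8 bytes C2 A3 35. If the text is only going to be passed to HtmlEncode, decoding it into a string with the right encoding is already enough; writing UTF-8 bytes is only needed when the converted file itself has to be stored or sent on.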

The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.

I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.

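One common heuristic, offered here as a sketch and not as something from the answer itself, is to try a strict UTF-8 decode first and fall back to Windows-1252 only when the bytes are not valid UTF-8. The names `EncodingGuess` and `DecodeUtf8OrWindows1252` are hypothetical, and the `Encoding.RegisterProvider` call is an assumption for .NET Core / .NET 5+.

```csharp
using System.Text;

public static class EncodingGuess
{
    // Hypothetical helper: decodes as UTF-8 when the bytes are well-formed UTF-8,
    // otherwise decodes them as Windows-1252.
    public static string DecodeUtf8OrWindows1252(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes decoding throw on malformed UTF-8
        // instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            string text = strictUtf8.GetString(bytes);
            // GetString keeps a leading BOM as U+FEFF, so drop it if present.
            return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
        }
        catch (DecoderFallbackException)
        {
            // Assumption: needed on .NET Core / .NET 5+ so that code page 1252 can be resolved.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }
}
```

A lone byte A3 is not valid UTF-8, so it falls through to Windows-1252 and decodes as £; the pair C2 A3 is valid UTF-8 and also decodes as £. The guess can still be wrong, because some Windows-1252 text happens to form valid UTF-8 sequences, which is in line with the point that no method is 100% reliable.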

In practice, I've found the following to work for most of what I do:

StreamReader reader = new StreamReader("filename", Encoding.Default, true);

Most of the files I read are those that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. Other files that I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat them as a stream of bytes, which Encoding.Default does well.
