Note: this content comes from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/469859/

Time: 2020-08-04 04:39:54  Source: igfitidea

How do I convert from a possibly Windows 1252 'ANSI' encoded uploaded file to UTF8 in .NET?

c# asp.net vb.net unicode

Question

I've got a FileUpload control in an ASP.NET web page that is used to upload a file. Its contents (read as a stream) are processed in the C# code-behind and output on the page later, using HtmlEncode.

But some of this output is becoming mangled: specifically, the symbol '£' is output as U+FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows 1252 ('ANSI') encoded.

The question is,

  1. How do I determine whether the file is encoded as 1252 or UTF8? It could be either, and

  2. How do I convert it to UTF8 if it is in Windows 1252, preserving the symbol £ etc?

I've looked online but cannot find a satisfactory answer.

Answer by Jim Mischel

If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:

StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);

The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.
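
To make that behaviour concrete, here is a minimal sketch that feeds two in-memory byte arrays through the same constructor overload. The class and method names (BomDetectionDemo, Decode) are made up for illustration, and the RegisterProvider call is an assumption for modern .NET (.NET Core 3.0+ / .NET 5+), where code page 1252 is not available until the code pages provider is registered.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class BomDetectionDemo
{
    // Decodes bytes with BOM detection enabled and Windows-1252 as the fallback.
    // codePageUsed reports which encoding the reader actually settled on.
    public static string Decode(byte[] bytes, out int codePageUsed)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding(1252) will succeed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding fallback = Encoding.GetEncoding(1252);

        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, fallback, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read,
            // because that is when the reader inspects the BOM.
            codePageUsed = reader.CurrentEncoding.CodePage;
            return text;
        }
    }
}
```

With the single byte 0xA3 and no BOM, the reader should stay on code page 1252 and yield '£'. With a UTF-8 BOM (EF BB BF) followed by C2 A3, it should switch to UTF-8 (code page 65001) and yield the same character.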

You can then read the file and, if you want to convert to UTF-8, write to a file that you've opened with that encoding.
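
Since the question deals with an uploaded stream rather than a file on disk, a stream-to-stream version may be the more useful shape. The following is only a sketch under assumptions: Utf8Converter and ToUtf8 are invented names, the output is written without a BOM (a choice, not a requirement), and the RegisterProvider call assumes modern .NET where code page 1252 has to be registered first.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class Utf8Converter
{
    // Reads input as Windows-1252 (or as whatever its BOM says, if it has one)
    // and writes the same text to output as UTF-8 without a BOM.
    // Note: both streams are closed when the reader and writer are disposed.
    public static void ToUtf8(Stream input, Stream output)
    {
        // Assumption: needed on .NET Core / .NET 5+ to make code page 1252 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        using (var reader = new StreamReader(input, windows1252, true))
        using (var writer = new StreamWriter(output, new UTF8Encoding(false)))
        {
            // Copy in chunks so large uploads are not held in memory as one string.
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        }
    }
}
```

In an ASP.NET page, the input would presumably be the upload control's content stream, and the output a FileStream or MemoryStream of your choosing.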

The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
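
One common heuristic for the "UTF-8 or Windows-1252" case is to try a strict UTF-8 decode first and fall back to Windows-1252 if the bytes are not valid UTF-8. This is a sketch of that idea, not a method the answer itself prescribes. EncodingGuess and DecodeUtf8OrWindows1252 are invented names, and the RegisterProvider call again assumes modern .NET. It can guess wrong: a Windows-1252 file whose high bytes happen to form valid UTF-8 sequences would be read as UTF-8, although that is rare in ordinary text.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class EncodingGuess
{
    // Returns the text decoded as UTF-8 if the bytes are valid UTF-8,
    // otherwise decoded as Windows-1252.
    public static string DecodeUtf8OrWindows1252(byte[] bytes)
    {
        try
        {
            // Second argument (throwOnInvalidBytes) makes invalid sequences throw
            // instead of silently becoming U+FFFD.
            var strictUtf8 = new UTF8Encoding(false, true);
            string text = strictUtf8.GetString(bytes);

            // GetString does not strip a BOM, so drop a leading U+FEFF if present.
            if (text.Length > 0 && text[0] == '\uFEFF')
            {
                text = text.Substring(1);
            }
            return text;
        }
        catch (DecoderFallbackException)
        {
            // Assumption: needed on .NET Core / .NET 5+ to make code page 1252 available.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }
}
```

A lone 0xA3 byte is not valid UTF-8, so a Windows-1252 file containing '£' takes the fallback path, while a UTF-8 file containing C2 A3 decodes on the first attempt.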

I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.

In practice, I've found the following to work for most of what I do:

StreamReader reader = new StreamReader("filename", Encoding.Default, true);

Most of the files I read are those that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. Other files that I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat it as a stream of bytes, which Encoding.Default does well.
