Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must follow the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/18915633/

Determine TextFile Encoding?

.net, vb.net, unicode, encoding, character-encoding

Asked by ElektroStudios

I need to determine whether a text file's content is in one of these text encodings:

System.Text.Encoding.ASCII
System.Text.Encoding.BigEndianUnicode ' UTF-16 (big endian)
System.Text.Encoding.Default ' ANSI
System.Text.Encoding.Unicode ' UTF-16 (little endian)
System.Text.Encoding.UTF32
System.Text.Encoding.UTF7
System.Text.Encoding.UTF8

I don't know how to read the byte order marks of the files. I've seen snippets that do this, but they can only determine whether a file is ASCII or Unicode, so I need something more universal.

Answered by Steven Doggart

The first step is to load the file as a byte array instead of as a string. Strings are always stored in memory with UTF-16 encoding, so once it's loaded into a string, the original encoding is lost. Here's a simple example of one way to load a file into a byte array:

Dim data() As Byte = File.ReadAllBytes("test.txt")

Automatically determining the correct encoding for a given byte array is notoriously difficult. Sometimes, to be helpful, the author of the data will insert something called a BOM (Byte Order Mark) at the beginning of the data. If a BOM is present, that makes detecting the encoding painless, since each encoding uses a different BOM.

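Purely as an illustration (the file path is just an example, and it assumes the usual Imports System and Imports System.IO), here is a minimal sketch that dumps the first few bytes of a file in hex so you can see for yourself whether a BOM is present:

' Print the first few bytes in hex; e.g. a UTF-8 file with a BOM starts with "EF BB BF".
Dim firstBytes() As Byte = File.ReadAllBytes("test.txt") ' example path
For i As Integer = 0 To Math.Min(3, firstBytes.Length - 1)
    Console.Write("{0:X2} ", firstBytes(i))
Next
Console.WriteLine()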

The easiest way to automatically detect the encoding from the BOM is to let the StreamReader do it for you. In the constructor of the StreamReader, you can pass True for the detectEncodingFromByteOrderMarks argument. Then you can get the encoding of the stream by accessing its CurrentEncoding property. However, the CurrentEncoding property won't work until after the StreamReader has read the BOM. So you first have to read past the BOM before you can get the encoding, for instance:

Public Function GetFileEncoding(filePath As String) As Encoding
    Using sr As New StreamReader(filePath, True)
        sr.Read()
        Return sr.CurrentEncoding
    End Using
End Function
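
As a quick usage sketch (the file name here is just an example), you could then call it like this:

Dim enc As Encoding = GetFileEncoding("test.txt")
Console.WriteLine(enc.EncodingName)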

However, the problem with this approach is that the MSDN documentation seems to imply that the StreamReader may only detect certain kinds of encodings:

The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. See the Encoding.GetPreamble method for more information.

Also, if the StreamReader is incapable of determining the encoding from the BOM, or if the BOM isn't there, it will just default to UTF-8 encoding, without giving you any indication that it failed. If you need more granular control than that, you can pretty easily read the BOM and interpret it yourself. All you have to do is compare the first few bytes in the byte array with some known, expected BOMs to see if they match. Here is a list of some common BOMs:

  • UTF-8: EF BB BF
  • UTF-16 big endian byte order: FE FF
  • UTF-16 little endian byte order: FF FE
  • UTF-32 big endian byte order: 00 00 FE FF
  • UTF-32 little endian byte order: FF FE 00 00

So, for instance, to see if a UTF-16 (little endian) BOM exists at the beginning of the byte array, you could simply do something like this:

If (data(0) = &HFF) And (data(1) = &HFE) Then
    ' Data starts with UTF-16 (little endian) BOM
End If

Conveniently, the Encoding class in .NET contains a method called GetPreamble which returns the BOM used by the encoding, so you don't even need to remember what they all are. So, to check if a byte array starts with the BOM for Unicode (UTF-16, little-endian), you could just do this:

Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    If (data(0) = bom(0)) And (data(1) = bom(1)) Then
        Return True
    Else
        Return False
    End If
End Function

Of course, the above function assumes that the data is at least two bytes long and that the BOM is exactly two bytes. So, while it illustrates how to do it as clearly as possible, it's not the safest way to do it. To make it tolerant of different array lengths, especially since the BOM lengths themselves can vary from one encoding to the next, it would be safer to do something like this:

Function IsUtf16LittleEndian(data() As Byte) As Boolean
    Dim bom() As Byte = Encoding.Unicode.GetPreamble()
    Return data.Zip(bom, Function(x, y) x = y).All(Function(x) x)
End Function

So, the problem then becomes, how do you get a list of all the encodings? Well, it just so happens that the .NET Encoding class also provides a shared (static) method called GetEncodings which returns a list of all of the supported encoding objects. Therefore, you could create a method which loops through all of the encoding objects, gets the BOM of each one, and compares it to the byte array until you find one that matches. For instance:

Public Function DetectEncodingFromBom(data() As Byte) As Encoding
    Return Encoding.GetEncodings().
        Select(Function(info) info.GetEncoding()).
        FirstOrDefault(Function(enc) DataStartsWithBom(data, enc))
End Function

Private Function DataStartsWithBom(data() As Byte, enc As Encoding) As Boolean
    Dim bom() As Byte = enc.GetPreamble()
    If bom.Length <> 0 Then
        Return data.
            Zip(bom, Function(x, y) x = y).
            All(Function(x) x)
    Else
        Return False
    End If
End Function

Once you have a function like that, you can detect the encoding of a file like this:

Dim data() As Byte = File.ReadAllBytes("test.txt")
Dim detectedEncoding As Encoding = DetectEncodingFromBom(data)
If detectedEncoding Is Nothing Then
    Console.WriteLine("Unable to detect encoding")
Else
    Console.WriteLine(detectedEncoding.EncodingName)
End If

However, the problem remains: how do you automatically detect the correct encoding when there is no BOM? Technically, it's recommended that you don't place a BOM at the beginning of your data when using UTF-8, and there is no BOM defined for any of the ANSI code pages. So it's certainly not out of the realm of possibility that a text file may not have a BOM. If all the files that you deal with are in English, it's probably safe to assume that UTF-8 will suffice when no BOM is present. However, if any of the files happen to use something else, without a BOM, then that won't work.

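If that assumption fits your data, one option is to wrap the DetectEncodingFromBom function from above and simply fall back to UTF-8 whenever no BOM is found. This is only a sketch of that assumption, not real detection, and the function name is my own:

Public Function DetectEncodingOrAssumeUtf8(data() As Byte) As Encoding
    ' Use the BOM if one is present.
    Dim detected As Encoding = DetectEncodingFromBom(data)
    If detected IsNot Nothing Then
        Return detected
    End If
    ' No BOM: assume UTF-8 (a reasonable guess for English/ASCII-heavy text, but still a guess).
    Return New UTF8Encoding(False)
End Function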

As you correctly observed, there are applications that still automatically detect the encoding even when no BOM is present, but they do it through heuristics (i.e. educated guessing), and sometimes they are not accurate. Basically, they load the data using each encoding and then see if the data "looks" intelligible. This page offers some interesting insights on the problems inside the Notepad auto-detection algorithm. This page shows how you can tap into the COM-based auto-detection algorithm which Internet Explorer uses (in C#). There are also some C# libraries that people have written which attempt to auto-detect the encoding of a byte array, which you may find helpful.

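If you want to experiment with a very crude heuristic of your own, one possibility is to try decoding the bytes with a strict UTF-8 decoder and see whether it throws. Keep in mind this only proves that the data is valid UTF-8, not that UTF-8 was actually intended; it's a sketch (with a function name of my own), not a reliable detector, and it assumes an Imports System.Text:

Private Function LooksLikeValidUtf8(data() As Byte) As Boolean
    Try
        ' throwOnInvalidBytes:=True makes GetString throw on any byte sequence
        ' that is not well-formed UTF-8.
        Dim strictUtf8 As New UTF8Encoding(encoderShouldEmitUTF8Identifier:=False, throwOnInvalidBytes:=True)
        strictUtf8.GetString(data)
        Return True
    Catch ex As DecoderFallbackException
        Return False
    End Try
End Function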

Even though this question was for C#, you may also find the answers to it useful.
