C#: Decode a file stream using UTF-8

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/876399/

Time: 2020-08-05 05:04:25  Source: igfitidea

decode a file stream using UTF-8

Tags: c#, validation, encoding, utf-8

Asked by George2

I have an XML document that is very big (about 120 MB), and I do not want to load it into memory all at once. My goal is to check whether the file is valid UTF-8.

Any ideas for a quick check that does not read the whole file into memory as a byte[]?

I am using VSTS 2008 and C#.

When I use XmlDocument to load an XML document that contains invalid byte sequences, an exception is thrown. But when I read all of the content into a byte array and then decode it as UTF-8, no exception is thrown. Any ideas?

Here is a screenshot showing the content of my XML file, or you can download a copy of the file from here

EDIT 1:

using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    public static byte[] RawReadingTest(string fileName)
    {
        byte[] buff = null;

        try
        {
            // Dispose the stream and reader as soon as the bytes have been read.
            using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            using (BinaryReader br = new BinaryReader(fs))
            {
                buff = br.ReadBytes((int)fs.Length);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }

        return buff;
    }

    static void XMLTest()
    {
        try
        {
            XmlDocument xDoc = new XmlDocument();
            xDoc.Load(@"c:\abc.xml"); // verbatim string: "\a" would otherwise be an escape sequence
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
    }

    static void Main()
    {
        try
        {
            XMLTest();
            Encoding ae = Encoding.GetEncoding("utf-8");
            string filename = @"c:\abc.xml"; // verbatim string: "\a" would otherwise be an escape sequence
            ae.GetString(RawReadingTest(filename));
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }

        return;
    }
}
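The difference described above (XmlDocument throws, the byte-array check does not) is consistent with how the two encodings treat malformed input: the encoding returned by Encoding.GetEncoding("utf-8") replaces invalid bytes with U+FFFD, while a UTF8Encoding built with throwOnInvalidBytes = true raises a DecoderFallbackException. The following is only a minimal sketch; the class name StrictVsLenient, the helper Throws, and the sample bytes are made up for illustration and are not taken from the question's file.

```csharp
using System;
using System.Text;

static class StrictVsLenient
{
    // Hypothetical helper: true if decoding the bytes with this encoding throws.
    public static bool Throws(Encoding encoding, byte[] bytes)
    {
        try
        {
            encoding.GetString(bytes);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }

    public static void Main()
    {
        // 0xC3 opens a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] invalid = { 0x61, 0xC3, 0x28, 0x62 };

        Encoding lenient = Encoding.GetEncoding("utf-8"); // replacement fallback
        Encoding strict = new UTF8Encoding(false, true);  // throwOnInvalidBytes = true

        Console.WriteLine(Throws(lenient, invalid)); // False
        Console.WriteLine(Throws(strict, invalid));  // True
    }
}
```

This only shows the in-memory case; it does not address the 120 MB size constraint from the question.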

EDIT 2: When using new UTF8Encoding(true, true) there is an exception, but when using new UTF8Encoding(false, true) no exception is thrown. I am confused, because it should be the 2nd parameter that controls whether an exception is thrown (if there are invalid byte sequences). Why does the 1st parameter matter?

    public static void TestTextReader2()
    {
        try
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            using (StreamReader sr = new StreamReader(
                @"c:\a.xml", // verbatim string: "\a" would otherwise be an escape sequence
                new UTF8Encoding(true, true)
                ))
            {
                int bufferSize = 10 * 1024 * 1024; //could be anything
                char[] buffer = new char[bufferSize];
                // Read from the file until the end of the file is reached.
                int actualsize = sr.Read(buffer, 0, bufferSize);
                while (actualsize > 0)
                {
                    actualsize = sr.Read(buffer, 0, bufferSize);
                }
            }
        }
        catch (Exception e)
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }

    }

Answered by Anton Tykhyy

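var buffer = new char[32768];

// The second argument (throwOnInvalidBytes = true) makes decoding throw on malformed UTF-8.
using (var stream = new StreamReader(pathToFile,
    new UTF8Encoding(true, true)))
{
    while (true)
    {
        try
        {
            // Read returns 0 at end of stream, so everything decoded cleanly.
            if (stream.Read(buffer, 0, buffer.Length) == 0)
                return GoodUTF8File;
        }
        catch (ArgumentException)
        {
            // DecoderFallbackException derives from ArgumentException.
            return BadUTF8File;
        }
    }
}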
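For readers who want something they can compile, here is a hedged, self-contained sketch of the same streaming idea. The class Utf8FileCheck, the method IsValidUtf8, and the temporary test files are hypothetical names introduced here; the sketch also assumes that BOM-based encoding detection should be switched off, so that the strict UTF-8 decoder is always the one in use.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8FileCheck
{
    // Hypothetical helper: true if the whole file decodes as strict UTF-8.
    // Only one 32K-char buffer is held in memory at a time.
    public static bool IsValidUtf8(string path)
    {
        var buffer = new char[32768];
        var strict = new UTF8Encoding(false, true); // throwOnInvalidBytes = true

        // Third argument: detectEncodingFromByteOrderMarks = false (an assumption of this sketch).
        using (var reader = new StreamReader(path, strict, false))
        {
            try
            {
                while (reader.Read(buffer, 0, buffer.Length) > 0)
                {
                    // Decoded another chunk without error; nothing to keep.
                }
                return true;
            }
            catch (DecoderFallbackException)
            {
                return false;
            }
        }
    }

    public static void Main()
    {
        string good = Path.GetTempFileName();
        string bad = Path.GetTempFileName();
        File.WriteAllBytes(good, Encoding.UTF8.GetBytes("<a>\u00e9</a>"));
        File.WriteAllBytes(bad, new byte[] { 0x3C, 0x61, 0x3E, 0xC3, 0x28 });

        Console.WriteLine(IsValidUtf8(good)); // True
        Console.WriteLine(IsValidUtf8(bad));  // False

        File.Delete(good);
        File.Delete(bad);
    }
}
```

Returning a bool instead of the GoodUTF8File / BadUTF8File placeholders is a choice made for this sketch only.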

Answered by ChrisW

@George2 I think they mean a solution like the following (which I haven't tested).

Handling the transition between buffers (i.e. caching extra bytes or partial characters between reads) is the responsibility of StreamReader and an internal detail of its implementation.

using System;
using System.IO;
using System.Text;

class Test 
{
    public static void Main() 
    {
        try 
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            using (StreamReader sr = new StreamReader(
                "TestFile.txt",
                new UTF8Encoding(true, true) // strict: throws on invalid bytes (Encoding.UTF8 would not)
                ))
            {
                const int bufferSize = 1000; //could be anything
                char[] buffer = new char[bufferSize];
                // Read from the file until the end of the file is reached.
                // Read takes (buffer, index, count) and returns 0 only at end of stream.
                while (sr.Read(buffer, 0, bufferSize) > 0)
                {
                    // successfully decoded another buffer's worth of data
                }
            }
        }
        catch (Exception e) 
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }
    }
}
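The point about partial characters between reads can be made visible by using a Decoder directly, which is stateful in the same way: a multi-byte sequence split across two chunks is held back until its remaining bytes arrive. This is a minimal sketch, not a description of StreamReader's actual internals; SplitSequenceDemo, DecodeInChunks, and the chunk size are hypothetical names and values chosen for illustration.

```csharp
using System;
using System.Text;

static class SplitSequenceDemo
{
    // Decodes the input in fixed-size byte chunks using one stateful Decoder.
    public static string DecodeInChunks(byte[] bytes, int chunkSize)
    {
        var encoding = new UTF8Encoding(false, true); // throwOnInvalidBytes = true
        Decoder decoder = encoding.GetDecoder();
        var chars = new char[encoding.GetMaxCharCount(chunkSize)];
        var result = new StringBuilder();

        for (int offset = 0; offset < bytes.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, bytes.Length - offset);
            // Flush only on the last chunk so a trailing partial sequence is reported.
            bool flush = offset + count == bytes.Length;
            int written = decoder.GetChars(bytes, offset, count, chars, 0, flush);
            result.Append(chars, 0, written);
        }
        return result.ToString();
    }

    public static void Main()
    {
        // "é" is encoded as 0xC3 0xA9; a chunk size of 2 splits it across two reads.
        byte[] bytes = Encoding.UTF8.GetBytes("a\u00e9b");
        Console.WriteLine(DecodeInChunks(bytes, 2) == "a\u00e9b"); // True
    }
}
```

Calling Encoding.GetString on each chunk separately would instead treat the split sequence as invalid, which is why a single reader or decoder should be kept for the whole stream.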

Answered by Sajay

Would this not work?

StreamReader reader = new StreamReader(file);

Console.WriteLine(reader.CurrentEncoding.ToString()); //You get the default encoding
reader.Read();

Console.WriteLine(reader.CurrentEncoding.ToString()); //You get the right encoding. 
reader.Close();

If not, can someone help explain why?