C#: Reading a text file word by word

Notice: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/9740557/

Time: 2020-08-09 08:39:11  Source: igfitidea

Reading a text file word by word

c#

Asked by Matt

I have a text file containing just lowercase letters and no punctuation except for spaces. I would like to know the best way of reading the file char by char, so that if the next char is a space, it signifies the end of one word and the start of a new word. That is, as each character is read it is added to a string; if the next char is a space, the word is passed to another method and the string is reset, until the reader reaches the end of the file.

I'm trying to do this with a StringReader, something like this:

public String GetNextWord(StringReader reader)
{
    String word = "";
    char c;
    do
    {
        c = Convert.ToChar(reader.Read());
        word += c;
    } while (c != ' ');
    return word;
}

and put the GetNextWord method in a while loop till the end of the file. Does this approach make sense or are there better ways of achieving this?

Accepted answer by eouw0o83hf

There is a much better way of doing this: string.Split(). If you read the entire string in, C# can automatically split it on every space:

string[] words = reader.ReadToEnd().Split(' ');

The words array now contains all of the words in the file, and you can do whatever you want with them.

Additionally, you may want to investigate the File.ReadAllText method in the System.IO namespace; it may make importing files as text much easier.

Edit: I guess this assumes that your file is not abhorrently large; as long as the entire thing can be reasonably read into memory, this will work most easily. If you have gigabytes of data to read in, you'll probably want to shy away from this. I'd suggest using this approach though, if possible: it makes better use of the framework that you have at your disposal.
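
As a rough illustration of the approach above, the sketch below reads a whole file with File.ReadAllText and splits it on spaces. The SplitWords helper, the class name, and the command-line path argument are assumptions made for this example and are not part of the answer. StringSplitOptions.RemoveEmptyEntries is added so that repeated spaces do not produce empty strings.

```csharp
using System;
using System.IO;

static class SplitExample
{
    // Hypothetical helper: splits text on spaces and drops the empty
    // entries that consecutive spaces would otherwise produce.
    public static string[] SplitWords(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: SplitExample <path>");
            return;
        }

        // Loads the entire file into memory, so this suits modest file sizes only.
        string text = File.ReadAllText(args[0]);
        foreach (string word in SplitWords(text))
        {
            Console.WriteLine(word);
        }
    }
}
```

Note that this splits on the space character only, which matches the file described in the question; line breaks would need to be added to the separator list if the file contains them.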

Answer by Jon

First of all: StringReader reads from a string which is already in memory. This means that you have to load the input file in its entirety before you can read from it, which rather defeats the purpose of reading a few characters at a time; it can also be undesirable or even impossible if the input is very large.

The class to read from a text stream (which is an abstraction over a source of data) is StreamReader, and you might want to use that one instead. Now, StreamReader and StringReader share an abstract base class, TextReader, which means that if you code against TextReader you can have the best of both worlds.

TextReader's public interface will indeed support your example code, so I'd say it's a reasonable starting point. You just need to fix the one glaring bug: there is no check for Read returning -1 (which signifies the end of available data).
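
One way the fix described here might look is sketched below: GetNextWord is coded against TextReader and checks the result of Read for -1 before converting it to a char. Returning null at the end of the input, skipping repeated spaces, and the WordReader class name are assumptions chosen for this example rather than anything the answer prescribes.

```csharp
using System;
using System.IO;
using System.Text;

static class WordReader
{
    // Returns the next space-delimited word, or null once the reader is exhausted.
    public static string GetNextWord(TextReader reader)
    {
        var word = new StringBuilder();
        int next;
        while ((next = reader.Read()) != -1) // -1 signals the end of available data
        {
            char c = (char)next;
            if (c == ' ')
            {
                if (word.Length > 0)
                {
                    return word.ToString(); // a space ends the current word
                }
                continue; // skip leading or repeated spaces
            }
            word.Append(c);
        }
        return word.Length > 0 ? word.ToString() : null;
    }

    static void Main()
    {
        // A StringReader is used here; a StreamReader over a file works the same way.
        using (TextReader reader = new StringReader("the quick  brown fox"))
        {
            string word;
            while ((word = GetNextWord(reader)) != null)
            {
                Console.WriteLine(word);
            }
        }
    }
}
```

Because the method takes a TextReader, the same code can be exercised with an in-memory string in tests and with a StreamReader in production.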

Answer by Bryan Crosby

All in one line, here you go (assuming ASCII and perhaps not a 2 GB file):

var file = File.ReadAllText(@"C:\myfile.txt", Encoding.ASCII).Split(new[] { ' ' });

This returns a string array, which you can iterate over and do whatever you need with.

Answer by Andrew

This method splits the text into words when they are separated by one or more spaces (two spaces, for example).

// Requires: using System.IO; using System.Text.RegularExpressions;
string stringWithMultipleSpaces;
using (StreamReader streamReader = new StreamReader(filePath)) // open the file
{
    stringWithMultipleSpaces = streamReader.ReadToEnd(); // load the whole file into a string
}

Regex r = new Regex(" +"); // delimiter: one or more spaces
string[] words = r.Split(stringWithMultipleSpaces); // convert the string to an array of words

foreach (string w in words)
{
    MessageBox.Show(w); // MessageBox comes from System.Windows.Forms
}
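
One detail worth knowing about this technique: Regex.Split includes an empty string in its result when the input starts or ends with a delimiter. The sketch below is a variation rather than the answer's own code. It assumes you also want to treat tabs and line breaks as separators (the pattern \s+ instead of " +") and filters out the empty entries; the helper and class names are made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexSplitExample
{
    // Hypothetical helper: splits on runs of whitespace and removes the empty
    // strings Regex.Split returns when the input starts or ends with a delimiter.
    public static string[] SplitOnWhitespace(string text)
    {
        return Regex.Split(text, @"\s+")
                    .Where(w => w.Length > 0)
                    .ToArray();
    }

    static void Main()
    {
        foreach (string word in SplitOnWhitespace("  the quick   brown\nfox "))
        {
            Console.WriteLine(word);
        }
    }
}
```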

Answer by Eugene

I would do something like this:

IEnumerable<string> ReadWords(StreamReader reader)
{
    string line;
    while((line = reader.ReadLine())!=null)
    {
        foreach(string word in line.Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries))
        {
            yield return word;
        }
    }
}

If you use File.ReadAllText (or reader.ReadToEnd), the entire file is loaded into memory, so you may get an OutOfMemoryException and a lot of other problems.
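
A small usage sketch of this iterator follows. It is written against TextReader instead of StreamReader so that it can be tried with an in-memory string; that change, the LazyWords class name, and the sample text are assumptions made for the example. Words are produced lazily, one line at a time, as the foreach loop asks for them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LazyWords
{
    // Yields words one at a time; only the current line is held in memory.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                yield return word;
            }
        }
    }

    static void Main()
    {
        // For a real file, File.OpenText(path) returns a StreamReader that can be passed in instead.
        using (TextReader reader = new StringReader("one  two\nthree"))
        {
            foreach (string word in ReadWords(reader))
            {
                Console.WriteLine(word);
            }
        }
    }
}
```

Keep in mind that ReadLine still reads a whole line at once, so a file consisting of one very long line would be loaded in a single piece.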

Answer by Tim Schmelter

If you're interested in good performance even on very large files, you should have a look at the MemoryMappedFile class (new in .NET 4.0).

For example:

using (var mappedFile1 = MemoryMappedFile.CreateFromFile(filePath))
{
    using (Stream mmStream = mappedFile1.CreateViewStream())
    {
        using (StreamReader sr = new StreamReader(mmStream, ASCIIEncoding.ASCII))
        {
            while (!sr.EndOfStream)
            {
                var line = sr.ReadLine();
                var lineWords = line.Split(' ');
            }
        }  
    }
}

From MSDN:

A memory-mapped file maps the contents of a file to an application's logical address space. Memory-mapped files enable programmers to work with extremely large files because memory can be managed concurrently, and they allow complete, random access to a file without the need for seeking. Memory-mapped files can also be shared across multiple processes.

The CreateFromFile methods create a memory-mapped file from a specified path or a FileStream of an existing file on disk. Changes are automatically propagated to disk when the file is unmapped.

The CreateNew methods create a memory-mapped file that is not mapped to an existing file on disk; and are suitable for creating shared memory for interprocess communication (IPC).

A memory-mapped file is associated with a name.

You can create multiple views of the memory-mapped file, including views of parts of the file. You can map the same part of a file to more than one address to create concurrent memory. For two views to remain concurrent, they have to be created from the same memory-mapped file. Creating two file mappings of the same file with two views does not provide concurrency.
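
A possible refinement, offered as a sketch under stated assumptions: because views are mapped in units of system pages, a view stream created without an explicit size can be longer than the file, and a StreamReader may then see padding bytes after the real content. The example below passes the file length to CreateViewStream to avoid that. The temporary file, its content, and the MappedWords.ReadWords helper are invented for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

static class MappedWords
{
    // Hypothetical helper: reads space-separated words from a non-empty file
    // through a memory-mapped view limited to the file's actual length.
    public static List<string> ReadWords(string path)
    {
        var words = new List<string>();
        long length = new FileInfo(path).Length;
        using (var mapped = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (Stream view = mapped.CreateViewStream(0, length))
        using (var reader = new StreamReader(view, Encoding.ASCII))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                words.AddRange(line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
            }
        }
        return words;
    }

    static void Main()
    {
        // Temporary file with known content, created only for this demonstration.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "alpha beta\ngamma", Encoding.ASCII);
            foreach (string word in ReadWords(path))
            {
                Console.WriteLine(word);
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```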

Answer by MaticDiba

If you want to read it without splitting the string (for example, because the lines are so long that you might encounter an OutOfMemoryException), you should do it like this (using a StreamReader):

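// Requires: using System.IO; using System.Text;
// Reads characters up to the next whitespace character (which is consumed) or the end of the stream.
// The method name ReadWord is illustrative; the original snippet showed only the loop body.
static string ReadWord(StreamReader sr)
{
    var word = new StringBuilder();
    while (sr.Peek() >= 0) // Peek returns -1 when there is nothing left to read
    {
        char c = (char)sr.Read();
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
        {
            break; // whitespace ends the current word
        }
        word.Append(c);
    }
    return word.ToString(); // empty if the first character read was whitespace
}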

Answer by AnkUser

I created a simple console program for your exact requirement, using the files you mentioned. It should be easy to run and check. Please find the code below. Hope this helps.

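using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)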
    {

        string[] input = File.ReadAllLines(@"C:\Users\achikhale\Desktop\file.txt");
        string[] array1File = File.ReadAllLines(@"C:\Users\achikhale\Desktop\array1.txt");
        string[] array2File = File.ReadAllLines(@"C:\Users\achikhale\Desktop\array2.txt");

        List<string> finalResultarray1File = new List<string>();
        List<string> finalResultarray2File = new List<string>();

        foreach (string inputstring in input)
        {
            string[] wordTemps = inputstring.Split(' ');

            foreach (string array1Filestring in array1File)
            {
                string[] word1Temps = array1Filestring.Split(' ');

                var result = word1Temps.Where(y => !string.IsNullOrEmpty(y) && wordTemps.Contains(y)).ToList();

                if (result.Count > 0)
                {
                    finalResultarray1File.AddRange(result);
                }

            }

        }

        foreach (string inputstring in input)
        {
            string[] wordTemps = inputstring.Split(' ');

            foreach (string array2Filestring in array2File)
            {
                string[] word1Temps = array2Filestring.Split(' ');

                var result = word1Temps.Where(y => !string.IsNullOrEmpty(y) && wordTemps.Contains(y)).ToList();

                if (result.Count > 0)
                {
                    finalResultarray2File.AddRange(result);
                }

            }

        }

        if (finalResultarray1File.Count > 0)
        {
            Console.WriteLine("file array1.txt contains words: {0}", string.Join(";", finalResultarray1File));
        }

        if (finalResultarray2File.Count > 0)
        {
            Console.WriteLine("file array2.txt contains words: {0}", string.Join(";", finalResultarray2File));
        }

        Console.ReadLine();

    }
}

Answer by live-love

This code will extract words from a text file based on the Regex pattern. You can try playing with other patterns to see what works best for you.

    // Requires: using System; using System.IO; using System.Text.RegularExpressions;
    var pattern = new Regex(
              @"( [^\W_\d]              # starting with a letter
                                        # followed by a run of either...
                  ( [^\W_\d] |          #   more letters or
                    [-'\d](?=[^\W_\d])  #   ', -, or digit followed by a letter
                  )*
                  [^\W_\d]              # and finishing with a letter
                )",
              RegexOptions.IgnorePatternWhitespace);

    string input;
    using (StreamReader reader = new StreamReader(fileName))
    {
        input = reader.ReadToEnd(); // read the whole file, then release it
    }

    foreach (Match m in pattern.Matches(input))
        Console.WriteLine("{0}", m.Groups[1].Value);
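
To get a feel for what this pattern accepts, here is a small sketch that runs the same expression (written on one line) against an in-memory string instead of a file. The PatternDemo class, the ExtractWords helper, and the sample sentence are assumptions for illustration. Since the pattern requires both a starting and a finishing letter, single-letter words such as "a" are not matched.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PatternDemo
{
    // Same pattern as in the answer, condensed onto one line.
    static readonly Regex Pattern = new Regex(
        @"( [^\W_\d] ( [^\W_\d] | [-'\d](?=[^\W_\d]) )* [^\W_\d] )",
        RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: collects every match of the pattern in the input.
    public static List<string> ExtractWords(string input)
    {
        var words = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            words.Add(m.Groups[1].Value);
        }
        return words;
    }

    static void Main()
    {
        // Prints don't, stop and rock-n-roll on separate lines; "a" is skipped.
        foreach (string word in ExtractWords("don't stop a rock-n-roll"))
        {
            Console.WriteLine(word);
        }
    }
}
```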