Notice: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/858756/

Date: 2020-08-05 04:31:40  Source: igfitidea

How to parse a text file with C#

Tags: c#, parsing, text

Asked by Ivan Prodanov

By text formatting I meant something more complicated.

At first I began manually adding the 5000 lines from the text file I'm asking about into my project.

The text file has 5000 lines of varying length. For example:

1   1   ITEM_ETC_GOLD_01    ??(?)   xxx xxx xxx_TT_DESC 0   0   3   3   5   0   180000  3   0   1   0   0   255 1   1   0   0   0   0   0   0   0   0   0   0   -1  0   -1  0   -1  0   -1  0   -1  0   0   0   0   0   0   0   100 0   0   0   xxx item\etc\drop_ch_money_small.bsr    xxx xxx xxx 0   2   0   0   1   0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0   0   0   0   0   0   0   0   0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1   ??? ??? ?(param1??) -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx 0   0

1   4   ITEM_ETC_HP_POTION_01   HP ?? ??    xxx SN_ITEM_ETC_HP_POTION_01    SN_ITEM_ETC_HP_POTION_01_TT_DESC    0   0   3   3   1   1   180000  3   0   1   1   1   255 3   1   0   0   1   0   60  0   0   0   1   21  -1  0   -1  0   -1  0   -1  0   -1  0   0   0   0   0   0   0   100 0   0   0   xxx item\etc\drop_ch_bag.bsr    item\etc\hp_potion_01.ddj   xxx xxx 50  2   0   0   1   0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0   0   0   0   0   0   0   0   0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 120 HP???   0   HP???(%)    0   MP???   0   MP???(%)    -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx 0   0

1   5   ITEM_ETC_HP_POTION_02   HP ??? (?)  xxx SN_ITEM_ETC_HP_POTION_02    SN_ITEM_ETC_HP_POTION_02_TT_DESC    0   0   3   3   1   1   180000  3   0   1   1   1   255 3   1   0   0   1   0   110 0   0   0   2   39  -1  0   -1  0   -1  0   -1  0   -1  0   0   0   0   0   0   0   100 0   0   0   xxx item\etc\drop_ch_bag.bsr    item\etc\hp_potion_02.ddj   xxx xxx 50  2   0   0   2   0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0   0   0   0   0   0   0   0   0   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 220 HP???   0   HP???(%)    0   MP???   0   MP???(%)    -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx -1  xxx 0   0

The separator between the first number (1) and the second number (1/4/5) is not a space; it's a tab. There are no spaces in that text file.

What I want:

I want to get the second integer (in the three lines I posted above, the second integers are 1, 4 and 5) and the string in the middle of each line that indicates the path (it starts with "item\" and ends with the file extension ".ddj").

My problem:

When I google "Text formatting C#", all I get is how to open a text file and how to write a text file in C#. I don't know how to search for text inside a text file. I also can't simply search for the integer, because if it's a small integer like the ones in the three lines I posted above, I won't be able to find the correct location: "1", for example, might also appear elsewhere in the line.

My question:

Ideally, I would write a program that deletes everything except what I need.

The other approach I have in mind is to search inside the file directly, but as I mentioned above, I might get the wrong location for the second integer if it's too small.

Please suggest something; I can't format all of this by hand.

Accepted answer by Samir Talwar

OK, here's what we do: open the file, read it line by line, and split it by tabs. Then we grab the second integer and loop through the rest to find the path.

using System.IO;

// The using block closes the file even if parsing throws.
using (StreamReader reader = File.OpenText("filename.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] items = line.Split('\t');
        int myInteger = int.Parse(items[1]);   // Here's your integer.

        // Now let's find the path.
        string path = null;
        foreach (string item in items)
        {
            // "item\\" is the escaped form of the literal text item\
            if (item.StartsWith("item\\") && item.EndsWith(".ddj"))
                path = item;
        }

        // At this point, `myInteger` and `path` contain the values we want
        // for the current line. We can then store those values or print them,
        // or anything else we like. `path` stays null if no field matched.
    }
}
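
The loop above assumes every line has a numeric second field and a ".ddj" field, yet the first sample line in the question has no ".ddj" entry at all. Here is a minimal sketch of the same split-and-scan idea wrapped in a helper that reports failure instead of throwing. The class and method names (`ItemLineParser`, `TryParseLine`) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Globalization;

public static class ItemLineParser
{
    // Hypothetical helper: returns true only when the line has a numeric
    // second field AND some field of the form item\....ddj.
    public static bool TryParseLine(string line, out int id, out string path)
    {
        id = 0;
        path = null;

        string[] fields = line.Split('\t');
        if (fields.Length < 2 ||
            !int.TryParse(fields[1], NumberStyles.Integer,
                          CultureInfo.InvariantCulture, out id))
        {
            return false; // too few fields, or second field is not an integer
        }

        foreach (string field in fields)
        {
            if (field.StartsWith(@"item\", StringComparison.Ordinal) &&
                field.EndsWith(".ddj", StringComparison.Ordinal))
            {
                path = field; // first matching field wins
                return true;
            }
        }

        return false; // integer was fine, but no .ddj path on this line
    }
}
```

Calling `ItemLineParser.TryParseLine(line, out int id, out string path)` inside the read loop lets you skip or log lines that don't fit, rather than crashing on the first odd one.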

Answer by erikkallen

You could do something like:

using (TextReader rdr = OpenYourFile()) {
    string line;
    while ((line = rdr.ReadLine()) != null) {
        string[] fields = line.Split('\t'); // THIS LINE DOES THE MAGIC
        int theInt = Convert.ToInt32(fields[1]);
    }
}

The reason you didn't find relevant results when searching for 'formatting' is that the operation you are performing is called 'parsing'.

Answer by Marc Vitalis

Try regular expressions. You can find a certain pattern in your text and replace it with whatever you want. I can't give you the exact code right now, but you can test your expressions using this tool:

http://www.radsoftware.com.au/regexdesigner/

Answer by Justin Ethier

You could open the file and use StreamReader.ReadLine to read it line by line. Then you can use String.Split to break each line into pieces (with \t as the delimiter) and extract the second number.

Because the number of fields differs from line to line, you can't rely on a fixed position for the path; you would need to search the line for the pattern 'item\*.ddj'.

To delete an item you could (for example) keep all of the file's contents in memory and write out a new file when the user clicks 'Save'.
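
As a rough sketch of the "write out a new file" idea, the following keeps only the second integer and the ".ddj" path from each line and writes them to a separate file. Note that it streams the lines rather than holding the whole file in memory, which is a swapped-in variation on the suggestion above. `ItemFileTrimmer`, `KeepIdAndPath`, `Trim` and the output layout (integer, tab, path) are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ItemFileTrimmer
{
    // Hypothetical helper: maps each input line to "<second field>\t<path>"
    // and drops lines that lack a second field or an item\....ddj field.
    public static IEnumerable<string> KeepIdAndPath(IEnumerable<string> lines)
    {
        foreach (string line in lines)
        {
            string[] fields = line.Split('\t');
            if (fields.Length < 2)
                continue;

            // Array.Find returns null when no field matches.
            string path = Array.Find(fields, f =>
                f.StartsWith(@"item\", StringComparison.Ordinal) &&
                f.EndsWith(".ddj", StringComparison.Ordinal));

            if (path != null)
                yield return fields[1] + "\t" + path;
        }
    }

    // inputPath and outputPath must be different files, because the input
    // is still being read lazily while the output is written.
    public static void Trim(string inputPath, string outputPath)
    {
        File.WriteAllLines(outputPath, KeepIdAndPath(File.ReadLines(inputPath)));
    }
}
```

A call such as `ItemFileTrimmer.Trim("filename.txt", "trimmed.txt")` (both names are placeholders) leaves the original file untouched, so nothing is lost if the filter turns out to be wrong.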

Answer by Samir Talwar

Another solution, this time making use of regular expressions:

using System.IO;
using System.Text.RegularExpressions;

...

// In a verbatim string, \\ is how the regex engine is told to match one literal backslash.
Regex parts = new Regex(@"^\d+\t(\d+)\t.+?\t(item\\[^\t]+\.ddj)");

StreamReader reader = File.OpenText("filename.txt");
string line;
while ((line = reader.ReadLine()) != null) {
    Match match = parts.Match(line);
    if (match.Success) {
        int number = int.Parse(match.Groups[1].Value);
        string path = match.Groups[2].Value;

        // At this point, `number` and `path` contain the values we want
        // for the current line. We can then store those values or print them,
        // or anything else we like.
    }
}

That expression's a little complex, so here it is broken down:

^        Start of string
\d+      "\d" means "digit" - 0-9. The "+" means "one or more."
         So this means "one or more digits."
\t       This matches a tab.
(\d+)    This also matches one or more digits. This time, though, we capture it
         using parentheses. This means we can access it through the Groups property.
\t       Another tab.
.+?      "." means "anything." So "one or more of anything". In addition, it's lazy.
         This is to stop it grabbing everything in sight - it'll only grab as much
         as it needs to for the regex to work.
\t       Another tab.

(item\\[^\t]+\.ddj)
    Here's the meat. "\\" matches a single literal backslash, so this matches:
    "item\<one or more of anything but a tab>.ddj"
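
To see the pattern at work, here is a small self-contained sketch that applies it to a single line. The wrapper names (`RegexLineParser`, `TryExtract`) are hypothetical, and the sample line in the usage note is a shortened, made-up line in the same tab-separated layout as the question's data.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RegexLineParser
{
    // Same shape as the pattern broken down above. Because this is a
    // verbatim string, \\ reaches the regex engine as an escaped backslash.
    private static readonly Regex Parts =
        new Regex(@"^\d+\t(\d+)\t.+?\t(item\\[^\t]+\.ddj)");

    // Hypothetical helper: true when the line matches and group 1 fits an int.
    public static bool TryExtract(string line, out int number, out string path)
    {
        number = 0;
        path = null;

        Match match = Parts.Match(line);
        if (!match.Success)
            return false;

        if (!int.TryParse(match.Groups[1].Value, NumberStyles.None,
                          CultureInfo.InvariantCulture, out number))
            return false; // digits only, but too large for an int

        path = match.Groups[2].Value;
        return true;
    }
}
```

For the line `"1\t4\tITEM_ETC_HP_POTION_01\txxx\titem\\etc\\drop_ch_bag.bsr\titem\\etc\\hp_potion_01.ddj\txxx"` (written as a C# string literal), the lazy `.+?` skips past the ".bsr" field because `[^\t]+` cannot cross a tab, so the captured path is the ".ddj" field and the captured number is 4.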

Answer by Vin

As already mentioned, I would highly recommend using regular expressions (in the System.Text.RegularExpressions namespace) to get this kind of job done.

Combined with a solid tool like RegexBuddy, regular expressions let you handle complex text-record parsing and get results quickly. The tool makes it really easy.

Hope that helps.

Answer by Mark Green

One approach I've found really useful in situations like this is to go old-school and use the Jet OLEDB provider, together with a schema.ini file, to read large tab-delimited files through ADO.NET. Obviously, this method is only useful if you know the format of the file to be imported.

public void ImportCsvFile(string filename)
{
    FileInfo file = new FileInfo(filename);

    using (OleDbConnection con = 
            new OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=\"" +
            file.DirectoryName + "\";" +
            "Extended Properties='text;HDR=Yes;FMT=TabDelimited';"))
    {
        using (OleDbCommand cmd = new OleDbCommand(string.Format
                                  ("SELECT * FROM [{0}]", file.Name), con))
        {
            con.Open();

            // Using a DataReader to process the data
            using (OleDbDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Process the current reader entry...
                }
            }

            // Using a DataTable to process the data
            using (OleDbDataAdapter adp = new OleDbDataAdapter(cmd))
            {
                DataTable tbl = new DataTable("MyTable");
                adp.Fill(tbl);

                foreach (DataRow row in tbl.Rows)
                {
                    // Process the current row...
                }
            }
        }
    }
} 
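
The schema.ini file mentioned above lives in the same folder as the data file and describes it to the text driver. A minimal sketch might look like the following; the section name must be the data file's name (`filename.txt` is a placeholder), and `ColNameHeader=False` reflects my assumption, based on the sample lines in the question, that the file has no header row.

```ini
[filename.txt]
Format=TabDelimited
ColNameHeader=False
```

Without a header row the driver assigns generic column names, so it is worth inspecting the resulting DataTable's columns before relying on any particular name.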

Once you have the data in a convenient form such as a DataTable, filtering out the data you need becomes pretty trivial.