C#: Most efficient way to process a large csv in .NET

Disclaimer: this page is a translated mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me), citing the original: http://stackoverflow.com/questions/14342560/


Most efficient way to process a large csv in .NET

Tags: c#, .net, vb.net, search, csv

Asked by user1981003

Forgive my noobiness but I just need some guidance and I can't find another question that answers this. I have a fairly large csv file (~300k rows) and I need to determine for a given input, whether any line in the csv begins with that input. I have sorted the csv alphabetically, but I don't know:


1) how to process the rows in the csv: should I read it in as a list/collection, or use OLEDB, an embedded database, or something else?


2) how to find something efficiently from an alphabetical list (using the fact that it's sorted to speed things up, rather than searching the whole list)


Accepted answer by Louis Ricci

You don't give enough specifics to give you a concrete answer, but...




IF the CSV file changes often then use OLEDB and just change the SQL query based on your input.


string sql = @"SELECT * FROM [" + fileName + "] WHERE Column1 LIKE 'blah%'";
using(OleDbConnection connection = new OleDbConnection(
          @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + fileDirectoryPath + 
          ";Extended Properties=\"Text;HDR=" + hasHeaderRow + "\""))


IF the CSV file doesn't change often and you run a lot of "queries" against it, load it once into memory and quickly search it each time.


IF you want your search to be an exact match on a column, use a Dictionary where the key is the column you want to match on and the value is the row data.


Dictionary<long, string> Rows = new Dictionary<long, string>();
...
if(Rows.ContainsKey(search)) ...

IF you want your search to be a partial match like StartsWith, then have one array containing your searchable data (i.e. the first column) and another list or array containing your row data. Then use C#'s built-in binary search: http://msdn.microsoft.com/en-us/library/2cy9f6wb.aspx


string[] SortedSearchables;                    // sorted keys (e.g. first column); see the population sketch below
List<string> SortedRows = new List<string>();  // full rows, in the same order as the keys
...
string result = null;
int foundIdx = Array.BinarySearch<string>(SortedSearchables, searchTerm);
if(foundIdx < 0) {
    foundIdx = ~foundIdx; // complement of a negative result is the insertion point
    if(foundIdx < SortedRows.Count && SortedSearchables[foundIdx].StartsWith(searchTerm)) {
        result = SortedRows[foundIdx];
    }
} else {
    result = SortedRows[foundIdx];
}

NOTE: code was written inside the browser window and may contain syntax errors as it wasn't tested.

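For completeness, here is a minimal sketch of how the two parallel collections above might be populated. It assumes the first CSV column is the searchable key, the file is already sorted on it, and a naive comma split is good enough (no quoted fields); the file path is a placeholder.

// using System.Collections.Generic; using System.IO;
List<string> keys = new List<string>();
List<string> rows = new List<string>();
foreach (string line in File.ReadLines(@"c:\file.csv"))
{
    keys.Add(line.Split(',')[0]); // first column is the search key
    rows.Add(line);               // full row at the same index as its key
}
string[] SortedSearchables = keys.ToArray();
List<string> SortedRows = rows;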

Answer by Dave Bish

Given the CSV is sorted, if you can load the entire thing into memory (if the only processing you need to do is a .StartsWith() on each line), you can use a binary search to have exceptionally fast searching.


Maybe something like this (NOT TESTED!):


var csv = File.ReadAllLines(@"c:\file.csv").ToList();
var exists = csv.BinarySearch("StringToFind", new StartsWithComparer());

...

...

public class StartsWithComparer : IComparer<string>
{
    // List<T>.BinarySearch passes the list element as x and the value
    // being searched for as y, so a line that starts with the search
    // string compares as equal (a "match").
    public int Compare(string x, string y)
    {
        if (x.StartsWith(y))
            return 0;
        else
            return x.CompareTo(y);
    }
}

Answer by Matthias

Try the free CSV Reader. No need to reinvent the wheel over and over again ;)


1) If you do not need to store the results, just iterate through the CSV: handle each line and forget it. If you need to process all lines again and again, store them in a List or Dictionary (with a good key, of course)


2) Try the generic extension methods, like this:


var list = new List<string>() { "a", "b", "c" };
string oneA = list.FirstOrDefault(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
IEnumerable<string> allAs = list.Where(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));

Answer by Steven Doggart

If you can cache the data in memory, and you only need to search the list on one primary key column, I would recommend storing the data in memory as a Dictionary object. The Dictionary class stores the data as key/value pairs in a hash table. You could use the primary key column as the key in the dictionary, and then use the rest of the columns as the value in the dictionary. Looking up items by key in a hash table is typically very fast.


For instance, you could load the data into a dictionary, like this:


// using Microsoft.VisualBasic.FileIO; (reference the Microsoft.VisualBasic assembly)
Dictionary<string, string[]> data = new Dictionary<string, string[]>();
using (TextFieldParser parser = new TextFieldParser(@"C:\test.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    while (!parser.EndOfData)
    {
        try
        {
            string[] fields = parser.ReadFields();
            data[fields[0]] = fields; // first column is the primary key
        }
        catch (MalformedLineException ex)
        {
            // ...
        }
    }
}

And then you could get the data for any item, like this:


string[] fields = data["key I'm looking for"];
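
If the key might be missing, Dictionary.TryGetValue avoids a KeyNotFoundException. A minimal sketch against the same data dictionary:

string[] fields;
if (data.TryGetValue("key I'm looking for", out fields))
{
    // fields[0] is the key column; the rest hold the remaining columns
}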

Answer by Adriano Repetti

If your file is in memory (for example because you did sorting) and you keep it as an array of strings (lines), then you can use a simple bisection search method. You can start with the code from this question on CodeReview; just change the comparer to work with string instead of int and to check only the beginning of each line.


If you have to re-read the file each time, because it may be changed or it's saved/sorted by another program, then the simplest algorithm is the best one:


using (var stream = File.OpenText(path))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        // Replace this with your comparison / CSV splitting
        if (line.StartsWith("..."))
        {
            // The file contains a line with the required input
            break;
        }
    }
}

Of course you may read the entire file into memory (to use LINQ or List<T>.BinarySearch()) each time, but this is far from optimal (you'll read everything even if you may need to examine just a few lines) and the file itself could even be too large.


If you really need something more, and you do not have your file in memory because of sorting (but you should profile your actual performance against your requirements), you have to implement a better search algorithm, for example the Boyer-Moore algorithm.


Answer by China Syndrome

Here is my VB.net code. It is for a quote-qualified CSV, so for a regular CSV, change Let n = P.Split(New Char() {""","""}) to Let n = P.Split(New Char() {","})


Dim path As String = "C:\linqpad\Patient.txt"
Dim pat = System.IO.File.ReadAllLines(path)
Dim Patz = From P In pat _
    Let n = P.Split(New Char() {""","""}) _
    Order By n(5) _
    Select New With {
        .Doc = n(1), _
        .Loc = n(3), _
        .Chart = n(5), _
        .PatientID = n(31), _
        .Title = n(13), _
        .FirstName = n(9), _
        .MiddleName = n(11), _
        .LastName = n(7), _
        .StatusID = n(41) _
        }
Patz.Dump()

Answer by paparazzo

The OP stated they really just need to search based on the line.


The question then is whether or not to hold the lines in memory.


If a line is 1 KB, then that's about 300 MB of memory.
If a line is 1 MB, then that's about 300 GB of memory.


StreamReader.ReadLine will have a low memory profile.
Since the file is sorted, you can stop looking once the current line compares greater than the input.

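A minimal sketch of that early-exit scan, assuming the file is sorted in ordinal order (the path and search values are placeholders):

// using System; using System.IO;
static bool AnyLineStartsWith(string path, string search)
{
    foreach (string line in File.ReadLines(path))
    {
        if (line.StartsWith(search, StringComparison.Ordinal))
            return true;  // a line begins with the input
        if (string.CompareOrdinal(line, search) > 0)
            return false; // sorted: every later line is greater, so stop
    }
    return false;
}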

If you hold it in memory, then a simple


List<String> 

with LINQ will work.
LINQ is not smart enough to take advantage of the sort, but against 300K rows it would still be pretty fast.


BinarySearch will take advantage of the sort.

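A minimal sketch of both options, assuming the whole file fits in memory and is sorted ordinally (the path and input are placeholders):

// using System; using System.Collections.Generic; using System.IO; using System.Linq;
string input = "...";  // placeholder search term
List<string> lines = File.ReadAllLines(@"c:\file.csv").ToList();

// LINQ: a simple linear scan, O(n)
bool found = lines.Any(l => l.StartsWith(input, StringComparison.Ordinal));

// BinarySearch: O(log n), exploiting the sort
int idx = lines.BinarySearch(input, StringComparer.Ordinal);
if (idx < 0) idx = ~idx; // complement gives the first line >= input
bool foundFast = idx < lines.Count &&
                 lines[idx].StartsWith(input, StringComparison.Ordinal);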

Answer by paparazzo

If you're only doing it once per program run, this seems pretty fast. (Updated to use StreamReader instead of FileStream based on comments below)


    static string FindRecordBinary(string search, string fileName)
    {
        using (StreamReader fs = new StreamReader(fileName))
        {
            long min = 0; // TODO: What about header row?
            long max = fs.BaseStream.Length;
            while (min <= max)
            {
                long mid = (min + max) / 2;
                fs.BaseStream.Position = mid;

                fs.DiscardBufferedData();
                if (mid != 0) fs.ReadLine();
                string line = fs.ReadLine();
                if (line == null) { min = mid+1; continue; }

                int compareResult;
                if (line.Length > search.Length)
                    compareResult = String.Compare(
                        line, 0, search, 0, search.Length, false );
                else
                    compareResult = String.Compare(line, search);

                if (0 == compareResult) return line;
                else if (compareResult > 0) max = mid-1;
                else min = mid+1;
            }
        }
        return null;
    }

This runs in 0.007 seconds for a 600,000-record test file that's 50 megs. In comparison, a file scan averages over half a second, depending on where the record is located (a 100-fold difference).


Obviously if you do it more than once, caching is going to speed things up. One simple way to do partial caching would be to keep the StreamReader open and re-use it, just reset min and max each time through. This would save you storing 50 megs in memory all the time.

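A sketch of that partial-caching idea as a hypothetical wrapper class; the loop is the same search as FindRecordBinary above, with the reader held open between calls and min/max reset each time:

// using System; using System.IO;
sealed class CachedRecordFinder : IDisposable
{
    private readonly StreamReader fs;

    public CachedRecordFinder(string fileName)
    {
        fs = new StreamReader(fileName);
    }

    public string Find(string search)
    {
        long min = 0;                      // bounds reset on every call
        long max = fs.BaseStream.Length;
        while (min <= max)
        {
            long mid = (min + max) / 2;
            fs.BaseStream.Position = mid;
            fs.DiscardBufferedData();      // buffered data is stale after seeking
            if (mid != 0) fs.ReadLine();   // skip the partial line at mid
            string line = fs.ReadLine();
            if (line == null) { min = mid + 1; continue; }

            int compareResult;
            if (line.Length > search.Length)
                compareResult = String.Compare(line, 0, search, 0, search.Length, false);
            else
                compareResult = String.Compare(line, search);

            if (compareResult == 0) return line;
            if (compareResult > 0) max = mid - 1;
            else min = mid + 1;
        }
        return null;
    }

    public void Dispose()
    {
        fs.Dispose();
    }
}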

EDIT: Added knaki02's suggested fix.


Answer by Joel Coehoorn

Normally I would recommend finding a dedicated CSV parser (like this or this). However, I noticed this line in your question:


I need to determine for a given input, whether any line in the csv begins with that input.


That tells me that computer time spent parsing CSV data before this is determined is time wasted. You just need code to simply match text for text, and you can do that via a string comparison as easily as anything else.


Additionally, you mention that the data is sorted. This should allow you to speed things up tremendously... but you need to be aware that to take advantage of this you will need to write your own code to make seek calls on low-level file streams. This will be by far your best-performing result, but it will also by far require the most initial work and maintenance.


I recommend an engineering-based approach, where you set a performance goal, build something relatively simple, and measure the results against that goal. In particular, start with the 2nd link I posted above. The CSV reader there will only load one record into memory at a time, so it should perform reasonably well, and it's easy to get started with. Build something that uses that reader, and measure the results. If they meet your goal, then stop there.


If they don't meet your goal, adapt the code from the link so that as you read each line you first do a string comparison (before bothering to parse the csv data), and only do the work to parse the csv for the lines that match, as sketched below. This should perform better, but only do the work if the first option does not meet your goals. When this is ready, measure the performance again.

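A minimal sketch of that prefilter. ParseCsvLine is a hypothetical stand-in for whichever dedicated parser you choose, and csvPath and input are placeholders:

// using System; using System.IO;
foreach (string line in File.ReadLines(csvPath))
{
    // Cheap string check first; pay the CSV-parsing cost only on a match.
    if (!line.StartsWith(input, StringComparison.Ordinal))
        continue;

    string[] fields = ParseCsvLine(line); // hypothetical: your CSV parser here
    // ... process the matching record ...
}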

Finally, if you still don't meet the performance goal, we're into the territory of writing low-level code to do a binary search on your file stream using seek calls. This is likely the best you'll be able to do, performance-wise, but it will be very messy and bug-prone code to write, and so you only want to go here if you absolutely do not meet your goals from earlier steps.


Remember, performance is a feature, and just like any other feature you need to evaluate how you build for that feature relative to real design goals. "As fast as possible" is not a reasonable design goal. Something like "respond to a user search within .25 seconds" is a real design goal, and if the simpler but slower code still meets that goal, you need to stop there.


Answer by Sam Jones

I wrote this quickly for work, could be improved on...


Define the column numbers:


private enum CsvCols
{
    PupilReference = 0,
    PupilName = 1,
    PupilSurname = 2,
    PupilHouse = 3,
    PupilYear = 4,
}

Define the model:


public class ImportModel
{
    public string PupilReference { get; set; }
    public string PupilName { get; set; }
    public string PupilSurname { get; set; }
    public string PupilHouse { get; set; }
    public string PupilYear { get; set; }
}

Import and populate a list of models:


var rows = File.ReadLines(csvfilePath).Select(p => p.Split(',')).Skip(1).ToArray();

var pupils = rows.Select(x => new ImportModel
{
    PupilReference = x[(int) CsvCols.PupilReference],
    PupilName = x[(int) CsvCols.PupilName],
    PupilSurname = x[(int) CsvCols.PupilSurname],
    PupilHouse = x[(int) CsvCols.PupilHouse],
    PupilYear = x[(int) CsvCols.PupilYear],
}).ToList();

This returns a list of strongly typed objects.

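From there, the OP's prefix check is a one-liner with LINQ (input is a placeholder, and this assumes the pupil reference is the column being matched):

// Does any row's reference column begin with the given input?
bool anyMatch = pupils.Any(p => p.PupilReference.StartsWith(input, StringComparison.Ordinal));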