C#: Does anyone know of a faster method to do String.Split()?

Disclaimer: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/568968/

Time: 2020-08-04 08:34:54  Source: igfitidea

Does anyone know of a faster method to do String.Split()?

Tags: c#, .net, performance, string, csv

Asked by

I am reading each line of a CSV file and need to get the individual values in each column. So right now I am just using:

values = line.Split(delimiter);

where line is a string that holds the values separated by the delimiter.

Measuring the performance of my ReadNextRow method, I noticed that it spends 66% of its time in String.Split, so I was wondering if someone knows of a faster method to do this.

Thanks!

Accepted answer by cletus

It should be pointed out that split() is a questionable approach for parsing CSV files, because a field may itself contain a comma, e.g.:

1,"Something, with a comma",2,3
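
To make the problem concrete, here is a minimal sketch of what a plain split does to that row. The class and method names (`NaiveSplitDemo`, `NaiveSplit`) are made up for illustration and are not from the original answer.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Splits on every comma, including the one inside the quoted field.
    public static string[] NaiveSplit(string line)
    {
        return line.Split(',');
    }
}
```

Calling `NaiveSplitDemo.NaiveSplit` on the row above yields five pieces rather than four, because the quoted field is cut in two and the quote characters are left in place.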

The other thing I'll point out, without knowing how you profiled, is to be careful about profiling this kind of low-level detail. The granularity of the Windows/PC timer might come into play, and you may have significant overhead in just looping, so use some sort of control value.
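
One way to get such a control value is to time an empty loop of the same shape next to the loop that calls Split. This is only a rough sketch, assuming `System.Diagnostics.Stopwatch` is precise enough for your run lengths; `SplitTiming` and `Measure` are hypothetical names, and a real benchmark would also need warm-up runs and repetition.

```csharp
using System;
using System.Diagnostics;

public static class SplitTiming
{
    // Returns the time for `iterations` Split calls and the time for a control
    // loop of the same shape, so that loop overhead can be compared or subtracted.
    public static (double SplitMs, double ControlMs) Measure(string line, char delimiter, int iterations)
    {
        long sink = 0; // accumulated so the loop bodies are less likely to be optimized away

        var control = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sink += line.Length;
        }
        control.Stop();

        var split = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            sink += line.Split(delimiter).Length;
        }
        split.Stop();

        GC.KeepAlive(sink);
        return (split.Elapsed.TotalMilliseconds, control.Elapsed.TotalMilliseconds);
    }
}
```

If the two numbers are close to each other, the measurement is mostly loop and timer noise and says little about Split itself.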

That being said, a regex-based split() (as in Java's String.split; .NET's String.Split matches literal separators) is built to handle regular expressions, which are obviously more complex than you need (and the wrong tool to deal with escaped commas anyway). Also, split() creates lots of temporary objects.

So if you want to speed it up (and I have trouble believing that performance of this part is really an issue) then you want to do it by hand and you want to reuse your buffer objects so you're not constantly creating objects and giving the garbage collector work to do in cleaning them up.

The algorithm for that is relatively simple:

  • Stop at every comma;
  • When you hit quotes continue until you hit the next set of quotes;
  • Handle escaped quotes (ie \") and arguably escaped commas (\,).
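
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the answer author's code: `SimpleCsvLineParser` is a hypothetical name, a backslash is treated as an escape only when it precedes a quote or the delimiter, doubled quotes ("") are not handled, and the returned list is reused between calls to keep allocations down.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class SimpleCsvLineParser
{
    // Reused across calls so that no new builder or list is allocated per line.
    private readonly StringBuilder _field = new StringBuilder();
    private readonly List<string> _fields = new List<string>();

    // Parses one line. The returned list is overwritten by the next call,
    // so copy it if the values must outlive that call.
    public List<string> Parse(string line, char delimiter = ',')
    {
        _fields.Clear();
        _field.Clear();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '\\' && i + 1 < line.Length
                && (line[i + 1] == '"' || line[i + 1] == delimiter))
            {
                _field.Append(line[++i]);       // escaped quote or delimiter: keep it literally
            }
            else if (c == '"')
            {
                inQuotes = !inQuotes;           // opening or closing quote, not part of the value
            }
            else if (c == delimiter && !inQuotes)
            {
                _fields.Add(_field.ToString()); // stop at every unquoted delimiter
                _field.Clear();
            }
            else
            {
                _field.Append(c);
            }
        }

        _fields.Add(_field.ToString());         // the last field has no trailing delimiter
        return _fields;
    }
}
```

Each field value is still a new string here; only the builder and the list are reused, which is the cheap part of the buffer reuse described above.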

Oh, and to give you some idea of the cost of regex, there was a question (Java, not C#, but the principle was the same) where someone wanted to replace every n-th character with a string. I suggested using replaceAll() on String. Jon Skeet manually coded the loop. Out of curiosity I compared the two versions, and his was an order of magnitude better.

So if you really want performance, it's time to hand parse.

Or, better yet, use someone else's optimized solution like this fast CSV reader.

By the way, while this relates to Java, it concerns the performance of regular expressions in general (which is universal) and replaceAll() vs. a hand-coded loop: "Putting char into a java string for each N characters".

Answer by John Leidegren

The BCL implementation of string.Split is actually quite fast. I've done some testing here trying to outperform it, and it's not easy.

But there's one thing you can do and that's to implement this as a generator:

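using System.Collections.Generic;

public static class StringSplitExtensions
{
    // Lazily yields the non-empty pieces of s separated by c. Empty pieces are
    // skipped (similar to StringSplitOptions.RemoveEmptyEntries), so column
    // positions shift if a CSV row contains empty fields.
    public static IEnumerable<string> GetSplit( this string s, char c )
    {
        int l = s.Length;
        int i = 0, j = s.IndexOf( c, 0, l );
        if ( j == -1 ) // No such substring
        {
            yield return s; // Return original and break
            yield break;
        }

        while ( j != -1 )
        {
            if ( j - i > 0 ) // Non empty?
            {
                yield return s.Substring( i, j - i ); // Return non-empty match
            }
            i = j + 1;
            j = s.IndexOf( c, i, l - i );
        }

        if ( i < l ) // Has remainder?
        {
            yield return s.Substring( i, l - i ); // Return remaining trail
        }
    }
}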

The above method is not necessarily faster than string.Split for small strings, but it returns results as it finds them; this is the power of lazy evaluation. If you have long lines or need to conserve memory, this is the way to go.

The above method is bounded by the performance of IndexOf and Substring, which do too much index-out-of-range checking. To be faster you need to optimize these away and implement your own helper methods. You can beat the string.Split performance, but it's going to take clever int-hacking. You can read my post about that here.

Answer by MSalters

You can assume that String.Split will be close to optimal; i.e. it could be quite hard to improve on it. By far the easier solution is to check whether you need to split the string at all. It's quite likely that you'll be using the individual strings directly. If you define a StringShim class (reference to String, begin & end index) you'll be able to split a String into a set of shims instead. These will have a small, fixed size, and will not cause string data copies.
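
A rough sketch of that idea follows. The answer speaks of a StringShim class; a struct is used here instead to keep each shim small and allocation-free, and every member name (`Source`, `Begin`, `End`, `ShimSplitter`) is an assumption made for illustration. It assumes no quoting in the input.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shim type: a reference to the source string plus begin and end indexes.
public readonly struct StringShim
{
    public readonly string Source;
    public readonly int Begin; // inclusive
    public readonly int End;   // exclusive

    public StringShim(string source, int begin, int end)
    {
        Source = source;
        Begin = begin;
        End = end;
    }

    public int Length => End - Begin;

    public char this[int index] => Source[Begin + index];

    // Copies the characters only when a real string is actually needed.
    public override string ToString() => Source.Substring(Begin, Length);
}

public static class ShimSplitter
{
    // Records field boundaries without copying any character data.
    public static List<StringShim> Split(string s, char delimiter)
    {
        var shims = new List<StringShim>();
        int start = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] == delimiter)
            {
                shims.Add(new StringShim(s, start, i));
                start = i + 1;
            }
        }
        shims.Add(new StringShim(s, start, s.Length));
        return shims;
    }
}
```

On newer .NET versions, `ReadOnlySpan<char>` and `ReadOnlyMemory<char>` cover much the same ground as a hand-written shim, so they may be worth a look before rolling your own.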

Answer by Dave Van den Eynde

You might think that there are optimizations to be had, but the reality will be you'll pay for them elsewhere.

You could, for example, do the split 'yourself' and walk through all the characters and process each column as you encounter it, but you'd be copying all the parts of the string in the long run anyhow.

One of the optimizations we could do in C or C++, for example, is replace all the delimiters with '\0' characters, and keep pointers to the start of the column. Then, we wouldn't have to copy all of the string data just to get to a part of it. But this you can't do in C#, nor would you want to.

If there is a big difference between the number of columns that are in the source, and the number of columns that you need, walking the string manually may yield some benefit. But that benefit would cost you the time to develop it and maintain it.
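
As a sketch of that case, the helper below walks the line with IndexOf and copies only the one column that is asked for. `ColumnReader` and `TryGetColumn` are hypothetical names, and the code assumes fields never contain quoted or escaped delimiters.

```csharp
using System;

public static class ColumnReader
{
    // Extracts the zero-based column `index` from `line`. Returns false (and an
    // empty value) if the line has fewer columns. Only the requested column is
    // copied into a new string; the others are skipped without allocation.
    public static bool TryGetColumn(string line, char delimiter, int index, out string value)
    {
        value = string.Empty;
        int start = 0;
        for (int col = 0; col < index; col++)
        {
            int next = line.IndexOf(delimiter, start);
            if (next == -1)
            {
                return false; // not enough columns
            }
            start = next + 1;
        }

        int end = line.IndexOf(delimiter, start);
        if (end == -1)
        {
            end = line.Length; // last column runs to the end of the line
        }
        value = line.Substring(start, end - start);
        return true;
    }
}
```

Whether this beats Split in practice depends on how many columns are skipped, so it is worth measuring on real data before committing to the extra code.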

I've been told that 90% of the CPU time is spent in 10% of the code. There are variations to this "truth". In my opinion, spending 66% of your time in Split is not that bad if processing CSV is the thing that your app needs to do.

Dave

Answer by Charlie

Some very thorough analysis on String.Split() vs Regex and other methods.

We are talking about millisecond savings over very large strings, though.

Answer by Lasse V. Karlsen

The main problem(?) with String.Split is that it's general, in that it caters for many needs.

If you know more about your data than Split would, writing your own version can be an improvement.

For instance, if:

  1. You don't care about empty strings, so you don't need to handle those in any special way
  2. You don't need to trim strings, so you don't need to do anything with or around those
  3. You don't need to check for quoted commas or quotes
  4. You don't need to handle quotes at all

If any of these are true, you might see an improvement by writing your own more specific version of String.Split.
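
A more specific version might look like the sketch below, which leans on all four assumptions: no quoting, no trimming, empty fields passed through untouched, and a caller-supplied array that is reused for every line. `FixedColumnSplitter` and `SplitInto` are hypothetical names, and any gain over String.Split should be confirmed by measurement.

```csharp
using System;

public static class FixedColumnSplitter
{
    // Splits `line` into the caller-supplied `buffer` and returns the number of
    // fields written. Empty fields are stored as empty strings, and fields
    // beyond buffer.Length are ignored.
    public static int SplitInto(string line, char delimiter, string[] buffer)
    {
        int count = 0;
        int start = 0;
        while (count < buffer.Length)
        {
            int end = line.IndexOf(delimiter, start);
            if (end == -1)
            {
                buffer[count++] = line.Substring(start); // last field
                break;
            }
            buffer[count++] = line.Substring(start, end - start);
            start = end + 1;
        }
        return count;
    }
}
```

The caller allocates the array once, sized to the expected column count, and passes it in for each row, which avoids one array allocation per line.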

Having said that, the first question you should ask is whether this actually is a problem worth solving. Is the time taken to read and import the file so long that you actually feel this is a good use of your time? If not, then I would leave it alone.

The second question is why String.Split is using that much time compared to the rest of your code. If the answer is that the code is doing very little with the data, then I would probably not bother.

However, if, say, you're stuffing the data into a database, then 66% of the time of your code spent in String.Split constitutes a big big problem.

Answer by kpollock

CSV parsing is actually fiendishly complex to get right. I used classes based on wrapping the ODBC Text driver the one and only time I had to do this.

The ODBC solution recommended above looks at first glance to be basically the same approach.

I thoroughly recommend you do some research on CSV parsing before you get too far down a path that nearly-but-not-quite works (all too common). The Excel habit of only double-quoting the strings that need it is one of the trickiest things to deal with in my experience.

Answer by kpollock

Depending on use, you can speed this up by using Pattern.split instead of String.split. If you have this code in a loop (which I assume you probably do since it sounds like you are parsing lines from a file) String.split(String regex) will call Pattern.compile on your regex string every time that statement of the loop executes. To optimize this, Pattern.compile the pattern once outside the loop and then use Pattern.split, passing the line you want to split, inside the loop.

Hope this helps

Answer by chase

String.split is rather slow; if you want some faster methods, here you go. :)

However, CSV is much better parsed by a rule-based parser.

This guy has made a rule-based tokenizer for Java (it requires some copying and pasting, unfortunately).

http://www.csdgn.org/code/rule-tokenizer

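import java.util.ArrayList;

public final class FastSplit {
    // Splits src on a single-character delimiter without using a regex.
    private static final String[] fSplit(String src, char delim) {
        ArrayList<String> output = new ArrayList<String>();
        int index = 0;
        int lindex = 0;
        while((index = src.indexOf(delim,lindex)) != -1) {
            output.add(src.substring(lindex,index));
            lindex = index+1;
        }
        output.add(src.substring(lindex)); // trailing piece after the last delimiter
        return output.toArray(new String[output.size()]);
    }

    // Same idea for a multi-character delimiter. delim must not be empty,
    // otherwise lindex never advances and the loop does not terminate.
    private static final String[] fSplit(String src, String delim) {
        ArrayList<String> output = new ArrayList<String>();
        int index = 0;
        int lindex = 0;
        while((index = src.indexOf(delim,lindex)) != -1) {
            output.add(src.substring(lindex,index));
            lindex = index+delim.length();
        }
        output.add(src.substring(lindex)); // trailing piece after the last delimiter
        return output.toArray(new String[output.size()]);
    }
}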

Answer by codeulike

As others have said, String.Split() will not always work well with CSV files. Consider a file that looks like this:

"First Name","Last Name","Address","Town","Postcode"
David,O'Leary,"12 Acacia Avenue",London,NW5 3DF
June,Robinson,"14, Abbey Court","Putney",SW6 4FG
Greg,Hampton,"",,
Stephen,James,"""Dunroamin"" 45 Bridge Street",Bristol,BS2 6TG

(e.g. inconsistent use of speechmarks, strings including commas and speechmarks, etc)
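
To show what handling those rows involves, here is a minimal sketch of a line parser that accepts optional quoting and treats a doubled quote ("") inside a quoted field as one literal quote. `QuotedCsv` and `ParseLine` are hypothetical names; the sketch does not cover fields that span several lines and is not a substitute for a tested CSV library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotedCsv
{
    // Splits one comma-separated line where fields may be wrapped in double
    // quotes and a literal quote inside a quoted field is written as "".
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    field.Append('"'); // doubled quote: one literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;  // closing quote
                }
                else
                {
                    field.Append(c);   // commas inside quotes are ordinary characters
                }
            }
            else if (c == '"')
            {
                inQuotes = true;       // opening quote
            }
            else if (c == ',')
            {
                fields.Add(field.ToString());
                field.Clear();
            }
            else
            {
                field.Append(c);
            }
        }

        fields.Add(field.ToString());  // last field has no trailing comma
        return fields;
    }
}
```

Run against the sample rows above, each line comes back as five fields, with the comma kept inside "14, Abbey Court" and the doubled quotes around Dunroamin reduced to single ones.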

This CSV reading framework will deal with all of that, and is also very efficient:

LumenWorks.Framework.IO.Csv by Sebastien Lorion