What is the fastest way to iterate through individual characters in a string in C#?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original question on StackOverflow: http://stackoverflow.com/questions/8793762/

Tags: c#, string

Asked by Joshua Honig

The title is the question. Below is my attempt to answer it through research. But I don't trust my uninformed research so I still pose the question (What is the fastest way to iterate through individual characters in a string in C#?).

Occasionally I want to cycle through the characters of a string one-by-one, such as when parsing for nested tokens -- something which cannot be done with regular expressions. I am wondering what the fastest way is to iterate through the individual characters in a string, particularly very large strings.

I did a bunch of testing myself and my results are below. However, many readers have much more in-depth knowledge of the .NET CLR and C# compiler, so I don't know if I'm missing something obvious, or if I made a mistake in my test code. So I solicit your collective response. If anyone has insight into how the string indexer actually works, that would be very helpful. (Is it a C# language feature compiled into something else behind the scenes? Or something built into the CLR?)

The first method using a stream was taken directly from the accepted answer to the thread "How to generate a stream from a string?"

Tests

longString is a 99.1 million character string consisting of 89 copies of the plain-text version of the C# language specification. Results shown are for 20 iterations. Where there is a 'startup' time (such as for the first iteration of the implicitly created array in method #3), I tested that separately, such as by breaking from the loop after the first iteration.
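
The per-run timings and the StdDev column in the tables could be gathered with a small harness along these lines. This is only a sketch, not the code actually used for the numbers here: TimingSketch, TimeRuns, Mean and StdDev are hypothetical names, and using the population form of the standard deviation is an assumption.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Runs `action` `iterations` times and returns each run's elapsed milliseconds.
    public static double[] TimeRuns(Action action, int iterations)
    {
        var samples = new double[iterations];
        for (int run = 0; run < iterations; run++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            samples[run] = sw.Elapsed.TotalMilliseconds;
        }
        return samples;
    }

    // Arithmetic mean of the samples.
    public static double Mean(double[] samples)
    {
        double sum = 0;
        foreach (double s in samples) sum += s;
        return sum / samples.Length;
    }

    // Population standard deviation (divides by N, not N - 1); an assumption of this sketch.
    public static double StdDev(double[] samples)
    {
        double mean = Mean(samples);
        double sumOfSquares = 0;
        foreach (double s in samples) sumOfSquares += (s - mean) * (s - mean);
        return Math.Sqrt(sumOfSquares / samples.Length);
    }
}
```

A call such as TimingSketch.TimeRuns(() => SomeLoop(longString), 20) would give the 20 samples, where SomeLoop stands for whichever of the methods is being measured.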

Results

From my tests, caching the string in a char array using the ToCharArray() method is the fastest for iterating over the entire string. The ToCharArray() call is an upfront expense, and subsequent access to individual characters is slightly faster than the built-in index accessor.

                                                 milliseconds
                                      ---------------------------------
 Method                               Startup  Iteration  Total  StdDev
------------------------------------  -------  ---------  -----  ------
 1 index accessor                           0        602    602       3
 2 explicit convert ToCharArray           165        410    582       3
 3 foreach (c in string.ToCharArray)      168        455    623       3
 4 StringReader                             0       1150   1150      25
 5 StreamWriter => Stream                 405       1940   2345      20
 6 GetBytes() => StreamReader             385       2065   2450      35
 7 GetBytes() => BinaryReader             385       5465   5850      80
 8 foreach (c in string)                    0        960    960       4

Update: Per @Eric's comment, here are results for 100 iterations over a more normal 1.1 M char string (one copy of the C# spec). Indexer and char arrays are still fastest, followed by foreach(char in string), followed by stream methods.

                                           milliseconds
                                ---------------------------------
 Method                         Startup  Iteration  Total  StdDev
------------------------------  -------  ---------  -----  ------
 1 index accessor                     0        6.6    6.6    0.11
 2 explicit convert ToCharArray     2.4        5.0    7.4    0.30
 3 for(c in string.ToCharArray)     2.4        4.7    7.1    0.33
 4 StringReader                       0       14.0   14.0    1.21
 5 StreamWriter => Stream           5.3       21.8   27.1    0.46
 6 GetBytes() => StreamReader       4.4       23.6   28.0    0.65
 7 GetBytes() => BinaryReader       5.0       61.8   66.8    0.79
 8 foreach (c in string)              0       10.3   10.3    0.11     

Code Used (tested separately; shown together for brevity)

//1 index accessor
int strLength = longString.Length;
for (int i = 0; i < strLength; i++) { c = longString[i]; }

//2 explicit convert ToCharArray
int strLength = longString.Length;
char[] charArray = longString.ToCharArray();
for (int i = 0; i < strLength; i++) { c = charArray[i]; }

//3 for(c in string.ToCharArray)
foreach (char c in longString.ToCharArray()) { } 

//4 use StringReader
int strLength = longString.Length;
StringReader sr = new StringReader(longString);
for (int i = 0; i < strLength; i++) { c = Convert.ToChar(sr.Read()); }

//5 StreamWriter => StreamReader 
int strLength = longString.Length;
MemoryStream stream = new MemoryStream();
StreamWriter writer = new StreamWriter(stream);
writer.Write(longString);
writer.Flush();
stream.Position = 0;
StreamReader str = new StreamReader(stream);
while (stream.Position < strLength) { c = Convert.ToChar(str.Read()); } 

//6 GetBytes() => StreamReader
int strLength = longString.Length;
MemoryStream stream = new MemoryStream(Encoding.Unicode.GetBytes(longString));
StreamReader str = new StreamReader(stream);
while (stream.Position < strLength) { c = Convert.ToChar(str.Read()); }

//7 GetBytes() => BinaryReader 
int strLength = longString.Length;
MemoryStream stream = new MemoryStream(Encoding.Unicode.GetBytes(longString));
BinaryReader br = new BinaryReader(stream, Encoding.Unicode);
while (stream.Position < strLength) { c = br.ReadChar(); }

//8 foreach (c in string)
foreach (char c in longString) { } 

Accepted answer:

I interpreted @CodeInChaos and Ben's notes as follows:

fixed (char* pString = longString) {
    char* pChar = pString;
    for (int i = 0; i < strLength; i++) {
        c = *pChar ;
        pChar++;
    }
}

Execution for 100 iterations over the short string took 4.4 ms, with a standard deviation below 0.1 ms.

Accepted answer by Ben Voigt

The fastest answer is to use C++/CLI: How to: Access Characters in a System::String

This approach iterates through the characters in-place in the string using pointer arithmetic. There are no copies, no implicit range checks, and no per-element function calls.

It's likely possible to get nearly the same performance from C# (only nearly, because C++/CLI doesn't require pinning) by writing an unsafe C# version of PtrToStringChars.

Something like:

unsafe char* PtrToStringContent(string s, out GCHandle pin)
{
    pin = GCHandle.Alloc(s, GCHandleType.Pinned);
    return (char*)pin.AddrOfPinnedObject().Add(System.Runtime.CompilerServices.RuntimeHelpers.OffsetToStringData).ToPointer();
}

Do remember to call GCHandle.Free afterwards.
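
One way to make sure the handle is always released is to wrap the pointer work in try/finally. The following is a hedged sketch rather than a tuned implementation: PinnedStringSketch and SumChars are hypothetical names, the code must be compiled with unsafe code enabled, and it assumes that AddrOfPinnedObject() on a pinned string already returns the address of the first character (so no extra offset is added here; check this against your runtime).

```csharp
using System;
using System.Runtime.InteropServices;

public static class PinnedStringSketch
{
    // Sums the UTF-16 code units of `s` by reading them through a pinned pointer.
    public static unsafe long SumChars(string s)
    {
        GCHandle pin = GCHandle.Alloc(s, GCHandleType.Pinned);
        try
        {
            // Assumption: for a pinned string this points at the first character.
            char* p = (char*)pin.AddrOfPinnedObject().ToPointer();
            long sum = 0;
            for (int i = 0; i < s.Length; i++)
            {
                sum += p[i];
            }
            return sum;
        }
        finally
        {
            // Release the pin even if the loop throws, so the GC can move the string again.
            pin.Free();
        }
    }
}
```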

CodeInChaos's comment points out that C# provides syntactic sugar for this:

fixed(char* pch = s) { ... }
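
Filled in, the fixed form might look like the sketch below. FixedStringSketch and CountChar are hypothetical names chosen for illustration, and the code needs unsafe code enabled at compile time. The string is pinned only for the duration of the fixed block, so there is no handle to free.

```csharp
public static class FixedStringSketch
{
    // Counts how often `target` occurs in `s` by walking the characters with a raw pointer.
    public static unsafe int CountChar(string s, char target)
    {
        int count = 0;
        fixed (char* pch = s)
        {
            // One past the last character; the loop never dereferences it.
            char* end = pch + s.Length;
            for (char* p = pch; p < end; p++)
            {
                if (*p == target) count++;
            }
        }
        return count;
    }
}
```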

Answer by Jon Skeet

Any reason not to include foreach?

foreach (char c in text)
{
    ...
}

Is this really going to be your performance bottleneck, by the way? What proportion of your total running time does the iteration itself take?

Answer by Olivier Jacot-Descombes

If speed really matters, for is faster than foreach:

for (int i = 0; i < text.Length; i++) {
   char ch = text[i];
   ...
}

Answer by Hans Passant

These kinds of artificial tests are pretty dangerous. Notably, your //2 and //3 versions of the code never actually index the string. The jitter optimizer just throws away the code since the c variable isn't used at all. You are just measuring how long the for() loop takes. You can't really see this unless you look at the generated machine code.

Change it to c += longString[i]; to force the array indexer to be used.

Which is nonsense of course. Profile only real code.
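
For anyone who still wants a micro-benchmark, the point about the unused c variable can be illustrated with a loop whose result is actually consumed. This is a minimal sketch with hypothetical names (ConsumeResultSketch, Checksum, Run), not a replacement for profiling real code.

```csharp
using System;
using System.Diagnostics;

public static class ConsumeResultSketch
{
    // Every character feeds into the returned value, so the reads are observable
    // and the JIT cannot simply discard them.
    public static int Checksum(string text)
    {
        int sum = 0;
        for (int i = 0; i < text.Length; i++)
        {
            sum += text[i];
        }
        return sum;
    }

    // Times one pass and prints the checksum so the result stays live.
    public static void Run()
    {
        string text = new string('x', 1000000);
        var sw = Stopwatch.StartNew();
        int checksum = Checksum(text);
        sw.Stop();
        Console.WriteLine("checksum=" + checksum + ", elapsed ms=" + sw.Elapsed.TotalMilliseconds);
    }
}
```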

Answer by L.B

If micro-optimization is very important to you, then try this. (For simplicity, I assumed the input string's length is a multiple of 8.)

unsafe void LoopString()
{
    fixed (char* p = longString)
    {
        char c1,c2,c3,c4;
        Int64 len = longString.Length;
        Int64* lptr = (Int64*)p;
        Int64 l;
        for (int i = 0; i < len; i+=8)
        {
            l = *lptr;
            c1 = (char)(l & 0xffff);
            c2 = (char)(l >> 16);
            c3 = (char)(l >> 32);
            c4 = (char)(l >> 48);
            lptr++;
        }
    }
}

Just kidding, never use this code :)

Answer by Pieter van Ginkel

TL;DR: a simple foreach is the fastest way to iterate a string.

For people coming back to this: times change!

With the latest .NET 64-bit JIT, the unsafe version actually is the slowest.

Below is a benchmark implementation for BenchmarkDotNet. From these, I got the following results:

          Method |      Mean |     Error |    StdDev |
---------------- |----------:|----------:|----------:|
        Indexing | 5.9712 us | 0.8738 us | 0.3116 us |
 IndexingOnArray | 8.2907 us | 0.8208 us | 0.2927 us |
  ForEachOnArray | 8.1919 us | 0.6505 us | 0.1690 us |
         ForEach | 5.6946 us | 0.0648 us | 0.0231 us |
          Unsafe | 7.2952 us | 1.1050 us | 0.3941 us |

The interesting ones are those that do not work on an array copy. They show that indexing and foreach are very similar in performance, with foreach about 5% faster. Using unsafe is actually 28% slower than using foreach.

In the past, unsafe may have been the fastest option, but JITs get faster and smarter all the time.

As a reference, the benchmark code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Horology;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

namespace StringIterationBenchmark
{
    public class StringIteration
    {
        public static void Main(string[] args)
        {
            var config = new ManualConfig();

            config.Add(DefaultConfig.Instance);

            config.Add(Job.Default
                .WithLaunchCount(1)
                .WithIterationTime(TimeInterval.FromMilliseconds(500))
                .WithWarmupCount(3)
                .WithTargetCount(6)
            );

            BenchmarkRunner.Run<StringIteration>(config);
        }

        private readonly string _longString = BuildLongString();

        private static string BuildLongString()
        {
            var sb = new StringBuilder();
            var random = new Random();

            while (sb.Length < 10000)
            {
                char c = (char)random.Next(char.MaxValue);
                if (!Char.IsControl(c))
                    sb.Append(c);
            }

            return sb.ToString();
        }

        [Benchmark]
        public char Indexing()
        {
            char c = '\0';
            var longString = _longString;
            int strLength = longString.Length;

            for (int i = 0; i < strLength; i++)
            {
                c |= longString[i];
            }

            return c;
        }

        [Benchmark]
        public char IndexingOnArray()
        {
            char c = '\0';
            var longString = _longString;
            int strLength = longString.Length;
            char[] charArray = longString.ToCharArray();

            for (int i = 0; i < strLength; i++)
            {
                c |= charArray[i];
            }

            return c;
        }

        [Benchmark]
        public char ForEachOnArray()
        {
            char c = '\0';
            var longString = _longString;

            foreach (char item in longString.ToCharArray())
            {
                c |= item;
            }

            return c;
        }

        [Benchmark]
        public char ForEach()
        {
            char c = '\0';
            var longString = _longString;

            foreach (char item in longString)
            {
                c |= item;
            }

            return c;
        }

        [Benchmark]
        public unsafe char Unsafe()
        {
            char c = '\0';
            var longString = _longString;
            int strLength = longString.Length;

            fixed (char* p = longString)
            {
                var p1 = p;

                for (int i = 0; i < strLength; i++)
                {
                    c |= *p1;
                    p1++;
                }
            }

            return c;
        }
    }
}

The code has a few minor changes from the code in the question. The chars retrieved from the original string are OR-ed (|) into the variable being returned, and we return that value. The reason for this is that we actually need to do something with the result. Otherwise, if we'd just be iterating over the string like:

//8 foreach (c in string)
foreach (char c in longString) { }

the JIT is free to remove this because it could infer that you're not actually observing the results of the iteration. By OR-ing the characters of the string together and returning the result, we give BenchmarkDotNet a value to consume, which makes sure that the JIT can't perform this optimization.
