在 Byte[] 数组中查找第一个特定字节 c#

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/974888/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-06 04:38:45  来源:igfitidea点击:

Find First Specific Byte in a Byte[] Array c#

c#design-patternsbytearraysearch

提问by divinci

I have a byte array and wish to find the first occurance (if any) of a specific byte.

我有一个字节数组,并希望找到特定字节的第一次出现(如果有)。

Can you guys help me with a nice, elegant and efficient way to do it?

你们能帮我用一个漂亮、优雅和有效的方式来做吗?

 /// Summary
/// Finds the first occurance of a specific byte in a byte array.
/// If not found, returns -1.
public int GetFirstOccurance(byte byteToFind, byte[] byteArray)
{

}

采纳答案by Philippe Leybaert

public static int GetFirstOccurance(byte byteToFind, byte[] byteArray)
{
   return Array.IndexOf(byteArray,byteToFind);
}

It will return -1 if not found

如果未找到,它将返回 -1

Or as Sam pointed out, an extension method:

或者正如 Sam 指出的那样,一个扩展方法:

public static int GetFirstOccurance(this byte[] byteArray, byte byteToFind)
{
   return Array.IndexOf(byteArray,byteToFind);
}

Or to make it generic:

或使其通用:

public static int GetFirstOccurance<T>(this T[] array, T element)
{
   return Array.IndexOf(array,element);
}

Then you can just say:

然后你可以说:

int firstIndex = byteArray.GetFirstOccurance(byteValue);

回答by Greg Beech

回答by Glenn Slayden

Since you mentioned efficiency, here is some heavily optimized C# codeI've written which uses native addressing and maximal qword-aligned reading to cut the number of memory accesses by a factor of 8. I would be surprised if there is any faster way to scan for a byte in memory in .NET.

既然你提到了效率,这里是我编写的一些高度优化的 C# 代码,它使用本机寻址和最大 qword 对齐读取将内存访问次数减少了 8 倍。如果有任何更快的方法,我会感到惊讶在.NET 中扫描内存中的一个字节。

This returns the index of the first occurrence of byte 'v'within the range of memory starting at offset i(relative to address src), and continuing for length c. Returns -1if byte vis not found.

这将返回从 offset (相对于 address )开始的内存范围内第一次出现字节 'v'索引,并继续 length 。如果找不到字节,则返回-1isrccv

// fast IndexOf byte in memory. (To use this with managed byte[] array, see below)
public unsafe static int IndexOfByte(byte* src, byte v, int i, int c)
{
    ulong t;
    byte* p, pEnd;

    for (p = src + i; ((long)p & 7) != 0; c--, p++)
        if (c == 0)
            return -1;
        else if (*p == v)
            return (int)(p - src);

    ulong r = v; r |= r << 8; r |= r << 16; r |= r << 32;

    for (pEnd = p + (c & ~7); p < pEnd; p += 8)
    {
        t = *(ulong*)p ^ r;
        t = (t - 0x0101010101010101) & ~t & 0x8080808080808080;
        if (t != 0)
        {
            t &= (ulong)-(long)t;
            return (int)(p - src) + dbj8[t * 0x07EDD5E59A4E28C2 >> 58];
        }
    }

    for (pEnd += c & 7; p < pEnd; p++)
        if (*p == v)
            return (int)(p - src);

    return -1;
}

Don't be alarmed by the one multiplication you see; it's only executed a maximum of once per call of this function in order to do a final deBruijn lookup. The read-only lookup table used for that is a simple shared list of 64 byte values which requires one-time only initialization:

不要被你看到的一个乘法吓到;每次调用此函数最多只执行一次,以便进行最终的deBruijn 查找。用于此的只读查找表是一个简单的 64 字节值共享列表,它只需要一次初始化:

// elsewhere in the static class...

readonly static sbyte[] dbj8 =
{
     7, -1, -1, -1, -1,  5, -1, -1, -1,  4, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1,  6, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1,  3, -1, -1, -1, -1, -1, -1,  1, -1,  2,  0, -1, -1,
};

The -1values are never accessed and can be left at zero if desired instead, as shown in the following alternative to the preceding table initialization code, if you prefer:

这些-1值永远不会被访问,如果需要,可以将其保留为零,如果您愿意,如以下替代前面的表初始化代码所示:

static MyStaticClass()
{
    dbj8 = new sbyte[64];  // initialize the lookup table (alternative to the above)
    dbj8[0x00] = 7;
    dbj8[0x18] = 6;
    dbj8[0x05] = 5;
    dbj8[0x09] = 4;
    dbj8[0x33] = 3;
    dbj8[0x3C] = 2;
    dbj8[0x3A] = 1;
 /* dbj8[0x3D] = 0; */
}

readonly static sbyte[] dbj8, dbj16;

For completeness, here is how to use the function with the method prototype provided by the OP in the original question.

为了完整起见,这里是如何使用原始问题中 OP 提供的方法原型的函数。

/// Finds the first occurrence of a specific byte in a byte array.
/// If not found, returns -1.
public static unsafe int GetFirstOccurance(byte byteToFind, byte[] byteArray)
{
    fixed (byte* p = byteArray)
        return IndexOfByte(p, byteToFind, 0, byteArray.Length);
}

Discussion
My code is a bit intricate, so detailed examination is left as an exercise for the interested reader. You can study another take on the general approach of gang-wise memory searching in the .NETinternal method Buffer.IndexOfByte, but that code has significant drawbacks compared to mine:

讨论
我的代码有点复杂,所以详细的检查留给感兴趣的读者作为练习。您可以在.NET内部方法Buffer.IndexOfByte 中研究另一种关于成组内存搜索的一般方法,但该代码与我的相比有明显的缺点:

  • Most significantly, the .NET code only scans 4 bytes at time instead of 8 as in mine.
  • It's a non-public method, so you'd need to use reflection to call it.
  • The .NET code has a "performance leak" where the t1 != 0check gives a false positive, and the four checks that follow are wasted. Note their "fall-through" case: due to this false-positive, they need four final checks--thus allowing fall-through--to maintain correctness, instead of just three.
  • The .NET code's false-positive is caused by an inherently inferior bitwise computation based on overflowof the carry bit from one byte to the next. This leads to two's complementasymmetries (evidenced by their use of constants 0x7efefeffor 0x81010100) and the occasional "left-wise egress" (i.e., loss) of information regarding the most-significant byte, which is the real problem here. In contrast, I use an underflowcomputation which keeps each byte's computation independent of its neighbors'. My method gives a conclusive result in all cases with no false-positive or "fall-through" processing.
  • My code uses a branchless techniquefor the final lookup. A handful of non-branching logical operations (plus one multiplication in this case) is generally believed to favor performance over extended if-elsestructures, since the latter can disrupt CPU predictive caching. This issue is more important for my 8-byte scanner because without using lookup I'd have twice as many if-elseconditions in the final check, as compared to a 4-byte gang-wise scanner.
  • 最重要的是,.NET 代码一次只扫描 4 个字节,而不是我的 8 个字节。
  • 这是一个非公共方法,因此您需要使用反射来调用它。
  • .NET 代码存在“性能泄漏”,其中t1 != 0检查给出了误报,随后的四项检查都被浪费了。请注意他们的“失败”情况:由于这种误报,他们需要四次最终检查——因此允许失败——以保持正确性,而不是仅仅三次。
  • .NET 代码的误报是由基于进位位从一个字节溢出到下一个字节的固有劣质按位计算引起的。这会导致二进制补码不对称(通过使用常量0x7efefeffor证明0x81010100)和偶尔“向左出口”(即丢失)有关最高有效字节的信息,这是这里的真正问题。相比之下,我使用下溢计算使每个字节的计算独立于其邻居。我的方法在所有情况下都给出了结论性的结果,没有误报或“失败”处理。
  • 我的代码使用无分支技术进行最终查找。少数非分支逻辑操作(在这种情况下加上一个乘法)通常被认为比扩展if-else结构有利于性能,因为后者会破坏CPU 预测缓存。这个问题对于我的 8 字节扫描器来说更重要if-else,因为与 4 字节成组扫描器相比,如果不使用查找,我在最终检查中的条件数量会增加一倍。

Of course if you're not concerned with all this minutiae you can just copy and use the code; I've unit-tested it quite exhaustively and verified correct behavior for all well-formed inputs. So while the core functionality is ready to use, you'll probably want to add argument checking.

当然,如果您不关心所有这些细节,您可以复制和使用代码;我已经对它进行了非常详尽的单元测试,并验证了所有格式正确的输入的正确行为。因此,虽然核心功能已准备就绪,但您可能希望添加参数检查。



[edit:]

[编辑:]

String.IndexOf(String s, Char char, int ix_start, int count) ... fast!

Because the above method has worked so successfully in my projects, I extended it to cover 16-bit searching. Here is the same code adapted to search for a 16-bit short, ushort, or charprimitive instead of byte. This adapted method was also independently verified against its own respective unit-test methodology adapted from above.

由于上述方法在我的项目中非常成功,因此我将其扩展为涵盖 16 位搜索。下面是适用于搜索16 位 short、ushort 或 char原语而不是byte. 这种改编的方法也根据其从上面改编的各自的单元测试方法进行了独立验证。

static MyStaticClass()
{
    dbj16 = new sbyte[64];
 /* dbj16[0x3A] = 0; */
    dbj16[0x33] = 1;
    dbj16[0x05] = 2;
    dbj16[0x00] = 3;
}
readonly static sbyte[] dbj16;

public static int IndexOf(ushort* src, ushort v, int i, int c)
{
    ulong t;
    ushort* p, pEnd;

    for (p = src + i; ((long)p & 7) != 0; c--, p++)
        if (c == 0)
            return -1;
        else if (*p == v)
            return (int)(p - src);

    ulong r = ((ulong)v << 16) | v;
    r |= r << 32;

    for (pEnd = p + (c & ~7); p < pEnd; p += 4)
    {
        t = *(ulong*)p ^ r;
        t = (t - 0x0001000100010001) & ~t & 0x8000800080008000;
        if (t != 0)
        {
            i = dbj16[(t & (ulong)-(long)t) * 0x07EDD5E59A4E28C2 >> 58];
            return (int)(p - src) + i;
        }
    }

    for (pEnd += c & 7; p < pEnd; p++)
        if (*p == v)
            return (int)(p - src);

    return -1;
}

And below are the various overloads for accessing this for the remaining 16-bit primitives, plus String(last one shown):

以下是用于访问其余 16 位原语的各种重载,以及String(显示的最后一个):

public static int IndexOf(this char[] rg, char v) => IndexOf(rg, v, 0, rg.Length);
public static int IndexOf(this char[] rg, char v, int i, int c = -1)
{
    if (rg != null && (c = c < 0 ? rg.Length - i : c) > 0)
        fixed (char* p = rg)
            return IndexOf((ushort*)p, v, i, c < 0 ? rg.Length - i : c);
    return -1;
}

public static int IndexOf(this short[] rg, short v) => IndexOf(rg, v, 0, rg.Length);
public static int IndexOf(this short[] rg, short v, int i, int c = -1)
{
    if (rg != null && (c = c < 0 ? rg.Length - i : c) > 0)
        fixed (short* p = rg)
            return IndexOf((ushort*)p, (ushort)v, i, c < 0 ? rg.Length - i : c);
    return -1;
}

public static int IndexOf(this ushort[] rg, ushort v) => IndexOf(rg, v, 0, rg.Length);
public static int IndexOf(this ushort[] rg, ushort v, int i, int c = -1)
{
    if (rg != null && (c = c < 0 ? rg.Length - i : c) > 0)
        fixed (ushort* p = rg)
            return IndexOf(p, v, i, c < 0 ? rg.Length - i : c);
    return -1;
}
public static int IndexOf(String s, Char ch, int i = 0, int c = -1)
{
    if (s != null && (c = c < 0 ? s.Length - i : c) > 0)
        fixed (char* p = s)
            return IndexOf((ushort*)p, ch, i, c);
    return -1;
}

Notice that the Stringoverload is not marked as an extension method since this higher-performance replacement version of the function would never be called that way (built-in methods with the same name always take precedence over extension methods). To use it as an extension on Stringinstances, you can change the name of this method. As an example, IndexOf__(this String s,...)would cause it to appear next to the built-in method namein Intellisenselistings, perhaps a helpful reminder to opt-in. Otherwise, if you don't need extension syntax, you can just make sure you call this optimized version directly as a static method of its own class when you want to use it instead of s.IndexOf(Char ch).

请注意,String重载未标记为扩展方法,因为该函数的这种更高性能的替换版本永远不会以这种方式调用(具有相同名称的内置方法始终优先于扩展方法)。要将其用作String实例的扩展,您可以更改此方法的名称。例如,IndexOf__(this String s,...)会导致它出现在Intellisense列表中的内置方法名称旁边,这可能是选择加入的有用提醒。否则,如果您不需要扩展语法,您可以确保在您想使用它而不是.s.IndexOf(Char ch)