C# byte[] 数组模式搜索

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/283456/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-03 21:33:01  来源:igfitidea点击:

byte[] array pattern search

c#pattern-matching

提问by Anders R

Anyone know a good and effective way to search/match for a byte pattern in an byte[] array and then return the positions.

任何人都知道一种在 byte[] 数组中搜索/匹配字节模式然后返回位置的好方法。

For example

例如

byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};

byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125}

回答by Eugene Yokota

You can put the byte array into Stringand run match by IndexOf. Or you can at least reuse existing algorithmson string matching.

您可以将字节数组放入String并通过 IndexOf 运行匹配。或者您至少可以重用现有的字符串匹配算法

    [STAThread]
    static void Main(string[] args)
    {
        byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
        byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,122,22,4,7,89,76,64,12,3,5,76,8,0,6,125};
        string needle, haystack;

        unsafe 
        {
            fixed(byte * p = pattern) {
                needle = new string((SByte *) p, 0, pattern.Length);
            } // fixed

            fixed (byte * p2 = toBeSearched) 
            {
                haystack = new string((SByte *) p2, 0, toBeSearched.Length);
            } // fixed

            int i = haystack.IndexOf(needle, 0);
            System.Console.Out.WriteLine(i);
        }
    }

回答by Tamir

toBeSearched.Except(pattern) will return you differences toBeSearched.Intersect(pattern) will produce set of intersections Generally, you should look into extended methods within Linq extensions

toBeSearched.Except(pattern) 将返回差异 toBeSearched.Intersect(pattern) 将产生一组交集通常,您应该查看 Linq 扩展中的扩展方法

回答by bruno conde

My solution:

我的解决方案:

class Program
{
    public static void Main()
    {
        byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};

        byte[] toBeSearched = new byte[] { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125};

        List<int> positions = SearchBytePattern(pattern, toBeSearched);

        foreach (var item in positions)
        {
            Console.WriteLine("Pattern matched at pos {0}", item);
        }

    }

    static public List<int> SearchBytePattern(byte[] pattern, byte[] bytes)
    {
        List<int> positions = new List<int>();
        int patternLength = pattern.Length;
        int totalLength = bytes.Length;
        byte firstMatchByte = pattern[0];
        for (int i = 0; i < totalLength; i++)
        {
            if (firstMatchByte == bytes[i] && totalLength - i >= patternLength)
            {
                byte[] match = new byte[patternLength];
                Array.Copy(bytes, i, match, 0, patternLength);
                if (match.SequenceEqual<byte>(pattern))
                {
                    positions.Add(i);
                    i += patternLength - 1;
                }
            }
        }
        return positions;
    }
}

回答by Jb Evain

May I suggest something that doesn't involve creating strings, copying arrays or unsafe code:

我可以提出一些不涉及创建字符串、复制数组或不安全代码的建议:

using System;
using System.Collections.Generic;

static class ByteArrayRocks
{    
    static readonly int[] Empty = new int[0];

    public static int[] Locate (this byte[] self, byte[] candidate)
    {
        if (IsEmptyLocate(self, candidate))
            return Empty;

        var list = new List<int>();

        for (int i = 0; i < self.Length; i++)
        {
            if (!IsMatch(self, i, candidate))
                continue;

            list.Add(i);
        }

        return list.Count == 0 ? Empty : list.ToArray();
    }

    static bool IsMatch (byte[] array, int position, byte[] candidate)
    {
        if (candidate.Length > (array.Length - position))
            return false;

        for (int i = 0; i < candidate.Length; i++)
            if (array[position + i] != candidate[i])
                return false;

        return true;
    }

    static bool IsEmptyLocate (byte[] array, byte[] candidate)
    {
        return array == null
            || candidate == null
            || array.Length == 0
            || candidate.Length == 0
            || candidate.Length > array.Length;
    }

    static void Main()
    {
        var data = new byte[] { 23, 36, 43, 76, 125, 56, 34, 234, 12, 3, 5, 76, 8, 0, 6, 125, 234, 56, 211, 122, 22, 4, 7, 89, 76, 64, 12, 3, 5, 76, 8, 0, 6, 125 };
        var pattern = new byte[] { 12, 3, 5, 76, 8, 0, 6, 125 };

        foreach (var position in data.Locate(pattern))
            Console.WriteLine(position);
    }
}

Edit (by IAbstract)- moving contents of posthere since it is not an answer

编辑(由 IAbstract 提供)-帖子内容移至此处,因为它不是答案

Out of curiosity, I've created a small benchmark with different answers.

出于好奇,我创建了一个带有不同答案的小型基准。

Here are the results for a million iterations:

以下是一百万次迭代的结果:

solution [Locate]:            00:00:00.7714027
solution [FindAll]:           00:00:03.5404399
solution [SearchBytePattern]: 00:00:01.1105190
solution [MatchBytePattern]:  00:00:03.0658212

回答by Constantin

Here's my (not the most performant) solution. It relies on the fact that bytes/latin-1 conversion is lossless, which is nottrue for bytes/ASCII or bytes/UTF8 conversions.

这是我的(不是性能最好的)解决方案。它依赖于一个事实,即字节/ Latin-1的转换是无损的,这是为字节/ ASCII或字节/ UTF8转化真。

It's advantages are that It Works (tm) for any byte values (some other solutions work incorrectly with bytes 0x80-0xff) and can be extended to perform more advanced regex matching.

它的优点是它适用于任何字节值(一些其他解决方案无法正确处理字节 0x80-0xff),并且可以扩展以执行更高级的正则表达式匹配。

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class C {

  public static void Main() {
    byte[] data = {0, 100, 0, 255, 100, 0, 100, 0, 255};
    byte[] pattern = {0, 255};
    foreach (int i in FindAll(data, pattern)) {
      Console.WriteLine(i);
    }
  }

  public static IEnumerable<int> FindAll(
    byte[] haystack,
    byte[] needle
  ) {
    // bytes <-> latin-1 conversion is lossless
    Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
    string sHaystack = latin1.GetString(haystack);
    string sNeedle = latin1.GetString(needle);
    for (Match m = Regex.Match(sHaystack, Regex.Escape(sNeedle));
         m.Success; m = m.NextMatch()) {
      yield return m.Index;
    }
  }
}

回答by Davy Landman

I would use a solution which does matching by converting to a string...

我会使用通过转换为字符串来进行匹配的解决方案......

You should write a simple function implementing the Knuth-Morris-Pratt searching algorithm. This will be the fastest simple algorithm you can use to find the correct indexes.(You could use Boyer-Moorebut it will require more setup.

您应该编写一个实现Knuth-Morris-Pratt 搜索算法的简单函数。这将是您可以用来查找正确索引的最快的简单算法。(您可以使用Boyer-Moore,但它需要更多设置。

After you have optimized the algorithm, you could try to look for other kind of optimizations. But you should start with the basics.

优化算法后,您可以尝试寻找其他类型的优化。但你应该从基础开始。

For example, the current "fastest" is the Locate solution by Jb Evian.

例如,目前“最快”的是 Jb Evian 的 Locate 解决方案。

if you look at the core

如果你看核心

    for (int i = 0; i < self.Length; i++) {
            if (!IsMatch (self, i, candidate))
                    continue;

            list.Add (i);
    }

After a match of the sub algorithm, it will start to find a match at i + 1, but you already know that the first possible match would be i + candidate.Length. So if you add,

在匹配子算法之后,它将开始在 i + 1 处找到匹配项,但您已经知道第一个可能的匹配项将是 i +Candidate.Length。所以如果你添加,

i += candidate.Length -2; //  -2 instead of -1 because the i++ will add the last index

it will be a lot faster when you expect a lot of occurrences of the subset in the superset. (Bruno Conde already does this in his solution)

当您期望超集中出现大量子集时,它会快很多。(布鲁诺康德已经在他的解决方案中做到了这一点)

But this is just a half of the KNP algorithm, you should also add an extra parameter to the IsMatch method called numberOfValidMatches which would be an out parameter.

但这只是 KNP 算法的一半,您还应该向 IsMatch 方法添加一个名为 numberOfValidMatches 的额外参数,这将是一个输出参数。

this would resolve to the following:

这将解决以下问题:

int validMatches = 0;
if (!IsMatch (self, i, candidate, out validMatches))
{
    i += validMatches - 1; // -1 because the i++ will do the last one
    continue;
}

and

static bool IsMatch (byte [] array, int position, byte [] candidate, out int numberOfValidMatches)
{
    numberOfValidMatches = 0;
    if (candidate.Length > (array.Length - position))
            return false;

    for (i = 0; i < candidate.Length; i++)
    {
            if (array [position + i] != candidate [i])
                    return false;
            numberOfValidMatches++; 
    }

    return true;
}

A little bit of refactoring and you could use the numberOfValidMatches as the loop variable, and rewrite the Locate loop using a while to avoid the -2 and -1. But I just wanted to make clear how you could add the KMP algorithm.

稍微重构一下,您可以使用 numberOfValidMatches 作为循环变量,并使用 while 重写 Locate 循环以避免 -2 和 -1。但我只是想说明如何添加 KMP 算法。

回答by VVS

Use the efficient Boyer-Moore algorithm.

使用高效的Boyer-Moore 算法

It's designed to find strings withing strings but you need little imagination to project this to byte arrays.

它旨在查找带有字符串的字符串,但您需要很少的想象力将其投影到字节数组。

In general the best answer is: use any string searching algorithm that you like :).

一般来说,最好的答案是:使用您喜欢的任何字符串搜索算法:)。

回答by Alnitak

Jb Evain's answer has:

Jb Evain 的回答是:

 for (int i = 0; i < self.Length; i++) {
      if (!IsMatch (self, i, candidate))
           continue;
      list.Add (i);
 }

and then the IsMatch function first checks whether candidategoes beyond the length of the array being searched.

然后 IsMatch 函数首先检查是否candidate超出了正在搜索的数组的长度。

This would be more efficient if the forloop were coded:

如果for循环被编码,这将更有效:

     for (int i = 0, n = self.Length - candidate.Length + 1; i < n; ++i) {
          if (!IsMatch (self, i, candidate))
               continue;
          list.Add (i);
     }

at this point one couldalso eliminate the test from the start of IsMatch, so long as you contract via pre-conditions never to call it with "illegal' parameters. Note: Fixed an off-by-one bug in 2019.

在这一点上,你可以从一开始就消除测试IsMatch,只要你通过先决条件签订合同,永远不要用“非法”参数调用它。注意:修复了 2019 年的一个错误。

回答by Davy Landman

I've created a new function using the tips from my answer and the answer by Alnitak.

我使用我的答案中的提示和 Alnitak 的答案创建了一个新函数。

public static List<Int32> LocateSubset(Byte[] superSet, Byte[] subSet)
{
    if ((superSet == null) || (subSet == null))
    {
       throw new ArgumentNullException();
    }
    if ((superSet.Length < subSet.Length) || (superSet.Length == 0) || (subSet.Length == 0))
    {
        return new List<Int32>();
    }
    var result = new List<Int32>();
    Int32 currentIndex = 0;
    Int32 maxIndex =  superSet.Length - subSet.Length;
    while (currentIndex < maxIndex)
    {
         Int32 matchCount = CountMatches(superSet, currentIndex, subSet);
         if (matchCount ==  subSet.Length)
         {
            result.Add(currentIndex);
         }
         currentIndex++;
         if (matchCount > 0)
         {
            currentIndex += matchCount - 1;
         }
    }
    return result;
}

private static Int32 CountMatches(Byte[] superSet, int startIndex, Byte[] subSet)
{
    Int32 currentOffset = 0;
    while (currentOffset < subSet.Length)
    {
        if (superSet[startIndex + currentOffset] != subSet[currentOffset])
        {
            break;
        }
        currentOffset++;
    }
    return currentOffset;
}

the only part I'm not so happy about is the

我唯一不太高兴的部分是

         currentIndex++;
         if (matchCount > 0)
         {
            currentIndex += matchCount - 1;
         }

part... I would have like to use a if else to avoid the -1, but this results in a better branch prediction (although I'm not that sure it will matter that much)..

部分...我想使用 if else 来避免 -1,但这会导致更好的分支预测(尽管我不确定它会那么重要)。

回答by Davy Landman

Why make the simple difficult? This can be done in any language using for loops. Here is one in C#:

为什么要让简单的变得困难?这可以使用 for 循环在任何语言中完成。这是 C# 中的一个:

using System;
using System.Collections.Generic;

namespace BinarySearch
{
    class Program
    {
        static void Main(string[] args)
        {
            byte[] pattern = new byte[] {12,3,5,76,8,0,6,125};
            byte[] toBeSearched = new byte[] {23,36,43,76,125,56,34,234,12,3,5,76,8,0,6,125,234,56,211,
122,22,4,7,89,76,64,12,3,5,76,8,0,6,125}; List<int> occurences = findOccurences(toBeSearched, pattern); foreach(int occurence in occurences) { Console.WriteLine("Found match starting at 0-based index: " + occurence); } } static List<int> findOccurences(byte[] haystack, byte[] needle) { List<int> occurences = new List<int>(); for (int i = 0; i < haystack.Length; i++) { if (needle[0] == haystack[i]) { bool found = true; int j, k; for (j = 0, k = i; j < needle.Length; j++, k++) { if (k >= haystack.Length || needle[j] != haystack[k]) { found = false; break; } } if (found) { occurences.Add(i - 1); i = k; } } } return occurences; } } }