VB.NET: What is the fastest way to compare two byte arrays?

Note: this page is a translation of a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/627742/

Date: 2020-09-09 14:06:55  Source: igfitidea

What is the fastest way to compare two byte arrays?

Tags: vb.net, arrays, compare

Asked by Middletone

I am trying to compare two long byte arrays in VB.NET and have run into a snag. Comparing two 50-megabyte files takes almost two minutes, so I'm clearly doing something wrong. I'm on an x64 machine with plenty of memory, so that is not the issue. Here is the code that I'm using at the moment and would like to change.

_Bytes and item.Bytes are the two arrays to compare, and they are already the same length.

For Each B In item.Bytes
   If B <> _Bytes(I) Then
        Mismatch = True
        Exit For
   End If
   I += 1
Next

I need to compare, as fast as possible, files that are potentially hundreds of megabytes and possibly even a gigabyte or two. Are there any suggestions or algorithms that would do this faster?

item.Bytes is an object taken from the database/filesystem that is returned for comparison because its byte length matches that of the item the user wants to add. By comparing the two arrays I can determine whether the user has added something new to the DB; if not, I can just map them to the existing file and not waste hard disk space.

[Update]

I converted the arrays to local variables of type Byte() and then ran the same comparison with the same code, and it ran in about one second (I still have to benchmark it and compare it with the other approaches). However, if you do the same thing with local variables and use a generic array, it becomes massively slower. I'm not sure why, but it raises a lot more questions for me about the use of arrays.

Answered by Jon Skeet

What is the _Bytes(I) call doing? It's not loading the file each time, is it? Even with buffering, that would be bad news!

There will be plenty of ways to micro-optimise this, such as comparing a long at a time or potentially using unsafe code, but I'd just concentrate on getting reasonable performance first. Clearly there's something very odd going on.

I suggest you extract the comparison code into a separate function which takes two byte arrays. That way you know you won't be doing anything odd. I'd also use a simple For loop rather than For Each in this case - it'll be simpler. Oh, and check whether the lengths are correct first :)

EDIT: Here's the code (untested, but simple enough) that I'd use. It's in C# for the minute - I'll convert it in a sec:

public static bool Equals(byte[] first, byte[] second)
{
    if (first == second)
    {
        return true;
    }
    if (first == null || second == null)
    {
        return false;
    }
    if (first.Length != second.Length)
    {
        return false;
    }
    for (int i=0; i < first.Length; i++)
    {
        if (first[i] != second[i])                
        {
            return false;
        }
    }
    return true;
}

EDIT: And here's the VB:


Public Shared Function ArraysEqual(ByVal first As Byte(), _
                                   ByVal second As Byte()) As Boolean
    If (first Is second) Then
        Return True
    End If

    If (first Is Nothing OrElse second Is Nothing) Then
        Return False
    End If
    If  (first.Length <> second.Length) Then
         Return False
    End If

    For i as Integer = 0 To first.Length - 1
        If (first(i) <> second(i)) Then
            Return False
        End If
    Next i
    Return True
End Function

Answered by sfossen

If you don't need to know which byte differs, compare 64-bit integers, which gives you 8 bytes at once. Actually, you can still work out which byte is wrong once you've isolated the mismatch to a group of 8.

Use BinaryReader:

saveTime  = binReader.ReadInt32()

Or for arrays of ints:


Dim count As Integer = binReader.Read(testArray, 0, 3)
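The "8 bytes at once" idea can also be applied to arrays that are already in memory. The following is only a minimal sketch, not the answerer's code: the module and function names (WideCompare, ArraysEqual64) are made up for illustration, and it assumes BitConverter.ToInt64 is an acceptable way to read each 8-byte group. Whether it actually beats a plain byte loop would need to be measured.

```vbnet
Imports System

Module WideCompare
    ' Hypothetical helper: compares eight bytes at a time, then the leftover tail.
    Public Function ArraysEqual64(first As Byte(), second As Byte()) As Boolean
        If first Is second Then Return True
        If first Is Nothing OrElse second Is Nothing Then Return False
        If first.Length <> second.Length Then Return False

        Dim i As Integer = 0
        Dim lastFullBlock As Integer = first.Length - 8

        ' Compare full 8-byte groups as 64-bit integers.
        While i <= lastFullBlock
            If BitConverter.ToInt64(first, i) <> BitConverter.ToInt64(second, i) Then
                Return False
            End If
            i += 8
        End While

        ' Compare the remaining 0 to 7 bytes individually.
        While i < first.Length
            If first(i) <> second(i) Then Return False
            i += 1
        End While

        Return True
    End Function

    Sub Main()
        Dim a(19) As Byte                         ' 20 zero bytes
        Dim b(19) As Byte
        Console.WriteLine(ArraysEqual64(a, b))    ' True
        b(19) = 1                                 ' differs only in the tail bytes
        Console.WriteLine(ArraysEqual64(a, b))    ' False
    End Sub
End Module
```

If the position of the mismatch matters, a byte-by-byte pass over the 8-byte group that failed is enough to find it.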

Answered by danobrega

The fastest way to compare two byte arrays of equal size is to use interop. Run the following code in a console application:

using System;
using System.Runtime.InteropServices;
using System.Security;

namespace CompareByteArray
{
    class Program
    {
        static void Main(string[] args)
        {
            const int SIZE = 100000;
            const int TEST_COUNT = 100;

            byte[] arrayA = new byte[SIZE];
            byte[] arrayB = new byte[SIZE];

            for (int i = 0; i < SIZE; i++)
            {
                arrayA[i] = 0x22;
                arrayB[i] = 0x22;
            }

            {
                DateTime before = DateTime.Now;
                for (int i = 0; i < TEST_COUNT; i++)
                {
                    int result = MemCmp_Safe(arrayA, arrayB, (UIntPtr)SIZE);

                    if (result != 0) throw new Exception();
                }
                DateTime after = DateTime.Now;

                Console.WriteLine("MemCmp_Safe: {0}", after - before);
            }

            {
                DateTime before = DateTime.Now;
                for (int i = 0; i < TEST_COUNT; i++)
                {
                    int result = MemCmp_Unsafe(arrayA, arrayB, (UIntPtr)SIZE);

                    if (result != 0) throw new Exception();
                }
                DateTime after = DateTime.Now;

                Console.WriteLine("MemCmp_Unsafe: {0}", after - before);
            }


            {
                DateTime before = DateTime.Now;
                for (int i = 0; i < TEST_COUNT; i++)
                {
                    int result = MemCmp_Pure(arrayA, arrayB, SIZE);

                    if (result != 0) throw new Exception();
                }
                DateTime after = DateTime.Now;

                Console.WriteLine("MemCmp_Pure: {0}", after - before);
            }
            return;
        }

        [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint="memcmp", ExactSpelling=true)]
        [SuppressUnmanagedCodeSecurity]
        static extern int memcmp_1(byte[] b1, byte[] b2, UIntPtr count);

        [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint = "memcmp", ExactSpelling = true)]
        [SuppressUnmanagedCodeSecurity]
        static extern unsafe int memcmp_2(byte* b1, byte* b2, UIntPtr count);

        public static int MemCmp_Safe(byte[] a, byte[] b, UIntPtr count)
        {
            return memcmp_1(a, b, count);
        }

        public unsafe static int MemCmp_Unsafe(byte[] a, byte[] b, UIntPtr count)
        {
            fixed(byte* p_a = a)
            {
                fixed (byte* p_b = b)
                {
                    return memcmp_2(p_a, p_b, count);
                }
            }
        }

        public static int MemCmp_Pure(byte[] a, byte[] b, int count)
        {
            int result = 0;
            for (int i = 0; i < count && result == 0; i += 1)
            {
                result = a[i] - b[i];
            }

            return result;
        }

    }
}

Answered by user4014848

Better approach... If you are just trying to see whether the two are different, generate a hash of each byte array as a string and compare the strings. MD5 should work fine and is pretty efficient. Note that computing a hash still reads every byte of the array, so this saves time mainly when a hash can be computed once, stored, and reused for later comparisons.
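A minimal sketch of the hash comparison, assuming the arrays are already in memory. The names HashCompare and Md5Hex are made up for illustration; MD5.Create, ComputeHash and BitConverter.ToString are the standard .NET calls. Equal hashes do not strictly prove that the data is equal, so a byte-by-byte check may still be wanted when the hashes match.

```vbnet
Imports System
Imports System.Security.Cryptography

Module HashCompare
    ' Hypothetical helper: returns the MD5 hash of the data as a hex string.
    Public Function Md5Hex(data As Byte()) As String
        Using hasher As MD5 = MD5.Create()
            Return BitConverter.ToString(hasher.ComputeHash(data))
        End Using
    End Function

    Sub Main()
        Dim a As Byte() = {1, 2, 3, 4}
        Dim b As Byte() = {1, 2, 3, 4}
        Dim c As Byte() = {1, 2, 3, 5}

        Console.WriteLine(Md5Hex(a) = Md5Hex(b))   ' True
        Console.WriteLine(Md5Hex(a) = Md5Hex(c))   ' False
    End Sub
End Module
```

In the scenario from the question, the hash string could be stored next to each item in the database so that only the new upload has to be hashed.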


Answered by Clayton

I see two things that might help:

First, rather than always accessing the second array as item.Bytes, use a local variable to point directly at the array. That is, before starting the loop, do something like this:

 array2 = item.Bytes

That will save the overhead of dereferencing from the object each time you want a byte. That could be expensive in Visual Basic, especially if there's a getter method on that property.

Also, use a "definite loop" instead of "For Each". You already know the length of the arrays, so just code the loop using that value. This will avoid the overhead of treating the array as a collection. The loop would look something like this:

' VB.NET arrays are zero-based, so the loop runs from 0 to Length - 1.
For i As Integer = 0 To array1.Length - 1
    If array1(i) <> array2(i) Then
        Exit For
    End If
Next
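To make the cost of going through the property visible, here is a small sketch. The StoredItem class and its GetterCalls counter are hypothetical and exist only to count how often the getter runs. A real getter might do far more work than return a field, which is the case this answer is warning about.

```vbnet
Imports System

Module PropertyCost
    ' Hypothetical class: counts how many times the Bytes getter is invoked.
    Class StoredItem
        Private ReadOnly _data As Byte()
        Public GetterCalls As Integer = 0

        Public Sub New(data As Byte())
            _data = data
        End Sub

        Public ReadOnly Property Bytes As Byte()
            Get
                GetterCalls += 1
                Return _data
            End Get
        End Property
    End Class

    Sub Main()
        Dim item As New StoredItem(New Byte(999) {})   ' 1000 zero bytes
        Dim other(999) As Byte

        ' Indexing through the property runs the getter once per element.
        For i As Integer = 0 To other.Length - 1
            If item.Bytes(i) <> other(i) Then Exit For
        Next
        Console.WriteLine(item.GetterCalls)            ' 1000

        ' Caching the array in a local variable runs the getter only once.
        item.GetterCalls = 0
        Dim array2 As Byte() = item.Bytes
        For i As Integer = 0 To other.Length - 1
            If array2(i) <> other(i) Then Exit For
        Next
        Console.WriteLine(item.GetterCalls)            ' 1
    End Sub
End Module
```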

Answered by Sergio Acosta

Not strictly related to the comparison algorithm:


Are you sure your bottleneck is not related to the memory available and the time used to load the byte arrays? Loading two 2 GB byte arrays just to compare them could bring most machines to their knees. If the program design allows, try using streams to read smaller chunks instead.
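As a rough sketch of the chunked approach, assuming both sources can be opened as System.IO.Stream objects: the names StreamCompare, ReadFully and StreamsEqual, and the 64 KB default chunk size, are illustrative choices and do not come from the answer. Stream.Read may return fewer bytes than requested, so the helper keeps reading until the buffer is full or the stream ends.

```vbnet
Imports System
Imports System.IO

Module StreamCompare
    ' Fills the buffer as far as the stream allows; returns the number of bytes read.
    Private Function ReadFully(source As Stream, buffer As Byte()) As Integer
        Dim total As Integer = 0
        While total < buffer.Length
            Dim n As Integer = source.Read(buffer, total, buffer.Length - total)
            If n = 0 Then Exit While      ' end of stream
            total += n
        End While
        Return total
    End Function

    ' Compares two streams chunk by chunk without loading either one completely.
    Public Function StreamsEqual(a As Stream, b As Stream,
                                 Optional chunkSize As Integer = 65536) As Boolean
        Dim bufA(chunkSize - 1) As Byte
        Dim bufB(chunkSize - 1) As Byte
        Do
            Dim readA As Integer = ReadFully(a, bufA)
            Dim readB As Integer = ReadFully(b, bufB)
            If readA <> readB Then Return False   ' different lengths
            If readA = 0 Then Return True         ' both ended with no mismatch
            For i As Integer = 0 To readA - 1
                If bufA(i) <> bufB(i) Then Return False
            Next
        Loop
    End Function

    Sub Main()
        Dim data1(99999) As Byte
        Dim data2(99999) As Byte
        data2(99999) = 1                          ' differs in the last byte only

        Using s1 As New MemoryStream(data1), s2 As New MemoryStream(data1)
            Console.WriteLine(StreamsEqual(s1, s2))   ' True
        End Using
        Using s1 As New MemoryStream(data1), s2 As New MemoryStream(data2)
            Console.WriteLine(StreamsEqual(s1, s2))   ' False
        End Using
    End Sub
End Module
```

With files, the two MemoryStream objects would be replaced by FileStream instances, and memory use stays at two chunk buffers regardless of file size.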
