Efficiently merge string arrays in .NET, keeping distinct values

Disclaimer: this page is a translated mirror of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/146358/

Date: 2020-08-03 15:33:59  Source: igfitidea

Asked by Jason Anderson

I'm using .NET 3.5. I have two string arrays, which may share one or more values:

string[] list1 = new string[] { "apple", "orange", "banana" };
string[] list2 = new string[] { "banana", "pear", "grape" };

I'd like a way to merge them into one array with no duplicate values:

{ "apple", "orange", "banana", "pear", "grape" }

I can do this with LINQ:

string[] result = list1.Concat(list2).Distinct().ToArray();

but I imagine that's not very efficient for large arrays.

Is there a better way?

Accepted answer by Wonko

string[] result = list1.Union(list2).ToArray();

from msdn: "This method excludes duplicates from the return set. This is different behavior to the Concat(TSource) method, which returns all the elements in the input sequences including duplicates."

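A quick check of the result on the arrays from the question (a sketch only; names as in the question):

```csharp
using System;
using System.Linq;

class UnionExample
{
    static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        // Union removes duplicates across both sequences,
        // preserving the order of first occurrence.
        string[] result = list1.Union(list2).ToArray();

        Console.WriteLine(string.Join(", ", result));
        // apple, orange, banana, pear, grape
    }
}
```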
Answered by petr k.

Creating a hashtable with your values as keys (adding only those not already present) and then converting the keys to an array could be a viable solution.

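A minimal sketch of that idea (the `MergeDistinct` helper is hypothetical, not from the answer; a separate list is used for the output because `Dictionary` does not guarantee key order):

```csharp
using System;
using System.Collections.Generic;

class HashtableMerge
{
    // Hypothetical helper: merges two arrays via dictionary keys,
    // keeping only the first occurrence of each value.
    public static string[] MergeDistinct(string[] first, string[] second)
    {
        var seen = new Dictionary<string, bool>();
        var result = new List<string>();

        foreach (string s in first)
            if (!seen.ContainsKey(s)) { seen.Add(s, true); result.Add(s); }

        foreach (string s in second)
            if (!seen.ContainsKey(s)) { seen.Add(s, true); result.Add(s); }

        return result.ToArray();
    }

    static void Main()
    {
        string[] merged = MergeDistinct(
            new[] { "apple", "orange", "banana" },
            new[] { "banana", "pear", "grape" });
        Console.WriteLine(string.Join(", ", merged));
        // apple, orange, banana, pear, grape
    }
}
```

In .NET 3.5 and later, `HashSet<string>` serves the same purpose more directly; the dictionary version is what was available before it.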
Answered by Lasse V. Karlsen

Disclaimer: this is premature optimization. For arrays the size of your example, use the 3.5 extension methods. Until you know you have a performance problem in this area, you should use library code.


If you can sort the arrays, or they're sorted when you get to that point in the code, you can use the following methods.

These will pull one item from each source and produce the "lowest" one, then fetch a new item from the corresponding source, until both sources are exhausted. When the current items fetched from the two sources compare equal, it produces the one from the first source and skips the item in both sources.

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2)
{
    return Merge(source1, source2, Comparer<T>.Default);
}

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2, IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source1))
        throw new ArgumentNullException("source1");
    if (Object.ReferenceEquals(null, source2))
        throw new ArgumentNullException("source2");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T>
        enumerator1 = source1.GetEnumerator(),
        enumerator2 = source2.GetEnumerator())
    {
        Boolean more1 = enumerator1.MoveNext();
        Boolean more2 = enumerator2.MoveNext();

        while (more1 && more2)
        {
            Int32 comparisonResult = comparer.Compare(
                enumerator1.Current,
                enumerator2.Current);
            if (comparisonResult < 0)
            {
                // enumerator 1 has the "lowest" item
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
            }
            else if (comparisonResult > 0)
            {
                // enumerator 2 has the "lowest" item
                yield return enumerator2.Current;
                more2 = enumerator2.MoveNext();
            }
            else
            {
                // they're considered equivalent, only yield it once
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
                more2 = enumerator2.MoveNext();
            }
        }

        // Yield rest of values from non-exhausted source
        while (more1)
        {
            yield return enumerator1.Current;
            more1 = enumerator1.MoveNext();
        }
        while (more2)
        {
            yield return enumerator2.Current;
            more2 = enumerator2.MoveNext();
        }
    }
}

Note that if one of the sources contains duplicates, you might see duplicates in the output. If you want to remove these duplicates in the already sorted lists, use the following method:

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source)
{
    return CheapDistinct<T>(source, Comparer<T>.Default);
}

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source,
    IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source))
        throw new ArgumentNullException("source");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T> enumerator = source.GetEnumerator())
    {
        if (enumerator.MoveNext())
        {
            T item = enumerator.Current;

            // scan until different item found, then produce
            // the previous distinct item
            while (enumerator.MoveNext())
            {
                if (comparer.Compare(item, enumerator.Current) != 0)
                {
                    yield return item;
                    item = enumerator.Current;
                }
            }

            // produce last item that is left over from above loop
            yield return item;
        }
    }
}

Note that none of these will internally use a data structure to keep a copy of the data, so they will be cheap if the input is sorted. If you can't, or won't, guarantee that, you should use the 3.5 extension methods that you've already found.

Here's example code that calls the above methods:

String[] list_1 = { "apple", "orange", "apple", "banana" };
String[] list_2 = { "banana", "pear", "grape" };

Array.Sort(list_1);
Array.Sort(list_2);

IEnumerable<String> items = Merge(
    CheapDistinct(list_1),
    CheapDistinct(list_2));
foreach (String item in items)
    Console.Out.WriteLine(item);

Answered by danatel

You don't know which approach is faster until you measure it. The LINQ way is elegant and easy to understand.

Another way is to implement a set as a hash table (Dictionary) and add all the elements of both arrays to the set. Then use the set's Keys.ToArray() method to create the resulting array.

Answered by TheSoftwareJedi

.NET 3.5 introduced the HashSet<T> class, which can do this:

IEnumerable<string> mergedDistinctList = new HashSet<string>(list1).Union(list2);

Not sure of the performance, but it should beat the LINQ example you gave.

EDIT: I stand corrected. The lazy implementations of Concat and Distinct have a key memory AND speed advantage. Concat/Distinct is about 10% faster and avoids making multiple copies of the data.

I confirmed this with code:

Setting up arrays of 3000000 strings overlapping by 300000
Starting Hashset...
HashSet: 00:00:02.8237616
Starting Concat/Distinct...
Concat/Distinct: 00:00:02.5629681

is the output of:

        int num = 3000000;
        int num10Pct = (int)(num / 10);

        Console.WriteLine(String.Format("Setting up arrays of {0} strings overlapping by {1}", num, num10Pct));
        string[] list1 = Enumerable.Range(1, num).Select((a) => a.ToString()).ToArray();
        string[] list2 = Enumerable.Range(num - num10Pct, num + num10Pct).Select((a) => a.ToString()).ToArray();

        Console.WriteLine("Starting Hashset...");
        Stopwatch sw = new Stopwatch();
        sw.Start();
        string[] merged = new HashSet<string>(list1).Union(list2).ToArray();
        sw.Stop();
        Console.WriteLine("HashSet: " + sw.Elapsed);

        Console.WriteLine("Starting Concat/Distinct...");
        sw.Reset();
        sw.Start();
        string[] merged2 = list1.Concat(list2).Distinct().ToArray();
        sw.Stop();
        Console.WriteLine("Concat/Distinct: " + sw.Elapsed);

Answered by Jon Skeet

Why do you imagine that it would be inefficient? As far as I'm aware, both Concat and Distinct are evaluated lazily, using a HashSet behind the scenes for Distinct to keep track of the elements which have already been returned.

I'm not sure how you'd manage to make it more efficient than that in a general way :)

EDIT: Distinct actually uses Set<T> (an internal class) rather than HashSet<T>, but the gist is still correct. This is a really good example of just how neat LINQ is: the simplest answer is pretty much as efficient as you can achieve without more domain knowledge.

The effect is the equivalent of:

public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    HashSet<T> returned = new HashSet<T>();
    foreach (T element in first)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
    foreach (T element in second)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
}