Efficiently merge string arrays in .NET, keeping distinct values

Date: 2020-03-06 14:50:57   Source: igfitidea

I'm using .NET 3.5. I have two string arrays, which may share one or more values:

string[] list1 = new string[] { "apple", "orange", "banana" };
string[] list2 = new string[] { "banana", "pear", "grape" };

I'd like a way to merge them into one array with no duplicate values:

{ "apple", "orange", "banana", "pear", "grape" }

I can do this with LINQ:

string[] result = list1.Concat(list2).Distinct().ToArray();

But I imagine that's not very efficient for large arrays.

Is there a better way?

Answers

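One viable option is to build a hash table that uses the values as keys, adding only keys that are not already present, and then convert the keys to an array.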

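Disclaimer: this is premature optimization. For your example arrays, use the 3.5 extension methods. Until you know you have a performance problem in this area, you should use library code.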

If you can sort the arrays, or they are already sorted by the time you reach that point in the code, you can use the following methods.

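These pull one item from each source, yield the "lowest" of the two, and then fetch a new item from the source it came from, until both sources are exhausted. If the current items from the two sources are equal, the one from the first source is yielded and both sources advance past it.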

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2)
{
    return Merge(source1, source2, Comparer<T>.Default);
}

private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
    IEnumerable<T> source2, IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source1))
        throw new ArgumentNullException("source1");
    if (Object.ReferenceEquals(null, source2))
        throw new ArgumentNullException("source2");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T>
        enumerator1 = source1.GetEnumerator(),
        enumerator2 = source2.GetEnumerator())
    {
        Boolean more1 = enumerator1.MoveNext();
        Boolean more2 = enumerator2.MoveNext();

        while (more1 && more2)
        {
            Int32 comparisonResult = comparer.Compare(
                enumerator1.Current,
                enumerator2.Current);
            if (comparisonResult < 0)
            {
                // enumerator 1 has the "lowest" item
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
            }
            else if (comparisonResult > 0)
            {
                // enumerator 2 has the "lowest" item
                yield return enumerator2.Current;
                more2 = enumerator2.MoveNext();
            }
            else
            {
                // they're considered equivalent, only yield it once
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
                more2 = enumerator2.MoveNext();
            }
        }

        // Yield rest of values from non-exhausted source
        while (more1)
        {
            yield return enumerator1.Current;
            more1 = enumerator1.MoveNext();
        }
        while (more2)
        {
            yield return enumerator2.Current;
            more2 = enumerator2.MoveNext();
        }
    }
}

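Note that if one of the sources contains duplicates, you may see duplicates in the output. If you want to remove these duplicates from already sorted lists, use the following method: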

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source)
{
    return CheapDistinct<T>(source, Comparer<T>.Default);
}

private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source,
    IComparer<T> comparer)
{
    #region Parameter Validation

    if (Object.ReferenceEquals(null, source))
        throw new ArgumentNullException("source");
    if (Object.ReferenceEquals(null, comparer))
        throw new ArgumentNullException("comparer");

    #endregion

    using (IEnumerator<T> enumerator = source.GetEnumerator())
    {
        if (enumerator.MoveNext())
        {
            T item = enumerator.Current;

            // scan until different item found, then produce
            // the previous distinct item
            while (enumerator.MoveNext())
            {
                if (comparer.Compare(item, enumerator.Current) != 0)
                {
                    yield return item;
                    item = enumerator.Current;
                }
            }

            // produce last item that is left over from above loop
            yield return item;
        }
    }
}

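Note that none of these use an internal data structure to keep a copy of the data, so they are cheap if the input is sorted. If you can't, or won't, guarantee that, you should use the 3.5 extension methods that you've already found.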

Here's example code that calls the above methods:

String[] list_1 = { "apple", "orange", "apple", "banana" };
String[] list_2 = { "banana", "pear", "grape" };

Array.Sort(list_1);
Array.Sort(list_2);

IEnumerable<String> items = Merge(
    CheapDistinct(list_1),
    CheapDistinct(list_2));
foreach (String item in items)
    Console.Out.WriteLine(item);

You don't know which approach is faster until you measure it. The LINQ way is elegant and easy to understand.

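Another way is to implement the set as a hash array (Dictionary) and add all the elements of both arrays to it. Then use set.Keys.ToArray() to create the resulting array.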
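The Dictionary-as-set idea described above could look roughly like the sketch below. The class name `DictionaryMergeSketch` and the helper `MergeDistinct` are made up for illustration and are not part of any answer here. The sketch assumes neither array contains null (a Dictionary rejects null keys) and makes no promise about the order of the returned values, since the enumeration order of `Dictionary.Keys` is not specified.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DictionaryMergeSketch
{
    // Hypothetical helper: merges two arrays, keeping one copy of each value.
    // Assumes no element is null, because Dictionary does not accept null keys.
    public static string[] MergeDistinct(string[] first, string[] second)
    {
        // Only the keys matter; the Boolean value is a placeholder.
        Dictionary<string, bool> set =
            new Dictionary<string, bool>(first.Length + second.Length);

        foreach (string item in first)
            set[item] = true;   // the indexer overwrites, so repeats are harmless
        foreach (string item in second)
            set[item] = true;

        // The order of Keys is unspecified; do not rely on it.
        return set.Keys.ToArray();
    }

    public static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        string[] merged = MergeDistinct(list1, list2);
        Console.WriteLine(merged.Length); // 5
    }
}
```

On .NET 3.5 and later, a HashSet<string> expresses the same intent without the placeholder values.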

.NET 3.5 introduced the HashSet class, which can do this:

IEnumerable<string> mergedDistinctList = new HashSet<string>(list1).Union(list2);

I'm not sure about the performance, but it should beat the LINQ example you gave.

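EDIT:
I stand corrected. The lazy implementation of Concat and Distinct has a key memory and speed advantage. Concat/Distinct is about 10% faster and avoids making multiple copies of the data.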

I confirmed this with code:

Setting up arrays of 3000000 strings overlapping by 300000
Starting Hashset...
HashSet: 00:00:02.8237616
Starting Concat/Distinct...
Concat/Distinct: 00:00:02.5629681

is the output of:

// Assumes: using System; using System.Collections.Generic;
//          using System.Diagnostics; using System.Linq;
int num = 3000000;
int num10Pct = num / 10;

Console.WriteLine(String.Format("Setting up arrays of {0} strings overlapping by {1}", num, num10Pct));
string[] list1 = Enumerable.Range(1, num).Select((a) => a.ToString()).ToArray();
// Note: the second argument of Enumerable.Range is a count, not an upper bound,
// so list2 holds num + num10Pct strings ("2700000" through "5999999").
string[] list2 = Enumerable.Range(num - num10Pct, num + num10Pct).Select((a) => a.ToString()).ToArray();

Console.WriteLine("Starting Hashset...");
Stopwatch sw = new Stopwatch();
sw.Start();
string[] merged = new HashSet<string>(list1).Union(list2).ToArray();
sw.Stop();
Console.WriteLine("HashSet: " + sw.Elapsed);

Console.WriteLine("Starting Concat/Distinct...");
sw.Reset();
sw.Start();
string[] merged2 = list1.Concat(list2).Distinct().ToArray();
sw.Stop();
Console.WriteLine("Concat/Distinct: " + sw.Elapsed);

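Why do you imagine that it would be inefficient? As far as I'm aware, both Concat and Distinct are evaluated lazily, with Distinct using a HashSet behind the scenes to keep track of the elements that have already been returned.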

I'm not sure how you'd manage to make it more efficient than that in a general way :)
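The lazy evaluation mentioned above can be observed with a small experiment. The sketch below is only an illustration: the class `LazyDistinctDemo`, the counter `Pulled`, and the wrapper `Counted` are hypothetical names introduced here to count how many source elements get consumed. It suggests that building the query consumes nothing, and that asking for the first result consumes a single source element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDistinctDemo
{
    // Number of source elements consumed so far (illustrative counter).
    public static int Pulled;

    // Hypothetical wrapper: passes elements through and counts each one as it is consumed.
    public static IEnumerable<string> Counted(IEnumerable<string> source)
    {
        foreach (string item in source)
        {
            Pulled++;
            yield return item;
        }
    }

    // Returns how many source elements were consumed to produce the first result.
    public static int PulledForFirstResult(string[] first, string[] second)
    {
        Pulled = 0;
        // Building the query does not enumerate anything yet.
        IEnumerable<string> query = Counted(first).Concat(Counted(second)).Distinct();
        using (IEnumerator<string> e = query.GetEnumerator())
        {
            e.MoveNext(); // ask for the first distinct element only
        }
        return Pulled;
    }

    public static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        Console.WriteLine(PulledForFirstResult(list1, list2)); // 1

        Pulled = 0;
        string[] all = Counted(list1).Concat(Counted(list2)).Distinct().ToArray();
        // Each of the six source elements is read once to produce five results.
        Console.WriteLine(all.Length + " results from " + Pulled + " source elements");
    }
}
```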

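EDIT: Distinct actually uses Set (an internal class) instead of HashSet, but the gist is still correct. This is a really good example of just how neat LINQ is. The simplest answer is pretty much as efficient as you can get without more domain knowledge.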

The effect is equivalent to:

public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first, IEnumerable<T> second)
{
    HashSet<T> returned = new HashSet<T>();
    foreach (T element in first)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
    foreach (T element in second)
    {
        if (returned.Add(element))
        {
            yield return element;
        }
    }
}

string[] result = list1.Union(list2).ToArray();

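From msdn: "This method excludes duplicates from the return set. This is different behavior to the Concat(TSource) method, which returns all the elements in the input sequences including duplicates."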
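To make the difference in the quote concrete, here is a minimal sketch that runs both operators on the arrays from the question. The class name `UnionVersusConcat` and its two helper methods are illustrative only. Union enumerates the first sequence and then the second, yielding each element not already yielded, which is why the order below matches the desired output in the question.

```csharp
using System;
using System.Linq;

public static class UnionVersusConcat
{
    // Keeps every element, including values present in both arrays.
    public static string[] ConcatAll(string[] first, string[] second)
    {
        return first.Concat(second).ToArray();
    }

    // Keeps the first occurrence of each value, in order of first appearance.
    public static string[] UnionDistinct(string[] first, string[] second)
    {
        return first.Union(second).ToArray();
    }

    public static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        string[] concatenated = ConcatAll(list1, list2);
        string[] united = UnionDistinct(list1, list2);

        Console.WriteLine(concatenated.Length);        // 6: "banana" appears twice
        Console.WriteLine(united.Length);              // 5: "banana" appears once
        Console.WriteLine(string.Join(", ", united));  // apple, orange, banana, pear, grape
    }
}
```

Both operators also have overloads that take an IEqualityComparer<T>, for example StringComparer.OrdinalIgnoreCase, if values that differ only by case should count as duplicates.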
